Hi Sebastian,
Yes... my mistake... you are right. Every partition will create a different
file.
Shridhar
On Fri, Mar 25, 2016 at 6:58 PM, Sebastian Piu wrote:
I don't understand the race condition comment you mention. Have you seen
this somewhere? That timestamp will be the same on each worker for that
RDD, and each worker is handling a different partition, which will be
reflected in the filename, so no data will be overwritten. In
fact this is
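Sebastian's point above can be sketched in plain Python (illustrative only, not the Spark API; the helper name is hypothetical): the batch timestamp fixes one directory shared by all workers, and the partition index fixes a distinct file inside it, so concurrent writes cannot collide.

```python
from datetime import datetime, timezone

def part_file_paths(base_dir, batch_time_ms, num_partitions):
    """Output path each partition of one RDD would write to.

    The batch timestamp is the same on every worker, so all partitions
    share one folder; the partition index differs, so no two partitions
    ever share a file.
    """
    ts = datetime.fromtimestamp(batch_time_ms / 1000, tz=timezone.utc)
    folder = f"{base_dir}/{ts:%Y%m%d-%H%M%S}"
    return [f"{folder}/part-r-{i:05d}" for i in range(num_partitions)]

paths = part_file_paths("/data/stream", 1458928680000, 3)
```

All three paths share one timestamped folder, but each ends in a different part-r-NNNNN name, which is why a per-batch timestamp plus the partition index is collision-free.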
Hi Surendra,
Thanks for your suggestion. I tried MultipleOutputFormat and
MultipleTextOutputFormat, but the result was the same: the folder
always contains a single file, part-r-0, and this file gets overwritten
every time.
This is how I am invoking the API:
Hi Sebastian,
Thanks for your reply. I think using rdd.timestamp may cause one issue: if
the parallelized RDD is executed on more than one worker, there may be a
race condition when rdd.timestamp is used, which may also result in the
file being overwritten.
Shridhar
On Wed, Mar 23, 2016 at 12:32 AM,
Hi Vetal,
You may try MultipleOutputFormat instead of TextOutputFormat in
saveAsNewAPIHadoopFile().
Regards,
Surendra M
-- Surendra Manchikanti
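The idea behind this suggestion can be sketched in plain Python (an illustration of the naming scheme, not Hadoop's actual Java API; the function name is hypothetical): MultipleTextOutputFormat derives the output file name from each record's key, so keying records by their minute bucket routes each minute to its own folder.

```python
def output_file_for(minute_key, partition):
    # Mirrors what generateFileNameForKeyValue() does in Hadoop's
    # MultipleTextOutputFormat: build the file name from the record key.
    # Using the minute bucket as the key gives one folder per minute with
    # one part file per partition, instead of a single part-r-* file
    # being rewritten on every batch.
    return f"{minute_key}/part-{partition:05d}"
```

For example, `output_file_for("2016-03-22_1726", 0)` yields a path under the `2016-03-22_1726` folder, while records keyed to the next minute land in a different folder.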
On Tue, Mar 22, 2016 at 10:26 AM, vetal king wrote:
As you said, create a folder for each different minute; you can also use
rdd.time as the timestamp.
You might also want to have a look at the window function for the batching.
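A minimal sketch of the folder naming (assuming rdd.time is a millisecond epoch timestamp; the helper below is illustrative plain Python, not Spark):

```python
from datetime import datetime, timezone

def minute_folder(base_dir, batch_time_ms):
    # Truncate the batch time to the minute, so every five-second batch
    # that fires inside the same minute writes into the same folder.
    minute_ms = (batch_time_ms // 60_000) * 60_000
    ts = datetime.fromtimestamp(minute_ms / 1000, tz=timezone.utc)
    return f"{base_dir}/{ts:%Y-%m-%d_%H%M}"
```

Two batches five seconds apart within the same minute map to one folder, while a batch in the next minute maps to a new one.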
On Tue, 22 Mar 2016, 17:43 vetal king wrote:
Hi Cody,
Thanks for your reply.
A five-second batch and a one-minute publishing interval is just a
representative example. What we want is to group data over a certain
frequency, and that frequency is configurable. One way we think this can
be achieved is that one "directory" would be created per this frequency.
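The configurable grouping described above can be sketched as a bucket function where the frequency is a parameter (illustrative plain Python; the names are hypothetical, not Spark API):

```python
def bucket_dir(base_dir, record_time_ms, freq_seconds):
    # One directory per configurable interval: all records whose
    # timestamps fall inside the same freq_seconds window share a
    # directory, regardless of which five-second batch delivered them.
    bucket = record_time_ms // (freq_seconds * 1000)
    return f"{base_dir}/bucket-{bucket}"
```

With `freq_seconds=60` this reproduces the one-folder-per-minute case, and changing the parameter changes the grouping without touching the batch interval.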
If you want one-minute granularity, why not use a one-minute batch time?
Also, HDFS is not a great match for this kind of thing, because of the
small-files issue.
On Tue, Mar 22, 2016 at 12:26 PM, vetal king wrote:
We are using Spark 1.4 for Spark Streaming, and Kafka is the data source
for the Spark stream.
Records are published to Kafka every second. Our requirement is to store
the records published to Kafka in a single folder per minute. The stream
will read records every five seconds. For instance, records published