Re: Problem using saveAsNewAPIHadoopFile API

2016-03-25 Thread vetal king
Hi Sebastian, Yes... my mistake... you are right. Every partition will create a different file. Shridhar On Fri, Mar 25, 2016 at 6:58 PM, Sebastian Piu wrote: > I don't understand the race condition comment you mention. > Have you seen this somewhere? That

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-25 Thread Sebastian Piu
I don't understand the race condition comment you mention. Have you seen this somewhere? That timestamp will be the same on each worker for that rdd, and each worker is handling a different partition, which will be reflected in the filename, so no data will be overwritten. In fact this is
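
[Editor's note] A small local sketch, not from the thread, illustrating Sebastian's point: the directory name is decided once on the driver, and each partition writes its own part file under it, so tasks never overwrite one another. Paths and names here are placeholders.

    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("part-files-demo").setMaster("local[3]"))
    val rdd = sc.parallelize(Seq("a", "b", "c", "d", "e", "f"), numSlices = 3)

    // One directory per save; the directory name is computed once, on the driver.
    rdd.map(v => (NullWritable.get(), new Text(v)))
      .saveAsNewAPIHadoopFile(
        s"/tmp/demo/${System.currentTimeMillis()}",
        classOf[NullWritable], classOf[Text],
        classOf[TextOutputFormat[NullWritable, Text]])
    // The directory ends up with part-r-00000, part-r-00001 and part-r-00002,
    // one file per partition, plus a _SUCCESS marker.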

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-25 Thread vetal king
Hi Surendra, Thanks for your suggestion. I tried MultipleOutputFormat and MultipleTextOutputFormat. But the result was the same. The folder would always contain a single file part-r-0, and this file gets overwritten every time. This is how I am invoking the API

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-25 Thread vetal king
Hi Sebastian, Thanks for your reply. I think using rdd.timestamp may cause one issue. If the parallelized RDD is executed on more than one worker, there may be a race condition if rdd.timestamp is used. It may also result in overwriting the file. Shridhar On Wed, Mar 23, 2016 at 12:32 AM,

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-23 Thread Surendra , Manchikanti
Hi Vetal, You may try MultipleOutputFormat instead of TextOutputFormat in saveAsNewAPIHadoopFile(). Regards, Surendra M -- Surendra Manchikanti On Tue, Mar 22, 2016 at 10:26 AM, vetal king wrote: > We are using Spark 1.4 for Spark Streaming. Kafka is the data source for
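
[Editor's note] A hedged sketch of the MultipleOutputFormat idea, not from the thread. MultipleTextOutputFormat lives in the old "mapred" API, so it pairs with saveAsHadoopFile rather than saveAsNewAPIHadoopFile; the class name, key scheme, and output path below are placeholders.

    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

    // Route each record into a sub-directory named after its key (e.g. a minute string).
    class KeyAsDirectoryOutputFormat extends MultipleTextOutputFormat[Any, Any] {
      override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
        s"${key.toString}/$name"
      // Suppress the key in the written line; only the value is emitted.
      override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
    }

    // Usage sketch with an RDD of (minuteString, record) pairs and a placeholder base path:
    // keyedRdd.saveAsHadoopFile("/data/out", classOf[String], classOf[String],
    //   classOf[KeyAsDirectoryOutputFormat])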

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-22 Thread Sebastian Piu
As you said, create a folder for each minute; you can also use rdd.time as the timestamp. You might also want to have a look at the window function for the batching. On Tue, 22 Mar 2016, 17:43 vetal king, wrote: > Hi Cody, > > Thanks for your reply. > > Five
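
[Editor's note] A minimal sketch of the window suggestion, not from the thread; "stream" and the output path are placeholders. It keeps the 5-second batches but groups one minute of data before writing, using the window's batch time as the directory name.

    import org.apache.spark.streaming.Minutes

    // Window of one minute, sliding every minute, over the 5-second batches.
    val perMinute = stream.window(Minutes(1), Minutes(1))

    perMinute.foreachRDD { (rdd, time) =>
      // "time" marks the end of the one-minute window and is fixed by the driver,
      // so it can serve as the per-minute directory name.
      val dir = s"/data/out/${time.milliseconds}"
      // rdd.saveAsNewAPIHadoopFile(dir, ...)   // as in the original question
    }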

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-22 Thread vetal king
Hi Cody, Thanks for your reply. A five-second batch and one-minute publishing interval is just a representative example. What we want is to group data over a certain frequency, and that frequency is configurable. One way we think it can be achieved is that a directory will be created per this frequency,
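
[Editor's note] A short sketch of that idea, not from the thread; frequencyMs and the base path are placeholders. The batch time is rounded down to the configured frequency, so every batch inside the interval targets the same folder.

    val frequencyMs = 60000L   // configurable grouping frequency, e.g. one minute

    stream.foreachRDD { (rdd, time) =>
      // Round the batch time down to the nearest multiple of the frequency.
      val bucket = time.milliseconds / frequencyMs * frequencyMs
      val dir = s"/data/out/$bucket/batch-${time.milliseconds}"
      // rdd.saveAsNewAPIHadoopFile(dir, ...)
    }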

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-22 Thread Cody Koeninger
If you want 1 minute granularity, why not use a 1 minute batch time? Also, HDFS is not a great match for this kind of thing, because of the small files issue. On Tue, Mar 22, 2016 at 12:26 PM, vetal king wrote: > We are using Spark 1.4 for Spark Streaming. Kafka is data
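
[Editor's note] A minimal sketch of Cody's suggestion, not from the thread: make the batch interval itself one minute, so each batch's output naturally covers one minute of data.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, StreamingContext}

    val conf = new SparkConf().setAppName("one-minute-batches")
    val ssc = new StreamingContext(conf, Minutes(1))   // one micro-batch per minute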

Problem using saveAsNewAPIHadoopFile API

2016-03-22 Thread vetal king
We are using Spark 1.4 for Spark Streaming. Kafka is the data source for the Spark stream. Records are published on Kafka every second. Our requirement is to store the records published on Kafka in a single folder per minute. The stream will read records every five seconds. For instance, records published
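
[Editor's note] A hedged end-to-end sketch of one way to meet that requirement, not the poster's code; the broker address, topic, and output path are placeholders. Each batch's time is rounded down to the minute and the batch is written into a sub-directory under that minute's folder, so the 5-second batches of one minute share a parent folder while each batch keeps its own output directory.

    import kafka.serializer.StringDecoder
    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object PerMinuteWriter {
      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("per-minute-writer"), Seconds(5))
        val kafkaParams = Map("metadata.broker.list" -> "broker:9092")   // placeholder
        val topics = Set("events")                                       // placeholder

        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, topics)

        stream.foreachRDD { (rdd, time) =>
          if (!rdd.isEmpty()) {
            // Round the batch time down to the minute; every 5-second batch in the
            // same minute then shares this parent folder.
            val minute = time.milliseconds / 60000 * 60000
            rdd.map { case (_, v) => (NullWritable.get(), new Text(v)) }
              .saveAsNewAPIHadoopFile(
                s"/data/out/$minute/batch-${time.milliseconds}",   // one sub-directory per batch
                classOf[NullWritable], classOf[Text],
                classOf[TextOutputFormat[NullWritable, Text]])
          }
        }
        ssc.start()
        ssc.awaitTermination()
      }
    }

Writing a fresh sub-directory per batch also avoids reusing an existing output path, which the new Hadoop API would otherwise reject (or silently overwrite if output-spec validation is disabled).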