Re: Spark Streaming idempotent writes to HDFS

2015-11-25 Thread Steve Loughran
On 25 Nov 2015, at 07:01, Michael wrote: So basically: write them into a temporary directory named with the batch time, then move the files to their destination on success? I wish there were a way to skip moving files around and be able to

Re: Spark Streaming idempotent writes to HDFS

2015-11-24 Thread Michael
So basically: write them into a temporary directory named with the batch time, then move the files to their destination on success? I wish there were a way to skip moving files around and be able to set the output filenames. Thanks Burak :) -Michael On Mon, Nov 23, 2015, at 09:19 PM,
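The write-then-rename pattern Michael summarizes can be sketched roughly as follows. This is a hypothetical illustration using the local filesystem in place of HDFS (function name, file layout, and the `part-00000` filename are all assumptions, not anything from the thread); on HDFS the same idea relies on `rename()` being atomic within a single filesystem, which is why the temp directory must live on the same filesystem as the destination.

```python
import os
import shutil
import tempfile

def write_batch_atomically(records, dest_root, batch_time):
    """Write a batch under a temp directory, then rename it into place.

    Sketch of the pattern discussed in the thread: the rename is the
    commit point, so a killed-and-restarted job that reprocesses the
    same batch either sees the committed directory (and skips) or
    leaves only an uncommitted temp directory behind.
    """
    final_dir = os.path.join(dest_root, str(batch_time))
    if os.path.exists(final_dir):
        return False  # batch already committed; skip for idempotence
    tmp_dir = tempfile.mkdtemp(prefix=f"_tmp-{batch_time}-", dir=dest_root)
    try:
        with open(os.path.join(tmp_dir, "part-00000"), "w") as f:
            for rec in records:
                f.write(rec + "\n")
        os.rename(tmp_dir, final_dir)  # commit: single directory rename
        return True
    except Exception:
        shutil.rmtree(tmp_dir, ignore_errors=True)  # abort: drop temp data
        raise
```

Calling this twice with the same `batch_time` writes the data once and returns `False` on the second call, which is the idempotence property being asked for.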

Spark Streaming idempotent writes to HDFS

2015-11-23 Thread Michael
Hi all, I'm working on a project with Spark Streaming; the goal is to process log files from S3 and save them on Hadoop to later analyze them with Spark SQL. Everything works well except when I kill the Spark application and restart it: it picks up from the latest processed batch and reprocesses it

Re: Spark Streaming idempotent writes to HDFS

2015-11-23 Thread Burak Yavuz
Not sure if it would be the most efficient, but maybe you can think of the filesystem as a key-value store and write each batch to a sub-directory, where the directory name is the batch time. If the directory already exists, then you shouldn't write it. Then you may have a following batch job
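Burak's filesystem-as-key-value-store idea reduces to one existence check per batch. A minimal sketch, again using the local filesystem as a stand-in for HDFS (the function name and layout are assumptions): the key is the batch time, the value is the output directory, and a restarted job consults the check before writing.

```python
import os

def should_write_batch(output_root, batch_time):
    """Treat the filesystem as a key-value store keyed by batch time.

    Each batch writes to output_root/<batch_time>; if that directory
    already exists, the batch was (at least partially) written by a
    previous run, so the restarted job should skip the write.
    """
    return not os.path.exists(os.path.join(output_root, str(batch_time)))
```

A follow-up batch job, as the reply suggests, could then sweep these per-batch directories, e.g. to validate or compact them, without the streaming job itself needing any external state store.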