On 25 Nov 2015, at 07:01, Michael wrote:
so basically writing them into a temporary directory named with the
batch time and then moving the files to their destination on success? I
wish there were a way to skip moving files around and be able to set
the output filenames.
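The temp-directory-then-move pattern discussed above can be sketched roughly as follows. This is an illustrative local-filesystem version, not Spark's own output committer; the helper name, file name, and record format are assumptions:

```python
import os
import shutil
import tempfile

def write_batch_atomically(records, dest_dir):
    """Write a batch to a temporary directory, then move it into place.

    Hypothetical helper illustrating write-then-rename: the destination
    directory only appears once every record has been written, so a
    crash mid-batch never leaves partial output at `dest_dir`.
    """
    parent = os.path.dirname(os.path.abspath(dest_dir))
    # Create the temp dir next to the destination so the rename stays
    # on the same filesystem (a cross-device rename would fail).
    tmp_dir = tempfile.mkdtemp(dir=parent)
    try:
        with open(os.path.join(tmp_dir, "part-00000"), "w") as f:
            for rec in records:
                f.write(rec + "\n")
        # The move only happens if every write above succeeded.
        os.rename(tmp_dir, dest_dir)
    except Exception:
        shutil.rmtree(tmp_dir, ignore_errors=True)  # discard partial output
        raise
```

On HDFS the analogous step would be a `FileSystem.rename` of the temporary directory, which is likewise a cheap metadata operation rather than a data copy.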
Thanks Burak :)
-Michael
On Mon, Nov 23, 2015, at 09:19 PM,
Hi all,
I'm working on a project with Spark Streaming; the goal is to process log
files from S3 and save them on Hadoop to later analyze them with
Spark SQL.
Everything works well except when I kill the Spark application and
restart it: it picks up from the latest processed batch and reprocesses
it.
Not sure if it would be the most efficient, but maybe you can think of the
filesystem as a key-value store and write each batch to a sub-directory,
where the directory name is the batch time. If the directory already
exists, then you shouldn't write it. Then you may have a following batch
job
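The suggestion above, using a batch-time-named directory as the key of an idempotent write, could be sketched like this. A minimal local-filesystem illustration; the function name, directory layout, and file name are assumptions, and a real Spark job would perform the equivalent check against HDFS or S3:

```python
import os

def write_batch_if_absent(batch_time, records, base_dir):
    """Treat the filesystem as a key-value store keyed by batch time.

    Illustrative sketch: if the batch directory already exists, a
    previous run wrote it, so the batch is skipped. Restarting and
    reprocessing the same batch therefore produces no duplicate output.
    Returns True if the batch was written, False if it was skipped.
    """
    batch_dir = os.path.join(base_dir, "batch-%s" % batch_time)
    if os.path.exists(batch_dir):
        return False  # already written by an earlier run
    os.makedirs(batch_dir)
    with open(os.path.join(batch_dir, "part-00000"), "w") as f:
        for rec in records:
            f.write(rec + "\n")
    return True
```

Because the batch time is deterministic across restarts, the existence check makes the write idempotent without any external bookkeeping.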