Some questions about the StreamingFileSink

2018-08-21 Thread Benoit MERIAUX
Hi, I have some questions about the new StreamingFileSink in 1.6. My usecase is pretty simple. I have a cassandra table with 150Millions of lines. They are partitioned by buckets of 100 000 lines. My job is to export each "bucket" to a file (1 bucket = 1 file), so the job is degined like this:

Re: Some questions about the StreamingFileSink

2018-08-22 Thread Kostas Kloudas
Hi Benoit, Thanks for using the StreamingFileSink. My answers/explanations are inlined. In most of your observations, you are correct. > On Aug 21, 2018, at 11:45 PM, Benoit MERIAUX wrote: > > Hi, > > I have some questions about the new StreamingFileSink in 1.6. > > My usecase is pretty simpl

Re: Some questions about the StreamingFileSink

2018-08-22 Thread Gyula Fóra
Hi Kostas, Sorry for jumping in on this discussion :) What you suggest for finite sources and waiting for checkpoints is pretty ugly in many cases. Especially if you would otherwise read from a finite source (a file for instance) and want to end the job asap. Would it make sense to not discard a

Re: Some questions about the StreamingFileSink

2018-08-22 Thread Benoit MERIAUX
Thanks for the detailed answer. The actual behavior is correct and due to the legacy which do not make a difference between success and failure when closing the sink. So the workaround is to use a short bucket interval to commit the last received data and wait for the next checkpoint (how do I do i