Re: retention policy for spark structured streaming dataset

2018-03-14 Thread Sunil Parmar
Can you use partitioning (by day)? That will make it easier to drop data older than x days outside the streaming job. Sunil Parmar On Wed, Mar 14, 2018 at 11:36 AM, Lian Jiang <jiangok2...@gmail.com> wrote: > I have a Spark structured streaming job which dumps data into a parqu…
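
A minimal sketch of this suggestion in Scala (the Kafka broker, topic, paths, and day column are assumptions, not from the thread): the streaming job partitions its parquet output by day, and a separate batch job drops partition directories older than the retention window, outside the streaming job.

import java.time.LocalDate
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("retention-sketch").getOrCreate()
import spark.implicits._

// Streaming side: derive a day column and partition the parquet sink by it.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092") // assumed broker
  .option("subscribe", "events")                    // assumed topic
  .load()
  .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
  .withColumn("day", date_format($"timestamp", "yyyy-MM-dd"))

events.writeStream
  .format("parquet")
  .option("path", "/data/events")                   // assumed output path
  .option("checkpointLocation", "/chk/events")
  .partitionBy("day")                               // one directory per day
  .start()

// Retention side (separate scheduled batch job): delete day=... directories
// older than N days. Readers that list directories directly, such as Impala,
// simply stop seeing the dropped days.
val retentionDays = 30
val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val cutoff = LocalDate.now().minusDays(retentionDays)
fs.listStatus(new Path("/data/events"))
  .filter(_.getPath.getName.startsWith("day="))
  .foreach { status =>
    val day = LocalDate.parse(status.getPath.getName.stripPrefix("day="))
    if (day.isBefore(cutoff)) fs.delete(status.getPath, true) // recursive delete
  }

One caveat: the structured streaming file sink also keeps a _spark_metadata log that this out-of-band delete does not update; Spark readers consult that log, while Impala lists the directories directly.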

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-03-05 Thread Sunil Parmar
We use Impala to access the parquet files in those directories. Any pointers on achieving at-least-once semantics with Spark Streaming, or on handling partial files? Sunil Parmar On Fri, Mar 2, 2018 at 2:57 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote: > Structured Streaming's file si…

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-03-02 Thread Sunil Parmar
…to deal with partial files by writing .tmp files and renaming them as the last step. We only commit offsets after the rename is successful. This way we get at-least-once semantics and avoid the partial-file-write issue. Thoughts? Sunil Parmar On Wed, Feb 28, 2018 at 1:59 PM, Tathagata Das <tathagata.d…
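
A minimal sketch of the pattern described above in Scala (stream is assumed to be a Kafka direct stream over ConsumerRecord[String, String], and the /data/out paths are hypothetical):

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  if (!rdd.isEmpty) {
    val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    val spark   = SparkSession.builder.getOrCreate()
    import spark.implicits._

    val batchId  = System.currentTimeMillis
    val tmpDir   = new Path(s"/data/out/.batch-$batchId.tmp") // dot-prefixed: hidden from readers
    val finalDir = new Path(s"/data/out/batch-$batchId")

    // 1. Write the whole micro-batch under the temporary name.
    rdd.map(_.value).toDF("value").write.parquet(tmpDir.toString)

    // 2. Rename into place; on HDFS this is a metadata-only operation,
    //    so readers such as Impala never list a half-written directory.
    val fs = FileSystem.get(rdd.sparkContext.hadoopConfiguration)
    require(fs.rename(tmpDir, finalDir), s"rename failed for $tmpDir")

    // 3. Commit Kafka offsets only after the rename succeeds. A crash
    //    before this line replays the batch on restart: at-least-once.
    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsets)
  }
}

Note that a replayed batch can leave a duplicate directory under this scheme; deriving the batch name deterministically from the offset range instead of the wall clock would let the job detect and skip a batch that was already renamed into place.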