Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-03-05 Thread Sunil Parmar
We use Impala to access parquet files in the directories. Any pointers on achieving at-least-once semantics with Spark Streaming, or on handling partial files? Sunil Parmar

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-03-02 Thread Tathagata Das
Structured Streaming's file sink solves these problems by writing a log/manifest of all the authoritative files written out (for any format). So if you run batch or interactive queries on the output directory with Spark, it will automatically read the manifest and only process files that are present in the manifest.
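
[Editor's sketch] A minimal illustration of the file-sink approach described above, for readers who can use Structured Streaming; the broker address, topic name, and paths are placeholders, not from the thread:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-to-parquet").getOrCreate()

val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // assumed broker address
  .option("subscribe", "events")                      // assumed topic name
  .load()

val query = kafkaDF
  .selectExpr("CAST(value AS STRING) AS json")
  .writeStream
  .format("parquet")
  .option("path", "/data/events_parquet")             // assumed output directory
  .option("checkpointLocation", "/data/events_chk")   // assumed checkpoint directory
  .start()

// Batch or interactive reads of the same directory go through the sink's
// manifest, so files from incomplete batches are ignored:
val committed = spark.read.parquet("/data/events_parquet")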

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-03-02 Thread Sunil Parmar
Is there a way to get finer control over file writing in the parquet file writer? We have a streaming application using Apache Apex (on a path of migration to Spark ... story for a different thread). The existing streaming application reads JSON from Kafka and writes Parquet to HDFS. We're trying to

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-02-28 Thread Tathagata Das
There is no good way to save to parquet without causing downstream consistency issues. You could use foreachRDD to get each RDD, convert it to DataFrame/Dataset, and write out as parquet files. But you will later run into issues with partial files caused by failures, etc.
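
[Editor's sketch] A rough illustration of the foreachRDD approach outlined above; it assumes kafkaStream is the DStream produced by KafkaUtils.createDirectStream in the original question, and the Event case class and output path are hypothetical. As noted, this path has no manifest, so failed batches can leave partial files behind:

import org.apache.spark.sql.SparkSession

case class Event(key: String, value: String)   // hypothetical record type

// kafkaStream: DStream[ConsumerRecord[String, String]] from createDirectStream
kafkaStream.foreachRDD { rdd =>
  val spark = SparkSession.builder().getOrCreate()
  import spark.implicits._

  val df = rdd.map(r => Event(r.key, r.value)).toDF()
  // Append mode; a failed batch can still leave partial or duplicate files,
  // which is the caveat raised above.
  df.write.mode("append").parquet("/data/events_parquet")   // assumed output path
}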

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-02-28 Thread Patrick Alwell
I don’t think SQLContext is “deprecated” in this sense. It’s still accessible in earlier versions of Spark. But yes, at first glance it looks like you are correct. I don’t see a recordWriter method for parquet outside of the SQL package.
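
[Editor's sketch] For illustration, parquet writing being exposed through the SQL/DataFrame package (DataFrameWriter) looks roughly like this; the paths are placeholders:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-write").getOrCreate()

// Parquet output goes through the DataFrame API rather than a standalone record writer:
val df = spark.read.json("/data/input_json")   // assumed input path
df.write.parquet("/data/output_parquet")       // assumed output path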

[Beginner] How to save Kafka Dstream data to parquet ?

2018-02-28 Thread karthikus
Hi all, I have Kafka stream data and I need to save it in parquet format without using Structured Streaming (due to the lack of Kafka message header support). val kafkaStream = KafkaUtils.createDirectStream( streamingContext,
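
[Editor's sketch] The snippet above is cut off in the archive. A hedged reconstruction of a typical direct-stream setup with the spark-streaming-kafka-0-10 API might look like the following; the broker address, group id, and topic name are assumptions:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",             // assumed broker address
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "parquet-writer",                   // assumed consumer group
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val kafkaStream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](Seq("events"), kafkaParams)   // assumed topic
)

// The parquet write itself can then follow the foreachRDD pattern sketched
// earlier in the thread.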