We use Impala to access the Parquet files in those directories. Any pointers on
achieving at-least-once semantics with Spark Streaming, or on dealing with partial files?
Sunil Parmar
On Fri, Mar 2, 2018 at 2:57 PM, Tathagata Das wrote:
Structured Streaming's file sink solves these problems by writing a
log/manifest of all the authoritative files written out (for any format).
So if you run batch or interactive queries on the output directory with
Spark, it will automatically read the manifest and only process the files
that are recorded in it.
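For reference, a minimal sketch of that approach (the broker address, topic name, and paths are illustrative assumptions, not from this thread):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-to-parquet").getOrCreate()

// Read from Kafka with the Structured Streaming source.
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// The parquet file sink writes a _spark_metadata manifest next to the data,
// so only fully committed files are visible to later Spark reads.
val query = input.selectExpr("CAST(value AS STRING) AS json")
  .writeStream
  .format("parquet")
  .option("path", "/data/events_parquet")
  .option("checkpointLocation", "/data/events_checkpoint")
  .start()

// Batch or interactive queries on the same directory go through the manifest:
val committed = spark.read.parquet("/data/events_parquet")

Note that non-Spark readers such as Impala do not consult this manifest, which is what the Impala question above is getting at.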
Is there a way to get finer control over file writing in the Parquet file
writer?
We have a streaming application using Apache Apex (on a path of migration to
Spark ...story for a different thread). The existing streaming application
reads JSON from Kafka and writes Parquet to HDFS. We're trying to
There is no good way to save to parquet without causing downstream
consistency issues.
You could use foreachRDD to get each RDD, convert it to DataFrame/Dataset,
and write out as parquet files. But you will later run into issues with
partial files caused by failures, etc.
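A minimal sketch of that foreachRDD pattern (assuming jsonStream is a DStream[String] of JSON records; the output path is a placeholder), with the partial-file caveat above still applying if a batch fails mid-write:

import org.apache.spark.sql.SparkSession

jsonStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Reuse (or lazily create) a SparkSession inside the streaming job.
    val spark = SparkSession.builder().getOrCreate()
    import spark.implicits._
    // Parse the JSON strings into a DataFrame and append as Parquet.
    val df = spark.read.json(rdd.toDS())
    df.write.mode("append").parquet("/data/events_parquet")
  }
}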
On Wed, Feb 28, 2018 at
I don’t think SQLContext is “deprecated” in that sense; it’s still accessible
from earlier versions of Spark.
But yes, at first glance it looks like you are correct: I don’t see a
recordWriter method for Parquet outside of the SQL package.
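To make that concrete (the names and paths below are just examples): Parquet output in Spark is exposed through the SQL package's DataFrameWriter, and SQLContext itself is still reachable even though SparkSession is now the preferred entry point:

import org.apache.spark.sql.{SparkSession, SQLContext}

val spark = SparkSession.builder().appName("parquet-writer").getOrCreate()
val sqlContext: SQLContext = spark.sqlContext   // still available, not removed

// Writing Parquet goes through org.apache.spark.sql.DataFrameWriter:
val df = spark.range(10).toDF("id")
df.write.parquet("/tmp/example_parquet")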
Hi all,
I have Kafka stream data and I need to save it in Parquet format without
using Structured Streaming (due to the lack of Kafka message header
support).
val kafkaStream =
KafkaUtils.createDirectStream(
streamingContext,