Whilst working on this application, I found a setting that drastically
improved the performance of my particular Spark Streaming application. I'm
sharing the details in hopes that it may help somebody in a similar
situation.

As my program ingested information into HDFS (as parquet files), I noticed
that the time to process each batch was significantly greater than I
anticipated. Whether I was writing a single parquet file (around 8KB) or
around 10-15 files (8KB each), that step of the processing took
around 30 seconds. Once I set the configuration below, that same operation
dropped from around 30 seconds to around 1 second.

// ssc = instance of org.apache.spark.streaming.StreamingContext
ssc.sparkContext.hadoopConfiguration
  .set("parquet.enable.summary-metadata", "false")
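For anyone who would rather keep this out of application code: as I understand it, Spark copies any property prefixed with "spark.hadoop." into the Hadoop configuration, so the same setting should also be achievable at submit time. A sketch (the jar and class names below are just placeholders):

```shell
# Pass the Hadoop property via spark-submit instead of setting it in code.
# "spark.hadoop.*" properties are forwarded into sparkContext.hadoopConfiguration.
spark-submit \
  --class com.example.MyStreamingApp \
  --conf spark.hadoop.parquet.enable.summary-metadata=false \
  my-streaming-app.jar
```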

I've also verified that the parquet files being generated are usable by
both Hive and Impala.

Hope that helps!
Kevin

On Thu, Oct 6, 2016 at 4:22 PM, Kevin Mellott <kevin.r.mell...@gmail.com>
wrote:

> I'm attempting to implement a Spark Streaming application that will
> consume application log messages from a message broker and store the
> information in HDFS. During the data ingestion, we apply a custom schema to
> the logs, partition by application name and log date, and then save the
> information as parquet files.
>
> All of this works great, except we end up having a large number of parquet
> files created. It's my understanding that Spark Streaming is unable to
> control the number of files that get generated in each partition; can
> anybody confirm whether that is true?
>
> Also, has anybody else run into a similar situation regarding data
> ingestion with Spark Streaming and do you have any tips to share? Our end
> goal is to store the information in a way that makes it efficient to query,
> using a tool like Hive or Impala.
>
> Thanks,
> Kevin
>
