Whilst working on this application, I found a setting that drastically improved the performance of my particular Spark Streaming application. I'm sharing the details in hopes that it may help somebody in a similar situation.
As my program ingested information into HDFS (as parquet files), I noticed that the time to process each batch was significantly greater than I anticipated. Whether I was writing a single parquet file (around 8KB) or 10-15 files (8KB each), that step of the processing took around 30 seconds. Once I set the configuration below, the time for this step dropped from 30 seconds to around 1 second.

    // ssc = instance of StreamingContext
    ssc.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

I've also verified that the parquet files being generated are usable by both Hive and Impala.

Hope that helps!
Kevin

On Thu, Oct 6, 2016 at 4:22 PM, Kevin Mellott <kevin.r.mell...@gmail.com> wrote:

> I'm attempting to implement a Spark Streaming application that will
> consume application log messages from a message broker and store the
> information in HDFS. During the data ingestion, we apply a custom schema to
> the logs, partition by application name and log date, and then save the
> information as parquet files.
>
> All of this works great, except we end up having a large number of parquet
> files created. It's my understanding that Spark Streaming is unable to
> control the number of files that get generated in each partition; can
> anybody confirm that is true?
>
> Also, has anybody else run into a similar situation regarding data
> ingestion with Spark Streaming and do you have any tips to share? Our end
> goal is to store the information in a way that makes it efficient to query,
> using a tool like Hive or Impala.
>
> Thanks,
> Kevin
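
For readers who want to see where the parquet.enable.summary-metadata setting might sit in a complete job, here is a minimal sketch, assuming Spark 2.x. It is not the application described in this thread: the socket source, the tab-separated record layout, the LogLine case class, the 30-second batch interval, and the output path are all hypothetical stand-ins chosen for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SaveMode, SparkSession}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ParquetIngestSketch {
  // Hypothetical record shape; the real schema is not given in the thread.
  case class LogLine(app: String, logDate: String, message: String)

  def main(args: Array[String]): Unit = {
    // The master URL is expected to come from spark-submit.
    val conf = new SparkConf().setAppName("parquet-ingest-sketch")
    val ssc = new StreamingContext(conf, Seconds(30)) // batch interval is an assumption

    // The setting from the message above: skip writing the Parquet
    // summary metadata files on every write.
    ssc.sparkContext.hadoopConfiguration
      .set("parquet.enable.summary-metadata", "false")

    // Placeholder source; the original job reads from a message broker.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        val spark = SparkSession.builder
          .config(rdd.sparkContext.getConf)
          .getOrCreate()
        import spark.implicits._

        // Assumed input format: app<TAB>date<TAB>message
        val logs = rdd
          .map(_.split("\t", 3))
          .filter(_.length == 3)
          .map(f => LogLine(f(0), f(1), f(2)))
          .toDF()

        // Partition by application name and log date, as described above.
        logs.write
          .mode(SaveMode.Append)
          .partitionBy("app", "logDate")
          .parquet("hdfs:///tmp/logs_parquet") // hypothetical output path
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```

The configuration call is made once on the driver before the stream starts, so every later write in foreachRDD picks it up. Whether it yields a similar speedup elsewhere will depend on the Spark and Parquet versions in use.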