Your files are too small; that many small files has a significant impact on the NameNode. Consider HBase, or perhaps Apache HAWQ, to store small writes.
> On 10 Oct 2016, at 16:25, Kevin Mellott <kevin.r.mell...@gmail.com> wrote:
>
> Whilst working on this application, I found a setting that drastically improved the performance of my particular Spark Streaming application. I'm sharing the details in hopes that it may help somebody in a similar situation.
>
> As my program ingested information into HDFS (as parquet files), I noticed that the time to process each batch was significantly greater than I anticipated. Whether I was writing a single parquet file (around 8KB) or around 10-15 files (8KB each), that step of the processing was taking around 30 seconds. Once I set the configuration below, this operation reduced from 30 seconds to around 1 second.
>
> // ssc = instance of StreamingContext
> ssc.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
>
> I've also verified that the parquet files being generated are usable by both Hive and Impala.
>
> Hope that helps!
> Kevin
>
>> On Thu, Oct 6, 2016 at 4:22 PM, Kevin Mellott <kevin.r.mell...@gmail.com> wrote:
>> I'm attempting to implement a Spark Streaming application that will consume application log messages from a message broker and store the information in HDFS. During the data ingestion, we apply a custom schema to the logs, partition by application name and log date, and then save the information as parquet files.
>>
>> All of this works great, except we end up having a large number of parquet files created. It's my understanding that Spark Streaming is unable to control the number of files that get generated in each partition; can anybody confirm that is true?
>>
>> Also, has anybody else run into a similar situation regarding data ingestion with Spark Streaming and do you have any tips to share? Our end goal is to store the information in a way that makes it efficient to query, using a tool like Hive or Impala.
>>
>> Thanks,
>> Kevin
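
For anyone wanting to try the quoted fix, here is a minimal sketch of where the setting goes. The app name and batch interval are illustrative assumptions, not from the thread; only the `parquet.enable.summary-metadata` line comes from Kevin's mail.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Illustrative setup; "log-ingest" and the 30s batch interval are assumptions.
val conf = new SparkConf().setAppName("log-ingest")
val ssc = new StreamingContext(conf, Seconds(30))

// Disable Parquet summary-metadata files (_metadata / _common_metadata).
// Generating them requires reading the footer of every output file at
// commit time, which can dominate batch time when many small files are
// written, as described in the quoted mail.
ssc.sparkContext.hadoopConfiguration
  .set("parquet.enable.summary-metadata", "false")
```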