Your files are too small, and that has a significant impact on the NameNode. Consider 
HBase, or perhaps HAWQ, to store small writes.
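
If the data does need to stay on HDFS as parquet, one common mitigation is to 
coalesce each micro-batch before writing, so every batch emits a bounded number of 
files per partition. A rough sketch (the DataFrame name `logs`, the output path, 
and the partition columns are placeholders for whatever your job actually uses):

```scala
// Sketch: cap the number of parquet files emitted per micro-batch.
// `logs` is assumed to be the DataFrame built from the batch's log messages.
logs
  .coalesce(1)  // one output file per partition directory for this batch
  .write
  .partitionBy("application", "log_date")
  .mode("append")
  .parquet("hdfs:///data/app_logs")
```

The trade-off is that coalesce(1) funnels each batch's write through a single 
task, so for larger batches a small n > 1 usually balances file count against 
write parallelism better.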

> On 10 Oct 2016, at 16:25, Kevin Mellott <kevin.r.mell...@gmail.com> wrote:
> 
> Whilst working on this application, I found a setting that drastically 
> improved the performance of my particular Spark Streaming application. I'm 
> sharing the details in hopes that it may help somebody in a similar situation.
> 
> As my program ingested information into HDFS (as parquet files), I noticed 
> that the time to process each batch was significantly greater than I 
> anticipated. Whether I was writing a single parquet file (around 8KB) or 
> around 10-15 files (8KB each), that step of the processing was taking around 
> 30 seconds. Once I set the configuration below, this operation reduced from 
> 30 seconds to around 1 second.
> 
> // ssc = instance of StreamingContext
> ssc.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
> 
> I've also verified that the parquet files being generated are usable by both 
> Hive and Impala.
> 
> Hope that helps!
> Kevin
> 
>> On Thu, Oct 6, 2016 at 4:22 PM, Kevin Mellott <kevin.r.mell...@gmail.com> 
>> wrote:
>> I'm attempting to implement a Spark Streaming application that will consume 
>> application log messages from a message broker and store the information in 
>> HDFS. During the data ingestion, we apply a custom schema to the logs, 
>> partition by application name and log date, and then save the information as 
>> parquet files.
>> 
>> All of this works great, except we end up with a large number of parquet 
>> files being created. It's my understanding that Spark Streaming is unable to 
>> control the number of files generated in each partition; can anybody 
>> confirm whether that is true?
>> 
>> Also, has anybody else run into a similar situation regarding data ingestion 
>> with Spark Streaming and do you have any tips to share? Our end goal is to 
>> store the information in a way that makes it efficient to query, using a 
>> tool like Hive or Impala.
>> 
>> Thanks,
>> Kevin
> 
