The batch interval was originally set to 30 seconds; once I got the parquet
files saving faster, I lowered it to 10 seconds. The number of log messages in
each batch varied from just a few up to around 3,500, and the number of
partitions ranged from 1 to around 15.
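For reference, the interval is set when the streaming context is created. A
minimal sketch (the application name is just a placeholder):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("log-ingest")  // placeholder name
// 10-second batch interval; each batch carried a few to ~3,500 messages.
val ssc = new StreamingContext(conf, Seconds(10))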

I will have to check out HBase as well; I've heard good things!

Thanks,
Kevin

On Mon, Oct 10, 2016 at 11:38 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi Kevin,
>
> What is the streaming interval (batch interval) above?
>
> I do analytics on streaming trade data, but after manipulating the
> individual messages I store the selected ones in HBase. Very fast.
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 10 October 2016 at 15:25, Kevin Mellott <kevin.r.mell...@gmail.com>
> wrote:
>
>> Whilst working on this application, I found a setting that drastically
>> improved the performance of my Spark Streaming job. I'm sharing the details
>> in hopes that it may help somebody in a similar situation.
>>
>> As my program ingested information into HDFS (as parquet files), I noticed
>> that the time to process each batch was significantly greater than I had
>> anticipated. Whether I was writing a single parquet file (around 8 KB) or
>> around 10-15 files (8 KB each), that step of the processing was taking
>> around 30 seconds. With the configuration below, it dropped from around 30
>> seconds to around 1 second.
>>
>> // ssc = instance of StreamingContext
>> // Skips writing the _metadata / _common_metadata summary files when each
>> // batch of parquet files is committed.
>> ssc.sparkContext.hadoopConfiguration
>>   .set("parquet.enable.summary-metadata", "false")
>>
>> I've also verified that the parquet files being generated are usable by
>> both Hive and Impala.
>>
>> Hope that helps!
>> Kevin
>>
>> On Thu, Oct 6, 2016 at 4:22 PM, Kevin Mellott <kevin.r.mell...@gmail.com>
>> wrote:
>>
>>> I'm attempting to implement a Spark Streaming application that will
>>> consume application log messages from a message broker and store the
>>> information in HDFS. During the data ingestion, we apply a custom schema to
>>> the logs, partition by application name and log date, and then save the
>>> information as parquet files.
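>>>
>>> In simplified form, the ingest looks roughly like the sketch below; the
>>> schema, parseLog (String => Row), logStream, sqlContext, and the output
>>> path are placeholders rather than our real code.
>>>
>>> import org.apache.spark.sql.Row
>>> import org.apache.spark.sql.types.{StringType, StructField, StructType}
>>>
>>> // Placeholder schema; the real one carries the full set of log fields.
>>> val logSchema = StructType(Seq(
>>>   StructField("application", StringType),
>>>   StructField("log_date", StringType),
>>>   StructField("message", StringType)))
>>>
>>> logStream.foreachRDD { rdd =>
>>>   val logsDF = sqlContext.createDataFrame(rdd.map(parseLog), logSchema)
>>>   logsDF.write
>>>     .mode("append")
>>>     .partitionBy("application", "log_date")
>>>     .parquet("hdfs:///data/app_logs")
>>> }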
>>>
>>> All of this works great, except that we end up with a large number of
>>> parquet files. It's my understanding that Spark Streaming cannot control
>>> the number of files that get generated in each partition; can anybody
>>> confirm whether that is true?
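>>>
>>> If the answer turns out to be coalescing each batch ourselves before the
>>> write, I imagine it would look roughly like this (building on the sketch
>>> above; the coalesce(4) is arbitrary):
>>>
>>> logStream.foreachRDD { rdd =>
>>>   val logsDF = sqlContext.createDataFrame(rdd.map(parseLog), logSchema)
>>>   logsDF.coalesce(4)  // fewer write tasks => fewer parquet files per batch
>>>     .write
>>>     .mode("append")
>>>     .partitionBy("application", "log_date")
>>>     .parquet("hdfs:///data/app_logs")
>>> }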
>>>
>>> Also, has anybody else run into a similar situation with data ingestion in
>>> Spark Streaming, and do you have any tips to share? Our end goal is to
>>> store the information in a way that makes it efficient to query with a
>>> tool like Hive or Impala.
>>>
>>> Thanks,
>>> Kevin
>>>
>>
>>
>
