The batch interval was originally set to 30 seconds; however, once the parquet files began saving faster I lowered it to 10 seconds. The number of log messages in each batch varied from just a few up to around 3,500, with the number of partitions ranging from 1 to around 15.
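For reference, the batch interval is fixed when the StreamingContext is created, so lowering it means rebuilding the context. A minimal sketch of how that interval is set (the app name and master here are placeholders, not from the thread):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Placeholder app name and master; the interval is the part that matters.
    val conf = new SparkConf().setAppName("log-ingest").setMaster("local[*]")
    val ssc = new StreamingContext(conf, Seconds(10))  // 10-second batches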
I will have to check out HBase as well; I've heard good things!

Thanks,
Kevin

On Mon, Oct 10, 2016 at 11:38 AM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hi Kevin,
>
> What is the streaming interval (batch interval) above?
>
> I do analytics on streaming trade data, but after manipulating the
> individual messages I store the selected ones in HBase. Very fast.
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 10 October 2016 at 15:25, Kevin Mellott <kevin.r.mell...@gmail.com> wrote:
>
>> Whilst working on this application, I found a setting that drastically
>> improved the performance of my particular Spark Streaming application.
>> I'm sharing the details in hopes that it may help somebody in a similar
>> situation.
>>
>> As my program ingested information into HDFS (as parquet files), I
>> noticed that the time to process each batch was significantly greater
>> than I anticipated. Whether I was writing a single parquet file (around
>> 8KB) or around 10-15 files (8KB each), that step of the processing was
>> taking around 30 seconds. Once I set the configuration below, this
>> operation went from around 30 seconds to around 1 second.
>>
>> // ssc = instance of StreamingContext
>> ssc.sparkContext.hadoopConfiguration
>>   .set("parquet.enable.summary-metadata", "false")
>>
>> I've also verified that the parquet files being generated are usable by
>> both Hive and Impala.
>>
>> Hope that helps!
>> Kevin
>>
>> On Thu, Oct 6, 2016 at 4:22 PM, Kevin Mellott <kevin.r.mell...@gmail.com> wrote:
>>
>>> I'm attempting to implement a Spark Streaming application that will
>>> consume application log messages from a message broker and store the
>>> information in HDFS. During the data ingestion, we apply a custom
>>> schema to the logs, partition by application name and log date, and
>>> then save the information as parquet files.
>>>
>>> All of this works great, except that we end up having a large number
>>> of parquet files created. It's my understanding that Spark Streaming
>>> is unable to control the number of files that get generated in each
>>> partition; can anybody confirm whether that is true?
>>>
>>> Also, has anybody else run into a similar situation regarding data
>>> ingestion with Spark Streaming, and do you have any tips to share? Our
>>> end goal is to store the information in a way that makes it efficient
>>> to query, using a tool like Hive or Impala.
>>>
>>> Thanks,
>>> Kevin
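To pull the pieces of the thread together, here is a minimal sketch of how the summary-metadata setting and the partitioned parquet write described above could fit into one DStream pipeline. The LogEntry case class, the queueStream source (standing in for the real message broker), and the HDFS output path are all hypothetical stand-ins, not details from the thread:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import scala.collection.mutable

    // Hypothetical schema for the ingested log messages.
    case class LogEntry(appName: String, logDate: String, message: String)

    val spark = SparkSession.builder().appName("log-ingest").getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

    // Disable per-file summary metadata before any parquet writes happen;
    // this is the setting that cut the write step from ~30s to ~1s above.
    ssc.sparkContext.hadoopConfiguration
      .set("parquet.enable.summary-metadata", "false")

    // queueStream stands in for the real broker source.
    val queue = mutable.Queue(spark.sparkContext.parallelize(Seq(
      LogEntry("billing", "2016-10-10", "app started"))))
    val logStream = ssc.queueStream(queue)

    logStream.foreachRDD { rdd =>
      import spark.implicits._
      rdd.toDF()
        .write
        .partitionBy("appName", "logDate")  // partitioning described in the thread
        .mode("append")
        .parquet("hdfs:///data/app-logs")   // hypothetical output path
    }

    ssc.start()
    ssc.awaitTermination()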
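On the file-count question in the first message: each micro-batch writes one file per RDD partition per output directory, so one common mitigation (a suggestion on my part, not something confirmed in the thread) is to coalesce each batch's data before writing. This replaces the write body in the sketch above:

    // Coalescing to one task yields one parquet file per appName/logDate
    // directory per batch -- fewer files, at the cost of write parallelism.
    logStream.foreachRDD { rdd =>
      import spark.implicits._
      rdd.toDF()
        .coalesce(1)
        .write
        .partitionBy("appName", "logDate")
        .mode("append")
        .parquet("hdfs:///data/app-logs")
    }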