Spark Streaming - Increasing number of executors slows down processing rate

2017-06-19 Thread Mal Edwin
Hi All,
I am struggling with an odd issue and would like your help in addressing it.

Environment
AWS Cluster (40 Spark Nodes & 4 node Kafka cluster)
Spark Kafka Streaming submitted in Yarn cluster mode
Kafka - Single topic, 400 partitions
Spark 2.1 on Cloudera
Kafka 0.10 on Cloudera

With zero messages in Kafka, we start this Spark job with 100 executors, each 
with 14 GB of RAM and a single executor core.
The time to process 0 records (i.e., an empty batch) is 5 seconds.

When we increase the executors to 400, keeping everything else the same except 
reducing executor memory to 11 GB, the time to process 0 records (an empty 
batch) increases tenfold to 50 seconds, and in some cases reaches 103 seconds.

Spark Streaming configs that we are setting:
Batch window = 60 seconds
spark.streaming.backpressure.enabled = true
spark.memory.fraction = 0.3 (we store more data in our own data structures)
spark.streaming.kafka.consumer.poll.ms = 1

We have also tried increasing driver memory to 4 GB and driver cores to 4.
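For reference, here is a sketch of how the settings above might look as spark-defaults.conf entries. The values are taken from the post; the keys are assumed to be the standard Spark ones, and the batch window is set in code rather than via a conf key:

```properties
# Sketch of the settings described above (assumed standard Spark keys)
spark.streaming.backpressure.enabled=true
spark.memory.fraction=0.3
spark.streaming.kafka.consumer.poll.ms=1
spark.driver.memory=4g
spark.driver.cores=4
# The 60s batch window is set in code: new StreamingContext(conf, Seconds(60))
```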

If anybody has faced similar issues, please share some pointers on how to 
address them.

Thanks a lot for your time.

Regards,
Edwin



Re: Spark Streaming from Kafka, deal with initial heavy load.

2017-03-18 Thread Mal Edwin

Hi,
You can enable backpressure to handle this.

spark.streaming.backpressure.enabled
spark.streaming.receiver.maxRate
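As a sketch of how these might be set (the rate values below are made-up examples, not recommendations; note that spark.streaming.receiver.maxRate applies to receiver-based streams, while direct Kafka streams use the per-partition variant):

```properties
spark.streaming.backpressure.enabled=true
# Receiver-based streams: cap in records/sec per receiver (example value)
spark.streaming.receiver.maxRate=10000
# Direct (receiver-less) Kafka streams: cap in records/sec per partition (example value)
spark.streaming.kafka.maxRatePerPartition=1000
```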

Thanks,
Edwin

On Mar 18, 2017, 12:53 AM -0400, sagarcasual . , wrote:
> Hi, we have Spark 1.6.1 streaming from a Kafka (0.10.1) topic using the direct 
> approach. The streaming part works fine, but when we initially start the job, 
> we have to deal with a really huge Kafka message backlog, millions of messages; 
> that first batch runs for over 40 hours, and after 12 hours or so it becomes 
> very slow: it keeps crunching messages, but at a very low speed. 
> Any idea how to overcome this issue? Once the job is all caught up, 
> subsequent batches are quick since the load is really tiny to 
> process. So any idea how to avoid this problem?




RE: RE: Fast write datastore...

2017-03-16 Thread Mal Edwin
Hi All,
I believe what we are looking for here is a serving layer where user queries 
can be executed on a subset of processed data.
In this scenario, we are using Impala, as it provides layered caching: in our 
use case it caches some of the data in memory, some in HDFS, and the full set 
is on S3.

Our processing layer is Spark Streaming + HBase —> extracts to Parquet on S3 —> 
Impala as the serving layer handling user requests. Impala also has a SQL interface. 
The drawback is that Impala is not managed via YARN and has its own resource 
manager, so you would have to figure out a way to make YARN and Impala co-exist.
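As a rough sketch of the serving step (the table name, columns, and S3 path below are hypothetical, for illustration only), the Parquet extract on S3 can be exposed to Impala as an external table:

```sql
-- Hypothetical example: expose Spark-written Parquet on S3 to Impala
CREATE EXTERNAL TABLE events_serving (
  event_id   STRING,
  event_time TIMESTAMP,
  payload    STRING
)
STORED AS PARQUET
LOCATION 's3a://my-bucket/events/parquet/';
```

Impala then serves user SQL queries against this table directly.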

Thanks,
Edwin

On Mar 16, 2017, 5:44 AM -0400, yohann jardin , wrote:
> Hello everyone,
>
> I'm also really interested in the answers, as I will be facing the same issue 
> soon.
> Muthu, if you evaluate Apache Ignite again, can you share your results? I 
> also noticed Alluxio, which stores Spark results in memory and might be worth 
> investigating.
>
> In my case I want to use them for a real-time dashboard (or one that takes 
> a few seconds to refine), and that use case seems similar to your 
> filtering/aggregating of previously computed Spark results.
>
> Regards,
> Yohann
>
> From: Rick Moritz 
> Sent: Thursday, March 16, 2017 10:37
> To: user
> Subject: Re: RE: Fast write datastore...
>
> If you have enough RAM/SSDs available, maybe tiered HDFS storage and Parquet 
> might also be an option. Of course, management-wise it has much more overhead 
> than using ES, since you need to manually define partitions and buckets, 
> which is suboptimal. On the other hand, for querying, you can probably get 
> some decent performance by hooking up Impala or Presto or LLAP-Hive, if Spark 
> were too slow/cumbersome.
> Depending on your particular access patterns, this may not be very practical, 
> but as a general approach it might be one way to get intermediate results 
> quicker, and with less of a storage-zoo than some alternatives.
>
> > On Thu, Mar 16, 2017 at 7:57 AM, Shiva Ramagopal  wrote:
> > > I do think Kafka is overkill in this case. There are no streaming 
> > > use-cases that need a queue to do pub-sub.
> > >
> > > > On 16-Mar-2017 11:47 AM, "vvshvv"  wrote:
> > > > > Hi,
> > > > >
> > > > > >> A slightly over-kill solution may be Spark to Kafka to 
> > > > > >> ElasticSearch?
> > > > >
> > > > > I do not think so; in this case you will still be able to process Parquet 
> > > > > files as usual, and Kafka will allow your Elasticsearch cluster to stay 
> > > > > stable and survive regardless of the number of rows.
> > > > >
> > > > > Regards,
> > > > > Uladzimir
> > > > >
> > > > >
> > > > >
> > > > > On jasbir.s...@accenture.com, Mar 16, 2017 7:52 AM wrote:
> > > > > > Hi,
> > > > > >
> > > > > > Will MongoDB not fit this solution?
> > > > > >
> > > > > >
> > > > > >
> > > > > > From: Vova Shelgunov [mailto:vvs...@gmail.com]
> > > > > > Sent: Wednesday, March 15, 2017 11:51 PM
> > > > > > To: Muthu Jayakumar 
> > > > > > Cc: vincent gromakowski ; Richard 
> > > > > > Siebeling ; user ; 
> > > > > > Shiva Ramagopal 
> > > > > > Subject: Re: Fast write datastore...
> > > > > >
> > > > > > Hi Muthu,
> > > > > >
> > > > > > I did not catch from your message, what performance do you expect 
> > > > > > from subsequent queries?
> > > > > >
> > > > > > Regards,
> > > > > > Uladzimir
> > > > > >
> > > > > > On Mar 15, 2017 9:03 PM, "Muthu Jayakumar"  
> > > > > > wrote:
> > > > > > > Hello Uladzimir / Shiva,
> > > > > > >
> > > > > > > From the Elasticsearch documentation (I have to see the logical plan 
> > > > > > > of a query to confirm), the richness of filters (like regex, ...) 
> > > > > > > is pretty good compared to Cassandra. As for aggregates, I 
> > > > > > > think Spark DataFrames are rich enough to handle them.
> > > > > > > Let me know your thoughts.
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Muthu
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Mar 15, 2017 at 10:55 AM, vvshvv  wrote:
> > > > > > > > Hi muthu,
> > > > > > > >
> > > > > > > > I agree with Shiva, Cassandra also supports SASI indexes, which 
> > > > > > > > can partially replace Elasticsearch functionality.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Uladzimir
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Sent from my Mi phone
> > > > > > > > On Shiva Ramagopal , Mar 15, 2017 5:57 PM 
> > > > > > > > wrote:
> > > > > > > > > Probably Cassandra is a good choice if you are mainly looking 
> > > > > > > > > for a datastore that supports fast writes. You can ingest the 
> > > > > > > > > data into a table and define one or more materialized views 
> > > > > > > > > on top of it to support your queries. Since you mention that 