Thanks Chris for looking at this. I was putting data at roughly 50 records per batch at most. The issue was purely due to a bug in my persistence logic that was leaking memory.
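The leak described further down the thread (creating an async HTTP client inside an RDD `foreach` and never closing it) is a common pattern worth illustrating. Below is a minimal sketch of the fix: create one client per partition and close it in a `finally` block. The `Client` class here is a hypothetical stand-in for a real async HTTP client, so the sketch runs without Spark; with Spark you would pass `persist_partition` to `rdd.foreachPartition`.

```python
# Sketch of the leak and the usual fix. `Client` is a hypothetical stand-in
# for a real async HTTP client that holds resources (sockets, threads)
# until closed; `open_count` tracks clients that were never closed.

class Client:
    open_count = 0

    def __init__(self):
        Client.open_count += 1

    def post(self, record):
        pass  # pretend to send the record to the store

    def close(self):
        Client.open_count -= 1

# Leaky version (what the thread describes): a client per record via
# rdd.foreach, never closed, so resources accumulate until OOM:
#   rdd.foreach(lambda record: Client().post(record))

# Fixed version: one client per partition, always closed, suitable for
# rdd.foreachPartition(persist_partition):
def persist_partition(records):
    client = Client()
    try:
        for record in records:
            client.post(record)
    finally:
        client.close()

# Simulate two partitions locally, without Spark:
for partition in [["a", "b"], ["c"]]:
    persist_partition(iter(partition))

print(Client.open_count)  # 0: every client was closed
```

The per-partition client also amortizes connection setup across all records in the partition, which is the pattern the Spark Streaming docs recommend for external sinks.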
Overall, I haven't seen much lag with the Kinesis + Spark setup, and I am able to process records at roughly the same rate as data is fed into Kinesis, with acceptable latency.

Thanks,
Aniket

On Oct 31, 2014 1:15 AM, "Chris Fregly" <ch...@fregly.com> wrote:

> curious about why you're only seeing 50 records max per batch.
>
> how many receivers are you running? what is the rate that you're putting
> data onto the stream?
>
> per the default AWS kinesis configuration, the producer can do 1000 PUTs
> per second with max 50k bytes per PUT and max 1mb per second per shard.
>
> on the consumer side, you can only do 5 GETs per second and 2mb per second
> per shard.
>
> my hunch is that the 5 GETs per second is what's limiting your consumption
> rate.
>
> can you verify that these numbers match what you're seeing? if so, you
> may want to increase your shards and therefore the number of kinesis
> receivers.
>
> otherwise, this may require some further investigation on my part. i
> wanna stay on top of this if it's an issue.
>
> thanks for posting this, aniket!
>
> -chris
>
> On Fri, Sep 12, 2014 at 5:34 AM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>> Hi all
>>
>> Sorry but this was totally my mistake. In my persistence logic, I was
>> creating an async http client instance in RDD foreach but was never
>> closing it, leading to memory leaks.
>>
>> Apologies for wasting everyone's time.
>>
>> Thanks,
>> Aniket
>>
>> On 12 September 2014 02:20, Tathagata Das <tathagata.das1...@gmail.com>
>> wrote:
>>
>>> Which version of Spark are you running?
>>>
>>> If you are running the latest one, then you could try running not a
>>> window but a simple event count on every 2 second batch, and see if you
>>> are still running out of memory.
>>>
>>> TD
>>>
>>> On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar <
>>> aniket.bhatna...@gmail.com> wrote:
>>>
>>>> I did change it to 1 GB. It still ran out of memory, but a little
>>>> later.
>>>>
>>>> The streaming job isn't handling a lot of data. In every 2 seconds, it
>>>> doesn't get more than 50 records, and each record is no more than 500
>>>> bytes.
>>>>
>>>> On Sep 11, 2014 10:54 PM, "Bharat Venkat" <bvenkat.sp...@gmail.com>
>>>> wrote:
>>>>
>>>>> You could set "spark.executor.memory" to something bigger than the
>>>>> default (512 MB).
>>>>>
>>>>> On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar <
>>>>> aniket.bhatna...@gmail.com> wrote:
>>>>>
>>>>>> I am running a simple Spark Streaming program that pulls in data from
>>>>>> Kinesis at a batch interval of 10 seconds, windows it for 10 seconds,
>>>>>> maps the data, and persists it to a store.
>>>>>>
>>>>>> The program is running in local mode right now and runs out of memory
>>>>>> after a while. I have yet to investigate heap dumps, but I think
>>>>>> Spark isn't releasing memory after processing is complete. I have
>>>>>> even tried changing the storage level to disk only.
>>>>>>
>>>>>> Help!
>>>>>>
>>>>>> Thanks,
>>>>>> Aniket
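Chris's per-shard numbers above make it easy to sanity-check whether the workload in this thread could ever have been throttled by Kinesis. A quick back-of-the-envelope calculation, using only the limits and record sizes quoted in the thread (2 MB/sec read bandwidth per shard on the consumer side; 50 records of at most 500 bytes per 2-second batch):

```python
import math

# Workload as described in the thread:
records_per_batch = 50   # max records observed per batch
batch_seconds = 2        # batch interval during the OOM investigation
record_bytes = 500       # upper bound on record size

ingest_bytes_per_sec = records_per_batch * record_bytes / batch_seconds

# Kinesis per-shard consumer limit quoted by Chris: 2 MB/sec.
consumer_limit_bytes_per_sec = 2 * 1024 * 1024

# Fraction of a single shard's read bandwidth this workload uses:
utilization = ingest_bytes_per_sec / consumer_limit_bytes_per_sec
print(f"{ingest_bytes_per_sec:.0f} B/s is {utilization:.2%} of one shard's read limit")

# For a heavier workload, shard count scales with read bandwidth,
# e.g. a hypothetical 10 MB/sec consumer would need:
shards_needed = math.ceil((10 * 1024 * 1024) / consumer_limit_bytes_per_sec)
```

At roughly 12.5 KB/sec, the workload sits well under 1% of a single shard's read limit, which is consistent with the thread's conclusion that the out-of-memory failure was an application bug rather than Kinesis throttling. (Note: these are the 2014 limits quoted in the thread; current AWS quotas may differ.)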