Thanks, Chris, for looking at this. I was putting data in at roughly that
same rate, 50 records per batch at most. The issue was purely down to a bug
in my persistence logic that was leaking memory.
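For anyone hitting the same symptom, here is a hypothetical sketch of the
anti-pattern and the fix (FakeHttpClient and both function names are
illustrative stand-ins, not my actual code): create the client once per
partition and close it in a finally block, instead of creating one per record
and never closing it.

```python
# Hypothetical sketch of the leak and its fix; FakeHttpClient stands in
# for a real async HTTP client and is not from the actual job.

class FakeHttpClient:
    open_count = 0  # tracks clients that were opened but never closed

    def __init__(self):
        FakeHttpClient.open_count += 1

    def post(self, record):
        pass  # pretend to persist the record

    def close(self):
        FakeHttpClient.open_count -= 1


def persist_partition_leaky(records):
    # Anti-pattern: a new client per record, never closed.
    for record in records:
        FakeHttpClient().post(record)


def persist_partition_fixed(records):
    # Fix: one client per partition, always closed, even on error.
    client = FakeHttpClient()
    try:
        for record in records:
            client.post(record)
    finally:
        client.close()
```

In Spark terms, the fixed version corresponds to opening the client once per
partition (e.g. inside foreachPartition) and closing it when the partition is
done, so every client that is opened is also closed.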

Overall, I haven't seen much lag with the Kinesis + Spark setup, and I am
able to process records at roughly the same rate as data is fed into
Kinesis, with acceptable latency.
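As a quick sanity check against the per-shard limits Chris quotes below (all
numbers are taken from this thread; AWS limits may have changed since), the
observed rate is nowhere near the consumer cap:

```python
# Numbers from this thread: <= 50 records per 2-second batch,
# <= 500 bytes per record, 2 MB/s consumer limit per shard.
records_per_batch = 50
record_size_bytes = 500
batch_seconds = 2

observed_bytes_per_sec = records_per_batch * record_size_bytes / batch_seconds
consumer_limit_bytes_per_sec = 2 * 1024 * 1024  # 2 MB/s per shard

print(observed_bytes_per_sec)  # 12500.0, far below the 2 MB/s shard limit
```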

Thanks,
Aniket
On Oct 31, 2014 1:15 AM, "Chris Fregly" <ch...@fregly.com> wrote:

> curious about why you're only seeing 50 records max per batch.
>
> how many receivers are you running?  what is the rate that you're putting
> data onto the stream?
>
> per the default AWS kinesis configuration, the producer can do 1000 PUTs
> per second with max 50k bytes per PUT and max 1mb per second per shard.
>
> on the consumer side, you can only do 5 GETs per second and 2mb per second
> per shard.
>
> my hunch is that the 5 GETs per second is what's limiting your consumption
> rate.
>
> can you verify that these numbers match what you're seeing?  if so, you
> may want to increase your shards and therefore the number of kinesis
> receivers.
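The sizing suggestion above can be sketched as follows. Only the 2 MB/s
per-shard consumer limit comes from this thread; the 10 MB/s target is a
made-up example, and this assumes read throughput scales linearly with
shard count.

```python
import math

# Hypothetical target rate; 2 MB/s per-shard consumer limit is from
# the thread. Assumes throughput scales linearly with shard count.
target_mb_per_sec = 10
per_shard_mb_per_sec = 2

shards_needed = math.ceil(target_mb_per_sec / per_shard_mb_per_sec)
print(shards_needed)  # 5
```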
>
> otherwise, this may require some further investigation on my part.  i
> wanna stay on top of this if it's an issue.
>
> thanks for posting this, aniket!
>
> -chris
>
> On Fri, Sep 12, 2014 at 5:34 AM, Aniket Bhatnagar <
> aniket.bhatna...@gmail.com> wrote:
>
>> Hi all
>>
>> Sorry, but this was totally my mistake. In my persistence logic, I was
>> creating an async HTTP client instance in RDD foreach but never closing
>> it, leading to memory leaks.
>>
>> Apologies for wasting everyone's time.
>>
>> Thanks,
>> Aniket
>>
>> On 12 September 2014 02:20, Tathagata Das <tathagata.das1...@gmail.com>
>> wrote:
>>
>>> Which version of Spark are you running?
>>>
>>> If you are running the latest one, could you try running not a window
>>> but a simple event count on every 2-second batch, and see if you still
>>> run out of memory?
>>>
>>> TD
>>>
>>>
>>> On Thu, Sep 11, 2014 at 10:34 AM, Aniket Bhatnagar <
>>> aniket.bhatna...@gmail.com> wrote:
>>>
>>>> I did change it to 1 GB. It still ran out of memory, just a little
>>>> later.
>>>>
>>>> The streaming job isn't handling a lot of data. Every 2 seconds, it
>>>> gets no more than 50 records, and each record is no more than 500
>>>> bytes.
>>>>  On Sep 11, 2014 10:54 PM, "Bharat Venkat" <bvenkat.sp...@gmail.com>
>>>> wrote:
>>>>
>>>>> You could set "spark.executor.memory" to something bigger than the
>>>>> default (512 MB).
>>>>>
>>>>>
>>>>> On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar <
>>>>> aniket.bhatna...@gmail.com> wrote:
>>>>>
>>>>>> I am running a simple Spark Streaming program that pulls in data from
>>>>>> Kinesis at a batch interval of 10 seconds, windows it for 10 seconds,
>>>>>> maps the data, and persists it to a store.
>>>>>>
>>>>>> The program is running in local mode right now and runs out of memory
>>>>>> after a while. I have yet to investigate heap dumps, but I suspect
>>>>>> Spark isn't releasing memory after processing completes. I have even
>>>>>> tried changing the storage level to disk-only.
>>>>>>
>>>>>> Help!
>>>>>>
>>>>>> Thanks,
>>>>>> Aniket
>>>>>>
>>>>>
>>>>>
>>>
>>
>
