Pratyaksh,

The default value was chosen based on a “reasonable” payload size and topic 
throughput. 

How many messages fit within a given executor/driver memory budget depends 
heavily on your message size.
The number of events read per batch is already configurable via “sourceLimit”, 
as you’ve tried.
Ideally, users tune this number to trade off the resources they can provide 
against ingestion latency.

Sent from my iPhone

> On Nov 15, 2019, at 5:00 PM, Pratyaksh Sharma <pratyaks...@gmail.com> wrote:
> 
> Hi,
> 
> I have a small question. The KafkaOffsetGen.java class has a variable called
> DEFAULT_MAX_EVENTS_TO_READ, which is set to 1000000. When reading from Kafka
> via KafkaSource, we take the minimum of sourceLimit and this variable to
> size the RDD.
> 
> I want to know the following -
> 
> 1. How did we arrive at this number?
> 2. Why are we hard-coding it? Should we not make it configurable for users
> to experiment with?
> 
> For bootstrapping purposes, I tried running DeltaStreamer in continuous mode
> on a Kafka topic with 15 million (1.5 crore) events, with the following
> configuration -
> 
> 1. Changed the above variable to Integer.MAX_VALUE (see the sketch after
> this list).
> 2. Kept sourceLimit at 3500000 (3.5 million).
> 3. executor-memory 4g
> 4. driver-memory 6g
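> 
> To make (1) concrete, here is a minimal sketch of its effect on the
> effective batch size (illustrative variable names, not the actual patch):
> 
>     // With the hard-coded cap patched to Integer.MAX_VALUE, the min() no
>     // longer bites, so the batch size equals sourceLimit.
>     long cap = Integer.MAX_VALUE;                 // was DEFAULT_MAX_EVENTS_TO_READ
>     long sourceLimit = 3_500_000L;                // configured source limit
>     long batchSize = Math.min(sourceLimit, cap);  // 3_500_000 events per iteration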
> 
> In my case, the RDD had 3.5 million (35 lac) events per iteration and ran
> fine.
> 
> When I tried running DeltaStreamer with a larger sourceLimit, I got
> OutOfMemory (Java heap space) errors. 3.5 million looks like a sweet spot
> for running DeltaStreamer.
