Hi,

I have a small question. The KafkaOffsetGen.java class has a variable called
DEFAULT_MAX_EVENTS_TO_READ, which is set to 1,000,000. When actually reading
from Kafka via KafkaSource, we take the minimum of sourceLimit and this
variable to decide how many events go into the RDD.
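
For reference, this is roughly how I read the capping logic (a minimal sketch, not the actual Hudi code; the class and method names below are made up for illustration):

public class KafkaReadCapSketch {
    // The hard-coded default that caps each batch, per KafkaOffsetGen.
    private static final long DEFAULT_MAX_EVENTS_TO_READ = 1_000_000L;

    // Effective number of events pulled for one batch: whatever sourceLimit
    // the caller passes is still capped by the constant above.
    static long effectiveLimit(long sourceLimit) {
        return Math.min(sourceLimit, DEFAULT_MAX_EVENTS_TO_READ);
    }

    public static void main(String[] args) {
        // With a sourceLimit of 3,500,000, only 1,000,000 events would be
        // read per batch unless the constant itself is raised.
        System.out.println(effectiveLimit(3_500_000L)); // 1000000
    }
}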

I want to know the following -

1. How did we arrive at this number?
2. Why is it hard-coded? Should we not make it configurable so users can
tune it?

For bootstrapping purposes, I tried running DeltaStreamer in continuous mode
on a Kafka topic with about 15 million events, with the following
configuration (a rough sample invocation follows the list) -

1. Changed the above variable to Integer.MAX_VALUE.
2. Kept sourceLimit at 3,500,000 (3.5 million).
3. executor-memory 4g
4. driver-memory 6g
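
A rough shape of the spark-submit invocation (the jar, paths, property file, and table names are placeholders, and flag names may vary slightly across Hudi/DeltaStreamer versions):

spark-submit \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --executor-memory 4g \
  --driver-memory 6g \
  /path/to/hudi-utilities-bundle.jar \
  --props /path/to/kafka-source.properties \
  --source-class org.apache.hudi.utilities.sources.JsonKafkaSource \
  --source-limit 3500000 \
  --target-base-path /path/to/target/table \
  --target-table my_table \
  --continuous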

In my case, each iteration's RDD had 3.5 million events and the job ran
fine.

If I ran DeltaStreamer with a larger sourceLimit, I got OutOfMemoryError
(Java heap space) failures. Around 3.5 million events per batch looks like
the sweet spot for running DeltaStreamer on this setup.
