Hi Nishith,

I would like to know more about the reasonable payload size and topic
throughput that you mentioned. :)
Could you share a few numbers around these two parameters that went into
deciding the default value of 1000000?
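
For reference, here is my understanding of the logic in question, as a
simplified sketch (only DEFAULT_MAX_EVENTS_TO_READ, sourceLimit and the
value 1000000 come from the code/thread; the other names are mine, and this
is not the exact Hudi code):

    // Simplified sketch of the cap KafkaOffsetGen applies when forming
    // the RDD for KafkaSource; not the exact Hudi code.
    public static final long DEFAULT_MAX_EVENTS_TO_READ = 1000000;

    long eventsToRead = Math.min(sourceLimit, DEFAULT_MAX_EVENTS_TO_READ);
    // eventsToRead then bounds the Kafka offset ranges that back the RDD.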

On Fri, Nov 15, 2019 at 5:32 PM Nishith <[email protected]> wrote:

> Pratyaksh,
>
> The default value was chosen based on a “reasonable” payload size and
> topic throughput.
>
> The number of messages vs executor/driver memory highly depends on your
> message size.
> It is already a value that you can configure using “sourceLimit”, as
> you’ve already tried.
> Ideally, this number should be tuned by the user, trading off the resources
> that can be provided against the desired ingestion latency.
>
> Sent from my iPhone
>
> > On Nov 15, 2019, at 5:00 PM, Pratyaksh Sharma <[email protected]> wrote:
> >
> > Hi,
> >
> > I have a small question. The KafkaOffsetGen.java class has a variable
> > called DEFAULT_MAX_EVENTS_TO_READ, which is set to 1000000. When actually
> > reading from Kafka, we take the minimum of sourceLimit and this variable
> > to form the RDD in the case of KafkaSource.
> >
> > I want to know the following -
> >
> > 1. How did we arrive at this number?
> > 2. Why are we hard-coding it? Should we not make it configurable for
> > users to play around with?
> >
> > For bootstrapping purposes, I tried running DeltaStreamer in continuous
> > mode on a Kafka topic with 1.5 crore (15 million) events, with the
> > following configuration -
> >
> > 1. Changed the above variable to Integer.MAX_VALUE.
> > 2. Kept the source limit at 3500000 (35 lakhs, i.e. 3.5 million).
> > 3. executor-memory 4g
> > 4. driver-memory 6g
> >
> > Basically, in my case the RDD had 35 lakh (3.5 million) events in one
> > iteration and it ran fine.
> >
> > When I tried running DeltaStreamer with a greater value of sourceLimit, I
> > got OutOfMemory and heap space errors. Keeping it at 3.5 million looks
> > like a sweet spot for running DeltaStreamer.
>

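P.S. To make question 2 concrete, what I have in mind is replacing the
hard-coded constant with a property lookup, along these lines (the property
key below is invented purely for illustration, and I am assuming a
TypedProperties-style getLong(key, default) helper):

    // Hypothetical sketch: read the cap from properties instead of
    // hard-coding it. The key name here is made up for illustration.
    long maxEventsToRead = props.getLong(
        "hoodie.deltastreamer.kafka.source.maxEvents",
        DEFAULT_MAX_EVENTS_TO_READ);
    long eventsToRead = Math.min(sourceLimit, maxEventsToRead);

That would let users tune the cap together with executor/driver memory
instead of recompiling.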