Concurrent Writes :D

The magic number 1M is actually from me :) and there is no magic: it was
picked to keep jobs from batch-scanning Kafka, since the sourceLimit default
was Long.MAX_VALUE (for the DFS source). I acknowledge you could go much
larger.
Happy to take a PR to make this limit higher (say 10M) and only use it
when sourceLimit is infinity. Interested in contributing your change back?
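
The proposal above (raise the cap and apply it only when sourceLimit was left
at its infinite default) could be sketched roughly as follows. This is a
hypothetical illustration, not the actual KafkaOffsetGen code; the constant
value and method name are assumptions for the sketch:

```java
class MaxEventsSketch {

  // Hypothetical raised cap (10M), as suggested in the reply above.
  static final long DEFAULT_MAX_EVENTS_TO_READ = 10_000_000L;

  /**
   * Apply the hard cap only when sourceLimit was left at its
   * "infinite" default (Long.MAX_VALUE); otherwise trust the user.
   */
  static long effectiveMaxEvents(long sourceLimit) {
    if (sourceLimit == Long.MAX_VALUE) {
      return DEFAULT_MAX_EVENTS_TO_READ;
    }
    return sourceLimit;
  }

  public static void main(String[] args) {
    // An explicit sourceLimit wins, even above the old 1M cap.
    System.out.println(effectiveMaxEvents(3_500_000L));   // 3500000
    // The unset (infinite) default falls back to the cap.
    System.out.println(effectiveMaxEvents(Long.MAX_VALUE)); // 10000000
  }
}
```

With this shape, a user who explicitly tunes sourceLimit is never silently
clamped, while jobs that never set it still get a sane batch bound.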

On Fri, Nov 15, 2019 at 5:20 AM Pratyaksh Sharma <[email protected]>
wrote:

> Hi Nishith,
>
> I would like to know more about the "reasonable" payload size and topic
> throughput that you mentioned. :)
> Could you share a few numbers around these two parameters that went into
> deciding the default value of 1000000?
>
> On Fri, Nov 15, 2019 at 5:32 PM Nishith <[email protected]> wrote:
>
> > Pratyaksh,
> >
> > The default value was chosen based on a “reasonable” payload size and
> > topic throughput.
> >
> > The number of messages vs executor/driver memory highly depends on your
> > message size.
> > It is already a value that you can configure using “sourceLimit”, like
> > you’ve already tried.
> > Ideally, this number will be tuned by a user depending on the number of
> > resources that can be provided vs ingestion latency.
> >
> > Sent from my iPhone
> >
> > > On Nov 15, 2019, at 5:00 PM, Pratyaksh Sharma <[email protected]>
> > wrote:
> > >
> > > Hi,
> > >
> > > I have a small doubt. The KafkaOffsetGen.java class has a variable called
> > > DEFAULT_MAX_EVENTS_TO_READ, which is set to 1000000. When actually
> > > reading from Kafka, we take the minimum of sourceLimit and this
> > > variable to form the RDD in the case of KafkaSource.
> > >
> > > I want to know the following:
> > >
> > > 1. How did we arrive at this number?
> > > 2. Why are we hard-coding it? Should we not make it configurable for
> > > users to play around with?
> > >
> > > For bootstrapping purposes, I tried running DeltaStreamer in continuous
> > > mode on a Kafka topic with 1.5 crore (15 million) events, with the
> > > following configuration:
> > >
> > > 1. Changed the above variable to Integer.MAX_VALUE.
> > > 2. Kept the source limit at 3500000 (35 lakh, i.e. 3.5 million).
> > > 3. executor-memory 4g
> > > 4. driver-memory 6g
> > >
> > > Basically, in my case the RDD had 35 lakh events per iteration and
> > > it ran fine.
> > >
> > > If I tried running DeltaStreamer with a greater value of sourceLimit,
> > > I got OutOfMemory and heap space errors. Keeping 35 lakh looks like
> > > sort of a sweet spot for running DeltaStreamer.
> >
>
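
The current behavior described in the quoted thread (the effective batch size
is the minimum of sourceLimit and DEFAULT_MAX_EVENTS_TO_READ) explains why the
hard-coded constant had to be raised before a sourceLimit of 3.5M could take
effect. A minimal sketch of that clamping, with hypothetical names rather than
the actual Hudi code:

```java
class CurrentCapSketch {

  // The hard-coded cap discussed in the thread.
  static final long DEFAULT_MAX_EVENTS_TO_READ = 1_000_000L;

  /** Current behavior: the effective read size is min(sourceLimit, cap). */
  static long numEventsToRead(long sourceLimit) {
    return Math.min(sourceLimit, DEFAULT_MAX_EVENTS_TO_READ);
  }

  public static void main(String[] args) {
    System.out.println(numEventsToRead(3_500_000L));     // 1000000: the cap wins
    System.out.println(numEventsToRead(Long.MAX_VALUE)); // 1000000
    System.out.println(numEventsToRead(500_000L));       // 500000: below the cap
  }
}
```

So with the stock constant, any sourceLimit above 1M is silently clamped,
which is exactly the situation Pratyaksh worked around by patching the
variable to Integer.MAX_VALUE.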
