Thanks!

On Sun, Nov 17, 2019 at 4:59 AM Pratyaksh Sharma <[email protected]>
wrote:

> https://issues.apache.org/jira/browse/HUDI-340 tracks this.
>
> On Sun, Nov 17, 2019 at 6:00 PM Pratyaksh Sharma <[email protected]>
> wrote:
>
> > Yeah,
> >
> > Would love to do that. Will create a jira and raise a PR.
> >
> > On Fri, Nov 15, 2019 at 7:30 PM Vinoth Chandar <[email protected]> wrote:
> >
> >> Concurrent Writes :D
> >>
> >> The magic number 1M is from me actually :). There is no magic; it was
> >> picked to keep jobs from batch-scanning Kafka, since the source-limit
> >> default was Long.MAX_VALUE (for the DFS source). I acknowledge you could
> >> go much larger.
> >> Happy to take a PR to make this limit higher (say 10M) and only use it
> >> when sourceLimit is infinity. Interested in contributing your change
> >> back?
> >>
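For context, a rough sketch of the change proposed above; the class, method,
and constant names here are illustrative only, not the actual Hudi code. The
idea is to apply a (larger) built-in cap only when the caller has left
sourceLimit at its default.

    class KafkaReadCapSketch {
      // proposed higher default, applied only when no explicit sourceLimit is given
      private static final long DEFAULT_MAX_EVENTS_TO_READ = 10_000_000L;

      static long eventsToRead(long sourceLimit) {
        // sourceLimit defaults to Long.MAX_VALUE ("infinity") when not configured
        return sourceLimit == Long.MAX_VALUE ? DEFAULT_MAX_EVENTS_TO_READ : sourceLimit;
      }
    }
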
> >> On Fri, Nov 15, 2019 at 5:20 AM Pratyaksh Sharma <[email protected]> wrote:
> >>
> >> > Hi Nishith,
> >> >
> >> > I would like to know more about the reasonable payload size and topic
> >> > throughput that you mentioned. :)
> >> > Can you share a few numbers around these two parameters that went into
> >> > deciding the default value of 1000000?
> >> >
> >> > On Fri, Nov 15, 2019 at 5:32 PM Nishith <[email protected]> wrote:
> >> >
> >> > > Pratyaksh,
> >> > >
> >> > > The default value was chosen based on a “reasonable” payload size
> >> > > and topic throughput.
> >> > >
> >> > > How many messages fit into a given executor/driver memory depends
> >> > > heavily on your message size.
> >> > > It is already a value that you can configure using “sourceLimit”,
> >> > > like you’ve already tried.
> >> > > Ideally, this number is tuned by the user based on the trade-off
> >> > > between the resources that can be provided and the desired
> >> > > ingestion latency.
> >> > >
> >> > > Sent from my iPhone
> >> > >
> >> > > > On Nov 15, 2019, at 5:00 PM, Pratyaksh Sharma <[email protected]> wrote:
> >> > > >
> >> > > > Hi,
> >> > > >
> >> > > > I have a small question. The KafkaOffsetGen.java class has a
> >> > > > variable called DEFAULT_MAX_EVENTS_TO_READ which is set to 1000000.
> >> > > > When reading from Kafka, we take the minimum of sourceLimit and
> >> > > > this variable to form the RDD in the case of KafkaSource.
> >> > > >
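A minimal sketch of the behavior described above, simplified rather than the
exact KafkaOffsetGen code: the number of events used to build the offset
ranges for one round is bounded by both sourceLimit and the hard-coded
constant.

    class CurrentBehaviorSketch {
      static final long DEFAULT_MAX_EVENTS_TO_READ = 1_000_000L; // the hard-coded default

      // events used to size the offset ranges / RDD for one round of ingestion
      static long maxEventsThisRound(long sourceLimit) {
        return Math.min(sourceLimit, DEFAULT_MAX_EVENTS_TO_READ);
      }
    }
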
> >> > > > I want to know the following -
> >> > > >
> >> > > > 1. How did we arrive at this number?
> >> > > > 2. Why are we hard-coding it? Should we not make it configurable
> >> > > > for users to play around with?
> >> > > >
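As a rough illustration of what making this configurable could look like; the
property key and helper below are purely hypothetical, not an existing Hudi
config.

    import java.util.Properties;

    class ConfigurableCapSketch {
      static final long DEFAULT_MAX_EVENTS_TO_READ = 1_000_000L;
      // illustrative key name, not an existing Hudi property
      static final String MAX_EVENTS_PROP = "hoodie.deltastreamer.kafka.source.maxEvents";

      static long maxEventsToRead(Properties props) {
        String value = props.getProperty(MAX_EVENTS_PROP);
        return value == null ? DEFAULT_MAX_EVENTS_TO_READ : Long.parseLong(value);
      }
    }
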
> >> > > > For bootstrapping purposes, I tried running DeltaStreamer on a
> >> > > > Kafka topic with 1.5 crore (15 million) events, with the following
> >> > > > configuration in continuous mode:
> >> > > >
> >> > > > 1. Changed the above variable to Integer.MAX_VALUE.
> >> > > > 2. Kept the source limit at 3500000 (35 lakhs, i.e. 3.5 million)
> >> > > > 3. executor-memory 4g
> >> > > > 4. driver-memory 6g
> >> > > >
> >> > > > Basically, in my case the RDD had 35 lakh events in one iteration
> >> > > > and it ran fine.
> >> > > >
> >> > > > If I tried running DeltaStreamer with a greater value of
> >> > > > sourceLimit, I was getting OutOfMemory and heap memory errors.
> >> > > > Keeping it at 35 lakhs looks like a sweet spot for running
> >> > > > DeltaStreamer.
> >> > >
> >> >
> >>
> >
>
