https://issues.apache.org/jira/browse/HUDI-340 tracks this.

On Sun, Nov 17, 2019 at 6:00 PM Pratyaksh Sharma <[email protected]>
wrote:

> Yeah,
>
> Would love to do that. Will create a JIRA and raise a PR.
>
> On Fri, Nov 15, 2019 at 7:30 PM Vinoth Chandar <[email protected]> wrote:
>
>> Concurrent Writes :D
>>
>> The magic number 1M is from me actually :). And there is no magic; it was
>> picked to keep jobs from scanning Kafka in one huge batch, since the
>> source-limit default was Long.MAX_VALUE (for the DFS source). I acknowledge
>> you could go much larger.
>> Happy to take a PR to make this limit higher (say 10M) and only use it
>> when sourceLimit is infinity? Interested in contributing your change back?
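>>
>> A rough sketch of what that could look like (purely illustrative names and
>> values, not the actual KafkaOffsetGen code):
>>
>>   // Illustrative sketch: only fall back to a larger built-in cap when the
>>   // caller left sourceLimit at Long.MAX_VALUE, i.e. "infinity".
>>   static long resolveMaxEvents(long sourceLimit) {
>>     final long LARGER_DEFAULT = 10_000_000L; // e.g. 10M instead of 1M
>>     return sourceLimit == Long.MAX_VALUE ? LARGER_DEFAULT : sourceLimit;
>>   }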
>>
>> On Fri, Nov 15, 2019 at 5:20 AM Pratyaksh Sharma <[email protected]>
>> wrote:
>>
>> > Hi Nishith,
>> >
>> > I would like to know more about the reasonable payload size and topic
>> > throughput that you mentioned. :)
>> > Can you tell me a few numbers for these two parameters that went into
>> > deciding on the default value of 1000000?
>> >
>> > On Fri, Nov 15, 2019 at 5:32 PM Nishith <[email protected]> wrote:
>> >
>> > > Pratyaksh,
>> > >
>> > > The default value was chosen based on a “reasonable” payload size and
>> > > topic throughput.
>> > >
>> > > The number of messages vs executor/driver memory depends heavily on
>> > > your message size.
>> > > It is already a value that you can configure using “sourceLimit”, like
>> > > you’ve already tried.
>> > > Ideally, this number will be tuned by the user depending on the
>> > > resources that can be provided vs the desired ingestion latency.
>> > >
>> > > Sent from my iPhone
>> > >
>> > > > On Nov 15, 2019, at 5:00 PM, Pratyaksh Sharma <[email protected]>
>> > > > wrote:
>> > > >
>> > > > Hi,
>> > > >
>> > > > I have a small doubt. The KafkaOffsetGen.java class has a variable
>> > > > called DEFAULT_MAX_EVENTS_TO_READ, which is set to 1000000. When
>> > > > reading from Kafka, we take the minimum of sourceLimit and this
>> > > > variable to form the RDD in the case of KafkaSource.
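>> > > >
>> > > > Roughly, the behavior I am describing is something like this (a
>> > > > simplified sketch, not the exact code in KafkaOffsetGen.java):
>> > > >
>> > > >   // The number of events read per batch is capped by both the
>> > > >   // user-provided sourceLimit and the hard-coded default.
>> > > >   long sourceLimit = 5_000_000L;      // example value passed by the job
>> > > >   long defaultMaxEvents = 1_000_000L; // DEFAULT_MAX_EVENTS_TO_READ
>> > > >   long numEvents = Math.min(sourceLimit, defaultMaxEvents); // = 1000000
>> > > >   // numEvents then determines the Kafka offset ranges that back the
>> > > >   // RDD for this DeltaStreamer iteration.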
>> > > >
>> > > > I want to know the following -
>> > > >
>> > > > 1. How did we arrive at this number?
>> > > > 2. Why are we hard-coding it? Should we not make it configurable for
>> > > > users to play around with?
>> > > >
>> > > > For bootstrapping purposes, I tried running DeltaStreamer on a Kafka
>> > > > topic with 1.5 crore (15 million) events in continuous mode, with the
>> > > > following configuration -
>> > > >
>> > > > 1. Changed the above variable to Integer.MAX_VALUE.
>> > > > 2. Kept the source limit at 3500000 (35 lacs, i.e. 3.5 million).
>> > > > 3. executor-memory 4g
>> > > > 4. driver-memory 6g
>> > > >
>> > > > Basically, in my case the RDD had 35 lac (3.5 million) events in one
>> > > > iteration and it ran fine.
>> > > >
>> > > > When I tried running DeltaStreamer with a greater value of
>> > > > sourceLimit, I was getting OutOfMemory and heap memory errors. Keeping
>> > > > the limit at 35 lacs looks like a sweet spot for running DeltaStreamer.
>> > >
>> >
>>
>
