Thanks!

On Sun, Nov 17, 2019 at 4:59 AM Pratyaksh Sharma <[email protected]> wrote:
> https://issues.apache.org/jira/browse/HUDI-340 tracks this.
>
> On Sun, Nov 17, 2019 at 6:00 PM Pratyaksh Sharma <[email protected]> wrote:
>
> > Yeah,
> >
> > Would love to do that. Will create a jira and raise a PR.
> >
> > On Fri, Nov 15, 2019 at 7:30 PM Vinoth Chandar <[email protected]> wrote:
> >
> >> Concurrent writes :D
> >>
> >> The magic number 1M is from me actually :). And there is no magic; it was
> >> picked to keep jobs from batch-scanning Kafka, since the source-limit default
> >> was Long.MAX_VALUE (for the DFS source). I acknowledge you could go much larger.
> >> Happy to take a PR to make this limit higher (say 10M) and only use it
> >> when sourceLimit is infinity. Interested in contributing your change back?
> >>
> >> On Fri, Nov 15, 2019 at 5:20 AM Pratyaksh Sharma <[email protected]> wrote:
> >>
> >> > Hi Nishith,
> >> >
> >> > I would like to know more about the reasonable payload size and topic
> >> > throughput that you mentioned. :)
> >> > Can you share a few numbers around these two parameters that went into
> >> > deciding the default value of 1000000?
> >> >
> >> > On Fri, Nov 15, 2019 at 5:32 PM Nishith <[email protected]> wrote:
> >> >
> >> > > Pratyaksh,
> >> > >
> >> > > The default value was chosen based on a "reasonable" payload size and
> >> > > topic throughput.
> >> > >
> >> > > The number of messages vs executor/driver memory depends heavily on your
> >> > > message size.
> >> > > It is already a value that you can configure using "sourceLimit", as
> >> > > you've already tried.
> >> > > Ideally, this number will be tuned by the user depending on the
> >> > > resources that can be provided vs the desired ingestion latency.
> >> > >
> >> > > Sent from my iPhone
> >> > >
> >> > > > On Nov 15, 2019, at 5:00 PM, Pratyaksh Sharma <[email protected]> wrote:
> >> > > >
> >> > > > Hi,
> >> > > >
> >> > > > I have a small doubt. The KafkaOffsetGen.java class has a variable called
> >> > > > DEFAULT_MAX_EVENTS_TO_READ, which is set to 1000000. When actually reading
> >> > > > from Kafka, we take the minimum of sourceLimit and this variable to
> >> > > > form the RDD in the case of KafkaSource.
> >> > > >
> >> > > > I want to know the following -
> >> > > >
> >> > > > 1. How did we arrive at this number?
> >> > > > 2. Why are we hard-coding it? Should we not make it configurable for users
> >> > > > to play around with?
> >> > > >
> >> > > > For bootstrapping purposes, I tried running DeltaStreamer on a Kafka topic
> >> > > > with 1.5 crore (15 million) events, in continuous mode, with the following
> >> > > > configuration -
> >> > > >
> >> > > > 1. Changed the above variable to Integer.MAX_VALUE.
> >> > > > 2. Kept the source limit at 3500000 (35 lakhs, i.e. 3.5 million).
> >> > > > 3. executor-memory 4g
> >> > > > 4. driver-memory 6g
> >> > > >
> >> > > > Basically, in my case the RDD had 35 lakh events per iteration and
> >> > > > the job ran fine.
> >> > > >
> >> > > > If I tried running DeltaStreamer with a greater value of sourceLimit, I
> >> > > > was getting OutOfMemory and heap memory errors. Keeping it at 35 lakhs looks
> >> > > > like a sweet spot for running DeltaStreamer.
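[Editor's note] For readers following the thread, here is a minimal, hypothetical Java sketch of the two behaviours under discussion: the current logic that caps the Kafka batch at min(sourceLimit, DEFAULT_MAX_EVENTS_TO_READ = 1,000,000), and Vinoth's proposal to apply a larger cap (say 10M) only when sourceLimit is left at Long.MAX_VALUE. This is not Hudi's actual KafkaOffsetGen code; the class and method names below are illustrative only.

```java
// Hypothetical sketch of the batch-sizing logic discussed in this thread.
// Not the real KafkaOffsetGen implementation.
public class KafkaBatchSizeSketch {

    // Hard-coded default discussed in the thread.
    private static final long DEFAULT_MAX_EVENTS_TO_READ = 1_000_000L;

    // Larger cap proposed in the thread (10M), applied only when the caller
    // did not set an explicit sourceLimit.
    private static final long PROPOSED_MAX_EVENTS_TO_READ = 10_000_000L;

    /** Current behaviour: always take the minimum of sourceLimit and the default cap. */
    static long currentEventsToRead(long sourceLimit) {
        return Math.min(sourceLimit, DEFAULT_MAX_EVENTS_TO_READ);
    }

    /** Proposed behaviour: apply the (larger) cap only when sourceLimit is unbounded. */
    static long proposedEventsToRead(long sourceLimit) {
        return sourceLimit == Long.MAX_VALUE ? PROPOSED_MAX_EVENTS_TO_READ : sourceLimit;
    }

    public static void main(String[] args) {
        // With sourceLimit = 3,500,000 (the 3.5M "sweet spot" from the thread),
        // the current logic caps the batch at 1M, while the proposal honours the 3.5M.
        System.out.println(currentEventsToRead(3_500_000L));      // 1000000
        System.out.println(proposedEventsToRead(3_500_000L));     // 3500000
        System.out.println(proposedEventsToRead(Long.MAX_VALUE)); // 10000000
    }
}
```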
