Hi Pratyaksh,

I get what you mean. You are concerned that the upper cap on events read is always 1000000: users can lower the effective limit with sourceLimit, but they can never raise it above that cap. Since we take Math.min(1000000, sourceLimit), I think it makes sense to make the upper cap configurable instead of hard-coding it to the default of 1000000.
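
Something along these lines is what I have in mind. This is only a rough sketch under my own assumptions, not the actual KafkaOffsetGen code: the class name, the method resolveMaxEvents, and the property key "example.kafka.source.maxEvents" are made up for illustration. The idea is to read the cap from the job properties, fall back to the current default of 1000000 when it is not set, and still take the minimum with sourceLimit.

```java
import java.util.Properties;

class MaxEventsSketch {
  // Hypothetical property key, used only for this illustration.
  static final String MAX_EVENTS_PROP = "example.kafka.source.maxEvents";

  // Same value as the default discussed in this thread.
  static final long DEFAULT_MAX_EVENTS_TO_READ = 1000000L;

  // Returns the number of events to read in one batch:
  // the smaller of the configured cap and sourceLimit.
  static long resolveMaxEvents(Properties props, long sourceLimit) {
    long cap = Long.parseLong(
        props.getProperty(MAX_EVENTS_PROP, String.valueOf(DEFAULT_MAX_EVENTS_TO_READ)));
    return Math.min(cap, sourceLimit);
  }

  public static void main(String[] args) {
    Properties props = new Properties();

    // Cap not configured: the default of 1000000 still wins over sourceLimit.
    System.out.println(resolveMaxEvents(props, 3500000L)); // prints 1000000

    // Cap raised by the user: sourceLimit now becomes the effective limit.
    props.setProperty(MAX_EVENTS_PROP, "5000000");
    System.out.println(resolveMaxEvents(props, 3500000L)); // prints 3500000
  }
}
```

With the cap left unset, behaviour stays the same as today, so existing jobs would not be affected.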

@vinoth <[email protected]> what do you think?

Thanks,
Sudha

On Fri, Nov 15, 2019 at 5:20 AM Pratyaksh Sharma <[email protected]> wrote:

> Hi Nishith,
>
> I would like to know more about the reasonable payload size and topic
> throughput that you mentioned. :)
> Can you tell me few numbers around these two parameters which were involved
> in deciding default value as 1000000.
>
> On Fri, Nov 15, 2019 at 5:32 PM Nishith <[email protected]> wrote:
>
> > Pratyaksh,
> >
> > The default value was chosen based on a “reasonable” payload size and
> > topic throughput.
> >
> > The number of messages vs executor/driver memory highly depends on your
> > message size.
> > It is already a value that you can configure using “sourceLimit”, like
> > you’ve already tried.
> > Ideally, this number will be tuned by a user depending on the number of
> > resources that can be provided vs ingestion latency.
> >
> > Sent from my iPhone
> >
> > > On Nov 15, 2019, at 5:00 PM, Pratyaksh Sharma <[email protected]>
> > > wrote:
> > >
> > > Hi,
> > >
> > > I have a small doubt. KafkaOffsetGen.java class has a variable called
> > > DEFAULT_MAX_EVENTS_TO_READ which is set to 1000000. When actually reading
> > > from Kafka, we take the minimum of sourceLimit and this variable to
> > > actually form the RDD in case of KafkaSource.
> > >
> > > I want to know the following -
> > >
> > > 1. How did we arrive at this number?
> > > 2. Why are we hard-coding it? Should we not make it configurable for
> > > users to play around?
> > >
> > > For bootstrapping purpose, I tried running DeltaStreamer on a kafka topic
> > > with 1.5 crore events with the following configuration in continuous
> > > mode -
> > >
> > > 1. changed the above variable to Integer.MAX_VALUE.
> > > 2. Kept source limit as 3500000 (35 lacs)
> > > 3. executor-memory 4g
> > > 4. driver-memory 6g
> > >
> > > Basically in my case, the RDD was having 35 lac events in one iteration
> > > and it was able to run fine.
> > >
> > > If I try running deltaStreamer with a greater value of sourceLimit, then
> > > I was getting OutOfMemory and heap memory errors. Keeping 35 lacs looks
> > > like sort of a sweet spot to run DeltaStreamer.
