Hi Nishith, I would like to know more about the "reasonable" payload size and topic throughput that you mentioned. :) Could you share a few numbers around these two parameters that went into deciding the default value of 1000000?
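
For context, here is a minimal standalone sketch of my understanding of the capping behavior (the real logic lives in KafkaOffsetGen.java; this class and method are only illustrative, not the actual Hudi code):

public class MaxEventsSketch {

  // Hard-coded cap from KafkaOffsetGen.java, as discussed below.
  private static final long DEFAULT_MAX_EVENTS_TO_READ = 1_000_000L;

  // Effective number of events pulled per batch: the smaller of the
  // user-supplied sourceLimit and the hard-coded default.
  // (Hypothetical helper; illustrative only.)
  static long eventsToRead(long sourceLimit) {
    return Math.min(sourceLimit, DEFAULT_MAX_EVENTS_TO_READ);
  }

  public static void main(String[] args) {
    System.out.println(eventsToRead(3_500_000L)); // 1000000: sourceLimit is clamped
    System.out.println(eventsToRead(500_000L));   // 500000: under the cap
  }
}

With the stock default, min(3500000, 1000000) = 1000000, which is why I had to raise DEFAULT_MAX_EVENTS_TO_READ to Integer.MAX_VALUE before my sourceLimit of 35 lacs could take effect.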
On Fri, Nov 15, 2019 at 5:32 PM Nishith <[email protected]> wrote:

> Pratyaksh,
>
> The default value was chosen based on a “reasonable” payload size and
> topic throughput.
>
> The number of messages vs. executor/driver memory depends heavily on
> your message size. It is already a value that you can configure using
> “sourceLimit”, as you have already tried. Ideally, this number will be
> tuned by the user depending on the resources that can be provided vs.
> the acceptable ingestion latency.
>
> Sent from my iPhone
>
> > On Nov 15, 2019, at 5:00 PM, Pratyaksh Sharma <[email protected]>
> > wrote:
> >
> > Hi,
> >
> > I have a small doubt. The KafkaOffsetGen.java class has a variable
> > called DEFAULT_MAX_EVENTS_TO_READ, which is set to 1000000. When
> > reading from Kafka, we take the minimum of sourceLimit and this
> > variable to form the RDD in the case of KafkaSource.
> >
> > I want to know the following:
> >
> > 1. How did we arrive at this number?
> > 2. Why are we hard-coding it? Should we not make it configurable for
> > users to play around with?
> >
> > For bootstrapping purposes, I tried running DeltaStreamer in
> > continuous mode on a Kafka topic with 1.5 crore (15 million) events,
> > with the following configuration:
> >
> > 1. Changed the above variable to Integer.MAX_VALUE.
> > 2. Kept sourceLimit at 3500000 (35 lacs, i.e. 3.5 million).
> > 3. executor-memory 4g
> > 4. driver-memory 6g
> >
> > In my case, the RDD had 3.5 million events per iteration and ran fine.
> >
> > If I run DeltaStreamer with a greater value of sourceLimit, I get
> > OutOfMemory and heap memory errors. 3.5 million looks like a sweet
> > spot for running DeltaStreamer.
