Hi, I have a small doubt. The KafkaOffsetGen.java class has a variable called DEFAULT_MAX_EVENTS_TO_READ, which is set to 1,000,000. When actually reading from Kafka in the KafkaSource case, we take the minimum of sourceLimit and this variable to decide how many events go into the RDD.
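For context, the capping behaviour I'm referring to looks roughly like this (a simplified sketch, paraphrased from my reading of the class, not the exact Hudi source; the method name computeNumEvents is my own):

```java
// Simplified sketch of the capping logic in KafkaOffsetGen (paraphrased).
public class KafkaOffsetGenSketch {

  // Hard-coded ceiling on the number of events read per batch.
  public static final long DEFAULT_MAX_EVENTS_TO_READ = 1_000_000L;

  // Whatever the caller asks for via sourceLimit, the batch size is
  // silently capped at DEFAULT_MAX_EVENTS_TO_READ.
  static long computeNumEvents(long sourceLimit) {
    return Math.min(sourceLimit, DEFAULT_MAX_EVENTS_TO_READ);
  }
}
```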
I want to know the following:
1. How did we arrive at this number?
2. Why are we hard-coding it? Should we not make it configurable for users to play around with? (A rough sketch of what that could look like follows below.)

For bootstrapping purposes, I tried running DeltaStreamer in continuous mode on a Kafka topic with 15 million (1.5 crore) events, with the following configuration:
1. Changed the above variable to Integer.MAX_VALUE.
2. Kept the source limit at 3,500,000 (35 lakh).
3. executor-memory 4g
4. driver-memory 6g

In my case, the RDD held 3.5 million events per iteration and the job ran fine. When I tried running DeltaStreamer with a larger sourceLimit, I got OutOfMemory and heap memory errors. So 3.5 million looks like a sweet spot for running DeltaStreamer with this memory configuration.
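To make question 2 concrete, here is a minimal sketch of how the ceiling could be read from user-supplied properties instead of being hard-coded, falling back to the current default. The property key and class name here are hypothetical illustrations I made up for this sketch, not an existing Hudi config:

```java
import java.util.Properties;

// Hypothetical sketch: replace the hard-coded constant with a user-tunable
// property. MAX_EVENTS_PROP is an invented key, not an existing Hudi option.
public class ConfigurableKafkaLimit {

  static final String MAX_EVENTS_PROP = "hoodie.deltastreamer.kafka.source.maxEvents";
  static final long DEFAULT_MAX_EVENTS_TO_READ = 1_000_000L;

  // Resolve the ceiling from properties, defaulting to today's constant.
  static long maxEventsToRead(Properties props) {
    return Long.parseLong(
        props.getProperty(MAX_EVENTS_PROP, String.valueOf(DEFAULT_MAX_EVENTS_TO_READ)));
  }

  // Same min() capping as today, but with a configurable ceiling.
  static long numEventsForBatch(Properties props, long sourceLimit) {
    return Math.min(sourceLimit, maxEventsToRead(props));
  }
}
```

With something like this, a user could tune the ceiling per deployment (e.g. raise it for bootstrapping, lower it for small executors) without patching and rebuilding the jar, which is what I had to do in the experiment above.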
