[ https://issues.apache.org/jira/browse/HUDI-340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141510#comment-17141510 ]
Vinoth Chandar commented on HUDI-340:
-------------------------------------
> when the user sets the very Long.MAX_VALUE and Integer.MAX_VALUE by chance,
> this looks like a lottery, very low probability.
I think the whole complexity is because of the default value of `sourceLimit`
in HoodieDeltaStreamer
{code:java}
public long sourceLimit = Long.MAX_VALUE; {code}
So we would actually win this lottery quite often :)
All the code below seems to be doing is detecting this case (I added my
annotations/comments inline):
{code:java}
// VC: most likely this is going to be 5M, unless the user overrides.
long maxEventsToReadFromKafka =
    props.getLong(Config.MAX_EVENTS_FROM_KAFKA_SOURCE_PROP,
        Config.maxEventsFromKafkaSource);
// VC: this check will mostly return maxEventsToReadFromKafka? Seems
// unnecessary given we already assigned it by checking the config above?
maxEventsToReadFromKafka = (maxEventsToReadFromKafka == Long.MAX_VALUE ||
    maxEventsToReadFromKafka == Integer.MAX_VALUE)
    ? Config.maxEventsFromKafkaSource : maxEventsToReadFromKafka;
// VC: This line is trying to say: if you did not specify a source limit, we
// cap you at the maximum 5M limit; otherwise we use what you provided. Maybe
// the whole thing is more readable if it were an Option[Long] sourceLimit?
long numEvents = sourceLimit == Long.MAX_VALUE ? maxEventsToReadFromKafka :
    sourceLimit; {code}
If the user does configure a very high value for `sourceLimit` that
effectively scans the whole topic, IMO we should allow that and assume the
user knows what they are doing.
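To make the Option suggestion concrete, here is a minimal sketch of how an optional `sourceLimit` could replace the `Long.MAX_VALUE` sentinel. The class name `SourceLimitSketch`, the method `resolveNumEvents`, and the constant `DEFAULT_MAX_EVENTS` are hypothetical and not part of Hudi; the 5M cap is taken from the comment in the snippet above, and `java.util.OptionalLong` stands in for an `Option[Long]`.

```java
import java.util.OptionalLong;

// Hypothetical sketch; these names are illustrative and not Hudi's actual API.
final class SourceLimitSketch {
    // Assumed default cap of 5M events, per the annotation in the snippet above.
    static final long DEFAULT_MAX_EVENTS = 5_000_000L;

    // sourceLimit: empty when the user did not pass a limit at all.
    // configuredMax: the Kafka source cap, already defaulted by the caller.
    static long resolveNumEvents(OptionalLong sourceLimit, long configuredMax) {
        // An explicit limit always wins, even a huge one that scans the whole topic.
        return sourceLimit.orElse(configuredMax);
    }

    public static void main(String[] args) {
        // No limit given: fall back to the configured cap.
        System.out.println(resolveNumEvents(OptionalLong.empty(), DEFAULT_MAX_EVENTS));
        // Explicit Long.MAX_VALUE: respected, not mistaken for "unset".
        System.out.println(resolveNumEvents(OptionalLong.of(Long.MAX_VALUE), DEFAULT_MAX_EVENTS));
    }
}
```

With "unset" represented as an empty value, no comparison against `Long.MAX_VALUE` or `Integer.MAX_VALUE` is needed, and a user who deliberately asks for a very large limit gets exactly that.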
> Increase Default max events to read from kafka source
> -----------------------------------------------------
>
> Key: HUDI-340
> URL: https://issues.apache.org/jira/browse/HUDI-340
> Project: Apache Hudi
> Issue Type: Improvement
> Components: DeltaStreamer
> Reporter: Pratyaksh Sharma
> Assignee: Pratyaksh Sharma
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.5.1
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Right now, DEFAULT_MAX_EVENTS_TO_READ is set to 1M in case of kafka source in
> KafkaOffsetGen.java class. DeltaStreamer can handle much more incoming
> records than this.
>
>