[ 
https://issues.apache.org/jira/browse/HUDI-340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17141510#comment-17141510
 ] 

Vinoth Chandar commented on HUDI-340:
-------------------------------------

> when the user sets the very Long.MAX_VALUE and Integer.MAX_VALUE by chance, 
> this looks like a lottery, very low probability.

I think the whole complexity is because of the default value of `sourceLimit` 
in HoodieDeltaStreamer 
{code:java}
public long sourceLimit = Long.MAX_VALUE;
{code}
So we would actually win this lottery quite often :)

All the code below seems to do is detect this case (my annotations are added 
inline as comments):
{code:java}
// VC: most likely this is going to be 5M, unless the user overrides.
long maxEventsToReadFromKafka =
    props.getLong(Config.MAX_EVENTS_FROM_KAFKA_SOURCE_PROP, Config.maxEventsFromKafkaSource);
// VC: this check will mostly return maxEventsToReadFromKafka? Seems unnecessary,
// given we already assigned it by checking the config above.
maxEventsToReadFromKafka =
    (maxEventsToReadFromKafka == Long.MAX_VALUE || maxEventsToReadFromKafka == Integer.MAX_VALUE)
        ? Config.maxEventsFromKafkaSource : maxEventsToReadFromKafka;
// VC: this line is trying to say: if you did not specify a source limit, we cap you
// at the maximum 5M limit; otherwise we use what you provided. Maybe the whole thing
// would be more readable if sourceLimit were an Option[Long]?
long numEvents = sourceLimit == Long.MAX_VALUE ? maxEventsToReadFromKafka : sourceLimit;
{code}
If the user does configure a very high value for `sourceLimit` that 
effectively scans the whole topic, IMO we should allow that and assume the user 
knows what they are doing.
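
To make the Option idea concrete, here is a minimal sketch, not the actual Hudi code. The class name `SourceLimitSketch`, the method `resolveNumEvents`, and the constant `DEFAULT_MAX_EVENTS_FROM_KAFKA` are hypothetical, and the 5M value is only assumed from the annotation above. An unset `sourceLimit` is modeled as an empty `Optional` instead of the `Long.MAX_VALUE` sentinel, so an explicitly configured high value is used as given.

```java
import java.util.Optional;

final class SourceLimitSketch {
    // Assumed default cap (the "5M" mentioned in the annotations); hypothetical constant.
    static final long DEFAULT_MAX_EVENTS_FROM_KAFKA = 5_000_000L;

    /**
     * Resolves how many events to read.
     * An explicit sourceLimit always wins, even a very large one.
     * Otherwise the configured Kafka cap applies, falling back to the default cap.
     */
    static long resolveNumEvents(Optional<Long> sourceLimit, Optional<Long> configuredMaxEvents) {
        long cap = configuredMaxEvents.orElse(DEFAULT_MAX_EVENTS_FROM_KAFKA);
        return sourceLimit.orElse(cap);
    }

    public static void main(String[] args) {
        // Nothing configured: the default cap applies.
        System.out.println(resolveNumEvents(Optional.empty(), Optional.empty()));
        // Only the Kafka cap is configured: it replaces the default.
        System.out.println(resolveNumEvents(Optional.empty(), Optional.of(1_000_000L)));
        // The user explicitly asks for everything: honored, no sentinel check needed.
        System.out.println(resolveNumEvents(Optional.of(Long.MAX_VALUE), Optional.empty()));
    }
}
```

With this shape there is no need to compare against `Long.MAX_VALUE` or `Integer.MAX_VALUE` at all, since "not specified" and "specified as a huge number" are distinct states.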

 

 

> Increase Default max events to read from kafka source
> -----------------------------------------------------
>
>                 Key: HUDI-340
>                 URL: https://issues.apache.org/jira/browse/HUDI-340
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: DeltaStreamer
>            Reporter: Pratyaksh Sharma
>            Assignee: Pratyaksh Sharma
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.5.1
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Right now, DEFAULT_MAX_EVENTS_TO_READ is set to 1M in case of kafka source in 
> KafkaOffsetGen.java class. DeltaStreamer can handle much more incoming 
> records than this. 
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
