[ https://issues.apache.org/jira/browse/HUDI-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173348#comment-17173348 ]
Balaji Varadarajan edited comment on HUDI-1146 at 8/7/20, 5:25 PM: ------------------------------------------------------------------- [~bdscheller]: I think InputBatch::getSchemaProvider will be called irrespective of whether input batch is empty or not. I am suspecting this to be similar to HUDI-1091 where an empty input batch is triggering this case. Can you try this change ? {code:java} --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java @@ -321,7 +321,7 @@ public class DeltaSync implements Serializable { .map(r -> (SchemaProvider) new DelegatingSchemaProvider(props, jssc, dataAndCheckpoint.getSchemaProvider(), new RowBasedSchemaProvider(r.schema()))) - .orElse(dataAndCheckpoint.getSchemaProvider()); + .orElseGet((dataAndCheckpoint::getSchemaProvider)); avroRDDOptional = transformed .map(t -> AvroConversionUtils.createRdd( t, HOODIE_RECORD_STRUCT_NAME, HOODIE_RECORD_NAMESPACE).toJavaRDD()); {code} was (Author: vbalaji): [~bdscheller]: I think InputBatch::getSchemaProvider will be called irrespective of whether input batch is empty or not. I am suspecting this to be similar to HUDI-1091 where an empty input batch is triggering this case. Can you try this change ? {code:java} --- a/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java +++ b/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java @@ -321,7 +321,7 @@ public class DeltaSync implements Serializable { .map(r -> (SchemaProvider) new DelegatingSchemaProvider(props, jssc, dataAndCheckpoint.getSchemaProvider(), new RowBasedSchemaProvider(r.schema()))) - .orElse(dataAndCheckpoint.getSchemaProvider()); + .orElseGet((dataAndCheckpoint::getSchemaProvider)); avroRDDOptional = transformed .map(t -> AvroConversionUtils.createRdd( t, HOODIE_RECORD_STRUCT_NAME, HOODIE_RECORD_NAMESPACE).toJavaRDD()); {code} > DeltaStreamer fails to start when No updated records + schemaProvider not > supplied > ---------------------------------------------------------------------------------- > > Key: HUDI-1146 > URL: https://issues.apache.org/jira/browse/HUDI-1146 > Project: Apache Hudi > Issue Type: Bug > Components: Hive Integration > Reporter: Brandon Scheller > Priority: Major > > DeltaStreamer issue — happens with both COW or MOR - Restarting the > DeltaStreamer Process crashes, that is, 2nd Run does nothing. > Steps: > Run Hudi DeltaStreamer job in --continuous mode > Run the same job again without deleting the output parquet files generated > due to step above > 2nd run crashes with the below error ( it does not crash if we delete the > output parquet file) > {{Caused by: org.apache.hudi.exception.HoodieException: Please provide a > valid schema provider class!}} > {{ at > org.apache.hudi.utilities.sources.InputBatch.getSchemaProvider(InputBatch.java:53)}} > {{ at > org.apache.hudi.utilities.deltastreamer.DeltaSync.readFromSource(DeltaSync.java:312)}} > {{ at > org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:226)}} > {{ at > org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.lambda$startService$0(HoodieDeltaStreamer.java:392)}} > > {{This looks to be because of this line:}} > {{[https://github.com/apache/hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java#L315] > }} > The "orElse" block here doesn't seem to make sense as if "transformed" is > empty then it is likely "dataAndCheckpoint" will have a null schema provider -- This message was sent by Atlassian Jira (v8.3.4#803005)