[ https://issues.apache.org/jira/browse/HUDI-7390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Danny Chen updated HUDI-7390: ----------------------------- Fix Version/s: 0.16.0 1.0.0 (was: 0.15.0) > [Regression] HoodieStreamer no longer works without --props being supplied > -------------------------------------------------------------------------- > > Key: HUDI-7390 > URL: https://issues.apache.org/jira/browse/HUDI-7390 > Project: Apache Hudi > Issue Type: Bug > Components: deltastreamer > Affects Versions: 1.0.0-beta1, 0.14.1 > Reporter: Brandon Dahler > Assignee: Vova Kolmakov > Priority: Major > Labels: pull-request-available > Fix For: 0.16.0, 1.0.0 > > Attachments: spark.log > > > h2. Problem > When attempting to run HoodieStreamer without a props file, specifying all > required extra configuration via {{--hoodie-conf}} parameters, the execution > fails and an exception is thrown: > {code:java} > 24/02/06 22:15:13 INFO SparkContext: Successfully stopped SparkContext > Exception in thread "main" org.apache.hudi.exception.HoodieIOException: > Cannot read properties from dfs from file > file:/private/tmp/hudi-props-repro/src/test/resources/streamer-config/dfs-source.properties > at > org.apache.hudi.common.config.DFSPropertiesConfiguration.addPropsFromFile(DFSPropertiesConfiguration.java:166) > at > org.apache.hudi.common.config.DFSPropertiesConfiguration.<init>(DFSPropertiesConfiguration.java:85) > at > org.apache.hudi.utilities.UtilHelpers.readConfig(UtilHelpers.java:232) > at > org.apache.hudi.utilities.streamer.HoodieStreamer$Config.getProps(HoodieStreamer.java:437) > at > org.apache.hudi.utilities.streamer.StreamSync.getDeducedSchemaProvider(StreamSync.java:656) > at > org.apache.hudi.utilities.streamer.StreamSync.fetchNextBatchFromSource(StreamSync.java:632) > at > org.apache.hudi.utilities.streamer.StreamSync.fetchFromSourceAndPrepareRecords(StreamSync.java:525) > at > org.apache.hudi.utilities.streamer.StreamSync.readFromSource(StreamSync.java:498) > at > org.apache.hudi.utilities.streamer.StreamSync.syncOnce(StreamSync.java:404) > at > org.apache.hudi.utilities.streamer.HoodieStreamer$StreamSyncService.ingestOnce(HoodieStreamer.java:850) > at > org.apache.hudi.utilities.ingestion.HoodieIngestionService.startIngestion(HoodieIngestionService.java:72) > at org.apache.hudi.common.util.Option.ifPresent(Option.java:97) > at > org.apache.hudi.utilities.streamer.HoodieStreamer.sync(HoodieStreamer.java:207) > at > org.apache.hudi.utilities.streamer.HoodieStreamer.main(HoodieStreamer.java:592) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.base/java.lang.reflect.Method.invoke(Method.java:568) > at > org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) > at > org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1020) > at > org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:192) > at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:215) > at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:91) > at > org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1111) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1120) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) > Caused by: java.io.FileNotFoundException: File > file:/private/tmp/hudi-props-repro/src/test/resources/streamer-config/dfs-source.properties > does not exist > at > org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:779) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:1100) > at > org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:769) > at > org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:462) > at > org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:160) > at > org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:372) > at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:976) > at > org.apache.hudi.common.config.DFSPropertiesConfiguration.addPropsFromFile(DFSPropertiesConfiguration.java:161) > ... 25 more {code} > h2. Reproduction Steps > 1. Setup clean spark install > {code:java} > mkdir /tmp/hudi-props-repro > cd /tmp/hudi-props-repro > tar -xvzf ~/spark-3.4.2-bin-hadoop3.tgz{code} > 2. Copy the schema file from the docker demo > {code:java} > wget > https://raw.githubusercontent.com/apache/hudi/release-0.14.1/docker/demo/config/schema.avsc > {code} > 3. Copy data file from the docker demo > {code:java} > mkdir data > cd data > wget > https://raw.githubusercontent.com/apache/hudi/release-0.14.1/docker/demo/data/batch_1.json > > cd .. {code} > 4. Run HoodieStreamer > {code:java} > spark-3.4.2-bin-hadoop3/bin/spark-submit \ > --packages > org.apache.hudi:hudi-utilities-slim-bundle_2.12:0.14.1,org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.1 > \ > --conf spark.kryoserializer.buffer.max=200m \ > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \ > --class org.apache.hudi.utilities.streamer.HoodieStreamer \ > spark-3.4.2-bin-hadoop3/examples/jars/spark-examples_2.12-3.4.2.jar \ > --table-type COPY_ON_WRITE \ > --source-class org.apache.hudi.utilities.sources.JsonDFSSource \ > --target-base-path /tmp/hudi-props-repro/table \ > --target-table table \ > --hoodie-conf hoodie.datasource.write.recordkey.field=key \ > --hoodie-conf hoodie.datasource.write.partitionpath.field=date \ > --hoodie-conf hoodie.table.recordkey.fields=key \ > --hoodie-conf hoodie.table.partition.fields=date \ > --hoodie-conf > hoodie.streamer.schemaprovider.source.schema.file=/tmp/hudi-props-repro/schema.avsc > \ > --hoodie-conf > hoodie.streamer.schemaprovider.target.schema.file=/tmp/hudi-props-repro/schema.avsc > \ > --hoodie-conf hoodie.streamer.source.dfs.root=/tmp/hudi-props-repro/data \ > --schemaprovider-class > org.apache.hudi.utilities.schema.FilebasedSchemaProvider > {code} > h3. Expected Results > Command runs successfully, data is ingested successfully into > /{{{}tmp/hudi-decimal-repro/table{}}}, some files exist under > {{{}/tmp/hudi-decimal-repro/table/2018/08/31/{}}}. > h3. Actual Results > Command fails with exception, no data is ingsted into the table. The table's > directory is initialized but no commits exist. > Logs of the attempted run are attached as spark.log > h2. Additional Information > This issue does not appear to exist in versions 0.12.2 through 0.14.0 based > on my own testing. It does affect both the 0.14.1 and 1.0.0-beta1 releases. > h3. Known Workarounds > You should be able to pass {{--props}} referencing either an empty file or > even {{/dev/null}} to work around this issue. Passing an empty string or a > reference to a non-existent file will *not* work. -- This message was sent by Atlassian Jira (v8.20.10#820010)