Good day,

I am having difficulty getting a HoodieMultiTableDeltaStreamer
application to run successfully via spark-submit on an AWS EMR cluster with
the following versions:

Hudi release: 0.7.0

Release label: emr-6.3.0
Hadoop distribution: Amazon 3.2.1
Applications: Tez 0.9.2, Spark 3.1.1, Hive 3.1.2, Presto 0.245.1

The full error follows below, but the gist of the problem appears to be
a missing forward slash between the bucket name and the file name when a
Path object is initialized.

After the error stack trace I include the spark-submit command as well as
the properties file I am using. Any help would be greatly appreciated!

Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: s3://{{ bucket_name }}{{ hive_database }}.{{ hive_table }}.properties
        at org.apache.hadoop.fs.Path.initialize(Path.java:263)
        at org.apache.hadoop.fs.Path.<init>(Path.java:161)
        at org.apache.hadoop.fs.Path.<init>(Path.java:119)
        at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.checkIfTableConfigFileExists(HoodieMultiTableDeltaStreamer.java:99)
        at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.populateTableExecutionContextList(HoodieMultiTableDeltaStreamer.java:116)
        at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.<init>(HoodieMultiTableDeltaStreamer.java:80)
        at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.main(HoodieMultiTableDeltaStreamer.java:203)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
        at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:959)
        at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
        at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
        at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
        at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1038)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1047)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.net.URISyntaxException: Relative path in absolute URI: s3://{{ bucket_name }}{{ hive_database }}.{{ hive_table }}.properties
        at java.net.URI.checkPath(URI.java:1823)
        at java.net.URI.<init>(URI.java:745)
        at org.apache.hadoop.fs.Path.initialize(Path.java:260)
        ... 18 more
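
From the stack trace, my reading is that the --config-folder value and the
generated per-table properties file name are being joined by plain string
concatenation rather than with a path separator. A minimal sketch of what I
mean, with hypothetical names standing in for the redacted values
(PathJoinSketch and the names inside it are mine, not Hudi's):

import org.apache.hadoop.fs.Path;

public class PathJoinSketch {
  public static void main(String[] args) {
    // Hypothetical stand-ins for my redacted bucket and table names.
    String configFolder = "s3://my-bucket";        // note: no trailing slash
    String fileName = "my_db.my_table.properties";

    // Plain concatenation produces the malformed URI seen in the exception:
    //   s3://my-bucketmy_db.my_table.properties
    System.out.println(configFolder + fileName);

    // Hadoop's two-argument Path constructor inserts the separator:
    //   s3://my-bucket/my_db.my_table.properties
    System.out.println(new Path(configFolder, fileName));
  }
}

If that reading is right, I am guessing that passing --config-folder with a
trailing slash (s3://{{ bucket_name }}/) might work around it, but I have not
been able to confirm this.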

The spark-submit command I am running is:

spark-submit \
--class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer \
--master yarn \
--deploy-mode client \
--num-executors 10 \
--executor-memory 3g \
--driver-memory 6g \
--conf spark.scheduler.mode=FAIR \
--conf spark.yarn.executor.memoryOverhead=1072 \
--conf spark.yarn.driver.memoryOverhead=2048 \
--conf spark.task.cpus=1 \
--conf spark.executor.cores=1 \
--conf spark.task.maxFailures=10 \
--conf spark.memory.fraction=0.4 \
--conf spark.rdd.compress=true \
--conf spark.kryoserializer.buffer.max=200m \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.memory.storageFraction=0.1 \
--conf spark.shuffle.service.enabled=true \
--conf spark.sql.hive.convertMetastoreParquet=false \
--conf spark.driver.maxResultSize=3g \
--conf spark.executor.heartbeatInterval=120s \
--conf spark.network.timeout=600s \
--conf spark.yarn.submit.waitAppCompletion=true \
--conf spark.sql.shuffle.partitions=100 \
/usr/lib/hudi/hudi-utilities-bundle_2.12-0.7.0-amzn-0.jar \
--base-path-prefix s3://{{ other_bucket_name }} \
--target-table dummy_topic \
--table-type COPY_ON_WRITE \
--config-folder s3://{{ bucket_name }} \
--props s3://{{ bucket_name }}/deltastreamer.properties \
--source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
--source-ordering-field __source_ts_ms \
--schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
--hoodie-conf bootstrap.servers=<kafka_brokers> \
--hoodie-conf auto.offset.reset=earliest \
--hoodie-conf schema.registry.url=<schema_registry> \
--hoodie-conf hoodie.deltastreamer.schemaprovider.registry.baseUrl=<schema_registry>/subjects/ \
--hoodie-conf hoodie.deltastreamer.schemaprovider.registry.urlSuffix=-value/versions/latest \
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS \
--hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
--hoodie-conf hoodie.datasource.hive_sync.database={{ hive_database }} \
--hoodie-conf hoodie.datasource.hive_sync.assume_date_partitioning=false \
--hoodie-conf hoodie.datasource.hive_sync.partition_fields=_hoodie_partition_path \
--hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor \
--enable-hive-sync \
--continuous

and the deltastreamer.properties file looks like this:

hoodie.deltastreamer.ingestion.tablesToBeIngested={{ hive_database }}.{{ hive_table }}
hoodie.deltastreamer.ingestion.{{ hive_database }}.{{ hive_table }}.configFile=s3://{{ bucket_name }}/{{ hive_database }}.{{ hive_table }}.properties
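
For completeness, here is how I understand these two properties fitting
together -- a sketch of the lookup I believe the streamer performs, using
java.util.Properties as a stand-in for Hudi's own properties class and
hypothetical rendered values in place of the template placeholders (please
correct me if my mental model is wrong):

import java.util.Properties;

public class IngestionConfigSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Hypothetical rendered values for the placeholders above.
    props.setProperty("hoodie.deltastreamer.ingestion.tablesToBeIngested",
        "my_db.my_table");
    props.setProperty(
        "hoodie.deltastreamer.ingestion.my_db.my_table.configFile",
        "s3://my-bucket/my_db.my_table.properties");

    String tables =
        props.getProperty("hoodie.deltastreamer.ingestion.tablesToBeIngested");
    for (String table : tables.split(",")) {
      String[] parts = table.split("\\.");  // -> { database, table }
      String key = "hoodie.deltastreamer.ingestion."
          + parts[0] + "." + parts[1] + ".configFile";
      // My assumption: if this lookup misses, the streamer falls back to a
      // default path built from --config-folder, which is where the missing
      // separator would bite.
      System.out.println(props.getProperty(key));
    }
  }
}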

Once again, any help would be greatly appreciated!

Best wishes,

-- 

Philip Ittmann
Data Engineer

45, Kingfisher Drive,
Fourways, Sandton,
Gauteng, 2191, South Africa

paystack.com <http://www.paystack.com/>
