Hi Philip,

I checked the configs you are passing and they all look good. The problem
is indeed the missing forward slash, which should not happen in general.
Can you try printing the configs once and check whether the configFile
path is being passed through properly?
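
For instance (assuming the AWS CLI is available where you submit from, and
that the {{ }} placeholders are rendered before the file is uploaded), you
can dump the properties file exactly as the job will read it:

aws s3 cp s3://{{ bucket_name }}/deltastreamer.properties -

The configFile value it prints should contain a '/' between the bucket name
and the file name, i.e. s3://{{ bucket_name }}/{{ hive_database }}.{{ hive_table }}.properties
rather than s3://{{ bucket_name }}{{ hive_database }}.{{ hive_table }}.properties
as seen in your stack trace.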

Also, as a workaround, you can create `s3://{{ bucket_name }}/{{
hive_database }}_{{ hive_table }}_config.properties` as the properties file
for the table-level overridden properties and drop the property
`hoodie.deltastreamer.ingestion.{{ hive_database }}.{{ hive_table
}}.configFile` from the deltastreamer.properties file entirely. You can find
more information here -
https://hudi.apache.org/blog/2020/08/22/ingest-multiple-tables-using-hudi.
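
Roughly, the split would look like this (a sketch only; the table-level keys
below are just examples, use whatever overrides your tables actually need).
deltastreamer.properties keeps only the common, ingestion-level properties:

hoodie.deltastreamer.ingestion.tablesToBeIngested={{ hive_database }}.{{ hive_table }}

and s3://{{ bucket_name }}/{{ hive_database }}_{{ hive_table }}_config.properties
carries the per-table overrides, for example:

hoodie.datasource.write.recordkey.field=<your_record_key_field>
hoodie.datasource.hive_sync.table={{ hive_table }}

Since you already pass --config-folder s3://{{ bucket_name }}, the
multi-table deltastreamer should pick up the
{{ hive_database }}_{{ hive_table }}_config.properties file by its default
naming convention, so the explicit configFile property is no longer needed.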

Hope that helps!

On Wed, Oct 6, 2021 at 4:36 PM Philip Ittmann <[email protected]>
wrote:

> Good day,
>
> I am experiencing difficulty in getting a HoodieMultiTableDeltaStreamer
> application to successfully run via spark-submit on an AWS EMR cluster with
> the following versions:
>
> Hudi release: 0.7.0
>
> Release label: emr-6.3.0
> Hadoop distribution: Amazon 3.2.1
> Applications: Tez 0.9.2, Spark 3.1.1, Hive 3.1.2, Presto 0.245.1
>
> The error I am seeing follows below, but the gist of the problem seems to
> be related to leaving out a forward slash between the bucket name and the
> filename when initializing a Path object.
>
> After the error stack trace I include the spark-submit command as well as
> the properties file I am using. Any help would be greatly appreciated!
>
> Exception in thread "main" java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: s3://{{ bucket_name }}{{ hive_database }}.{{ hive_table }}.properties
>         at org.apache.hadoop.fs.Path.initialize(Path.java:263)
>         at org.apache.hadoop.fs.Path.<init>(Path.java:161)
>         at org.apache.hadoop.fs.Path.<init>(Path.java:119)
>         at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.checkIfTableConfigFileExists(HoodieMultiTableDeltaStreamer.java:99)
>         at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.populateTableExecutionContextList(HoodieMultiTableDeltaStreamer.java:116)
>         at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.<init>(HoodieMultiTableDeltaStreamer.java:80)
>         at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.main(HoodieMultiTableDeltaStreamer.java:203)
>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
>         at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:959)
>         at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
>         at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
>         at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
>         at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1038)
>         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1047)
>         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> Caused by: java.net.URISyntaxException: Relative path in absolute URI:
> s3://{{ bucket_name }}{{ hive_database }}.{{ hive_table }}.properties
>         at java.net.URI.checkPath(URI.java:1823)
>         at java.net.URI.<init>(URI.java:745)
>         at org.apache.hadoop.fs.Path.initialize(Path.java:260)
>         ... 18 more
>
> The spark-submit command I am running is:
>
> spark-submit \
> --class org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer \
> --master yarn \
> --deploy-mode client \
> --num-executors 10 \
> --executor-memory 3g \
> --driver-memory 6g \
> --conf spark.scheduler.mode=FAIR \
> --conf spark.yarn.executor.memoryOverhead=1072 \
> --conf spark.yarn.driver.memoryOverhead=2048 \
> --conf spark.task.cpus=1 \
> --conf spark.executor.cores=1 \
> --conf spark.task.maxFailures=10 \
> --conf spark.memory.fraction=0.4 \
> --conf spark.rdd.compress=true \
> --conf spark.kryoserializer.buffer.max=200m \
> --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> --conf spark.memory.storageFraction=0.1 \
> --conf spark.shuffle.service.enabled=true \
> --conf spark.sql.hive.convertMetastoreParquet=false \
> --conf spark.driver.maxResultSize=3g \
> --conf spark.executor.heartbeatInterval=120s \
> --conf spark.network.timeout=600s \
> --conf spark.yarn.submit.waitAppCompletion=true \
> --conf spark.sql.shuffle.partitions=100 \
> /usr/lib/hudi/hudi-utilities-bundle_2.12-0.7.0-amzn-0.jar \
> --base-path-prefix s3://{{ other_bucket_name }} \
> --target-table dummy_topic \
> --table-type COPY_ON_WRITE \
> --config-folder s3://{{ bucket_name }} \
> --props s3://{{ bucket_name }}/deltastreamer.properties \
> --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
> --source-ordering-field __source_ts_ms \
> --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
> --hoodie-conf bootstrap.servers=<kafka_brokers> \
> --hoodie-conf auto.offset.reset=earliest \
> --hoodie-conf schema.registry.url=<schema_registry> \
> --hoodie-conf hoodie.deltastreamer.schemaprovider.registry.baseUrl=<schema_registry>/subjects/ \
> --hoodie-conf hoodie.deltastreamer.schemaprovider.registry.urlSuffix=-value/versions/latest \
> --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS \
> --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
> --hoodie-conf hoodie.datasource.hive_sync.database={{ hive_database }} \
> --hoodie-conf hoodie.datasource.hive_sync.assume_date_partitioning=false \
> --hoodie-conf hoodie.datasource.hive_sync.partition_fields=_hoodie_partition_path \
> --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor \
> --enable-hive-sync \
> --continuous
>
> and the deltastreamer.properties file looks like this:
>
> hoodie.deltastreamer.ingestion.tablesToBeIngested={{ hive_database }}.{{ hive_table }}
> hoodie.deltastreamer.ingestion.{{ hive_database }}.{{ hive_table }}.configFile=s3://{{ bucket_name }}/{{ hive_database }}.{{ hive_table }}.properties
>
> Once again, any help would be greatly appreciated!
>
> Best wishes,
>
> --
>
> -
>
> Philip Ittmann
>
> Data Engineer
>
> -
>
> 45, Kingfisher Drive,
> Fourways, Sandton,
> Gauteng, 2191, South Africa
>
> paystack.com <http://www.paystack.com/>
>
