Hi Pratyaksh,

Thank you very much for your help. After reading through your advice, the
workaround I arrived at was setting `--config-folder s3://{{
s3_config_bucket }}//` (note the double forward slash at the end). This
circumvents the trailing slash being stripped from the `--config-folder`
parameter [1]. I am still unsure what the root cause of the problem is,
but the workaround lets me move forward with my work. Thank you again for
your assistance!
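
For anyone who hits the same problem, here is my reading of [1] as a
rough sketch: a single trailing slash appears to be stripped from the
config folder before the per-table file name is appended, so doubling the
slash leaves one separator intact. The snippet below is my hypothetical
simplification (bucket, database, and table names are placeholders), not
the actual Hudi code:

    public class SlashStripSketch {
      public static void main(String[] args) {
        String configFolder = "s3://my-config-bucket//"; // the workaround value
        // strip one trailing slash, as the deltastreamer appears to do at [1]
        if (configFolder.charAt(configFolder.length() - 1) == '/') {
          configFolder = configFolder.substring(0, configFolder.length() - 1);
        }
        // one separator survives, so appending the table's properties
        // file name still yields a valid S3 URI
        System.out.println(configFolder + "mydb.mytable.properties");
        // -> s3://my-config-bucket/mydb.mytable.properties
      }
    }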

Best wishes,
Philip

[1]
https://github.com/apache/hudi/blob/da65d3cae99e8fee0ede9b5ed8630a3716d284c8/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieMultiTableDeltaStreamer.java#L345

On Fri, Oct 8, 2021 at 4:43 PM Pratyaksh Sharma <[email protected]>
wrote:

> Hi Philip,
>
> I checked the configs you are passing and they all look good. Indeed,
> the problem is the absence of a forward slash, which should not happen
> in general. Can you try printing the configs once to see whether the
> configFile path is getting passed properly?
>
> Also, as a workaround, you can create `s3://{{ bucket_name }}/{{
> hive_database }}_{{ hive_table }}_config.properties` as the properties
> file for the table's overridden properties and not mention the property
> `hoodie.deltastreamer.ingestion.{{ hive_database }}.{{ hive_table
> }}.configFile` at all in the deltastreamer.properties file. You can find
> more information here -
> https://hudi.apache.org/blog/2020/08/22/ingest-multiple-tables-using-hudi.
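>
> For example (placeholder names): with hive_database=mydb and
> hive_table=mytable, you would create
> s3://your-config-bucket/mydb_mytable_config.properties, and the
> deltastreamer would pick it up by convention without any configFile
> property being set.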
>
> Hope that helps!
>
> On Wed, Oct 6, 2021 at 4:36 PM Philip Ittmann <[email protected]> wrote:
>
> > Good day,
> >
> > I am experiencing difficulty in getting a HoodieMultiTableDeltaStreamer
> > application to run successfully via spark-submit on an AWS EMR cluster
> > with the following versions:
> >
> > Hudi release: 0.7.0
> >
> > Release label: emr-6.3.0
> > Hadoop distribution: Amazon 3.2.1
> > Applications: Tez 0.9.2, Spark 3.1.1, Hive 3.1.2, Presto 0.245.1
> >
> > The error I am seeing follows below, but the gist of the problem seems
> > to be a missing forward slash between the bucket name and the file name
> > when a Path object is initialized.
> >
> > After the error stack trace I include the spark-submit command as well as
> > the properties file I am using. Any help would be greatly appreciated!
> >
> > Exception in thread "main" java.lang.IllegalArgumentException:
> > java.net.URISyntaxException: Relative path in absolute URI: s3://{{
> > bucket_name }}{{ hive_database }}.{{ hive_table }}.properties
> >         at org.apache.hadoop.fs.Path.initialize(Path.java:263)
> >         at org.apache.hadoop.fs.Path.<init>(Path.java:161)
> >         at org.apache.hadoop.fs.Path.<init>(Path.java:119)
> >         at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.checkIfTableConfigFileExists(HoodieMultiTableDeltaStreamer.java:99)
> >         at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.populateTableExecutionContextList(HoodieMultiTableDeltaStreamer.java:116)
> >         at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.<init>(HoodieMultiTableDeltaStreamer.java:80)
> >         at org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer.main(HoodieMultiTableDeltaStreamer.java:203)
> >         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> >         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> >         at java.lang.reflect.Method.invoke(Method.java:498)
> >         at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
> >         at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:959)
> >         at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
> >         at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
> >         at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
> >         at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1038)
> >         at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1047)
> >         at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> > Caused by: java.net.URISyntaxException: Relative path in absolute URI:
> > s3://{{ bucket_name }}{{ hive_database }}.{{ hive_table }}.properties
> >         at java.net.URI.checkPath(URI.java:1823)
> >         at java.net.URI.<init>(URI.java:745)
> >         at org.apache.hadoop.fs.Path.initialize(Path.java:260)
> >         ... 18 more
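> >
> > For what it's worth, the URI-level failure reproduces in isolation
> > (bucket, database, and table names below are hypothetical placeholders):
> >
> >     import java.net.URI;
> >
> >     public class SlashRepro {
> >       public static void main(String[] args) throws Exception {
> >         // authority = bucket, path = file name without a leading slash;
> >         // URI rejects this combination with the same
> >         // "Relative path in absolute URI" message as above
> >         new URI("s3", "my-bucket", "mydb.mytable.properties", null, null);
> >       }
> >     }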
> >
> > The spark-submit command I am running is:
> >
> > spark-submit \
> > --class
> > org.apache.hudi.utilities.deltastreamer.HoodieMultiTableDeltaStreamer \
> > --master yarn \
> > --deploy-mode client \
> > --num-executors 10 \
> > --executor-memory 3g \
> > --driver-memory 6g \
> > --conf spark.scheduler.mode=FAIR \
> > --conf spark.yarn.executor.memoryOverhead=1072 \
> > --conf spark.yarn.driver.memoryOverhead=2048 \
> > --conf spark.task.cpus=1 \
> > --conf spark.executor.cores=1 \
> > --conf spark.task.maxFailures=10 \
> > --conf spark.memory.fraction=0.4 \
> > --conf spark.rdd.compress=true \
> > --conf spark.kryoserializer.buffer.max=200m \
> > --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
> > --conf spark.memory.storageFraction=0.1 \
> > --conf spark.shuffle.service.enabled=true \
> > --conf spark.sql.hive.convertMetastoreParquet=false \
> > --conf spark.driver.maxResultSize=3g \
> > --conf spark.executor.heartbeatInterval=120s \
> > --conf spark.network.timeout=600s \
> > --conf spark.yarn.submit.waitAppCompletion=true \
> > --conf spark.sql.shuffle.partitions=100 \
> > /usr/lib/hudi/hudi-utilities-bundle_2.12-0.7.0-amzn-0.jar \
> > --base-path-prefix s3://{{ other_bucket_name }} \
> > --target-table dummy_topic \
> > --table-type COPY_ON_WRITE \
> > --config-folder s3://{{ bucket_name }} \
> > --props s3://{{ bucket_name }}/deltastreamer.properties \
> > --source-class org.apache.hudi.utilities.sources.AvroKafkaSource \
> > --source-ordering-field __source_ts_ms \
> > --schemaprovider-class org.apache.hudi.utilities.schema.SchemaRegistryProvider \
> > --hoodie-conf bootstrap.servers=<kafka_brokers> \
> > --hoodie-conf auto.offset.reset=earliest \
> > --hoodie-conf schema.registry.url=<schema_registry> \
> > --hoodie-conf hoodie.deltastreamer.schemaprovider.registry.baseUrl=<schema_registry>/subjects/ \
> > --hoodie-conf hoodie.deltastreamer.schemaprovider.registry.urlSuffix=-value/versions/latest \
> > --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.TimestampBasedKeyGenerator \
> > --hoodie-conf hoodie.deltastreamer.keygen.timebased.timestamp.type=EPOCHMILLISECONDS \
> > --hoodie-conf hoodie.deltastreamer.keygen.timebased.output.dateformat=yyyy/MM/dd \
> > --hoodie-conf hoodie.datasource.hive_sync.database={{ hive_database }} \
> > --hoodie-conf hoodie.datasource.hive_sync.assume_date_partitioning=false \
> > --hoodie-conf hoodie.datasource.hive_sync.partition_fields=_hoodie_partition_path \
> > --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.SlashEncodedDayPartitionValueExtractor \
> > --enable-hive-sync \
> > --continuous
> >
> > and the deltastreamer.properties file looks like this:
> >
> > hoodie.deltastreamer.ingestion.tablesToBeIngested={{ hive_database }}.{{ hive_table }}
> > hoodie.deltastreamer.ingestion.{{ hive_database }}.{{ hive_table }}.configFile=s3://{{ bucket_name }}/{{ hive_database }}.{{ hive_table }}.properties
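> >
> > With the placeholders rendered (hypothetical values: bucket_name=my-bucket,
> > hive_database=mydb, hive_table=mytable), that second property would read:
> >
> > hoodie.deltastreamer.ingestion.mydb.mytable.configFile=s3://my-bucket/mydb.mytable.properties
> >
> > so the forward slash is present in the configured value itself.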
> >
> > Once again, any help would be greatly appreciated!
> >
> > Best wishes,
> >
> > Philip Ittmann
> >
>


--

Philip Ittmann
Data Engineer

45, Kingfisher Drive,
Fourways, Sandton,
Gauteng, 2191, South Africa

paystack.com
