codope opened a new pull request, #6327:
URL: https://github.com/apache/hudi/pull/6327

   ### Change Logs
   
   - Shade `metrics-core` in `hudi-aws-bundle`
   - Remove duplicate includes in other bundles
   
   ### Impact
   
   Without this change, if Hudi metrics are turned on and the metrics reporter type is `CLOUDWATCH`, write client initialization fails (stacktrace in HUDI-4568). The root cause is that `metrics-core` is shaded in `hudi-spark-bundle` but not in `hudi-aws-bundle`, while `hudi-aws` uses this dependency, so the relocated and non-relocated classes mismatch at runtime and we get a `NoSuchMethodError`.
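   
   As a rough sketch of what the fix amounts to, the `maven-shade-plugin` configuration in `hudi-aws-bundle`'s pom would include and relocate `metrics-core` the same way the spark bundle does. The exact artifact and package patterns below are assumptions based on Hudi's usual shading convention, not a verbatim copy of this PR's diff:
   
   ```xml
   <!-- Hypothetical sketch; the actual patterns in the PR may differ. -->
   <artifactSet>
     <includes>
       <include>io.dropwizard.metrics:metrics-core</include>
     </includes>
   </artifactSet>
   <relocations>
     <relocation>
       <pattern>com.codahale.metrics.</pattern>
       <shadedPattern>org.apache.hudi.com.codahale.metrics.</shadedPattern>
     </relocation>
   </relocations>
   ```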
   
   **Risk level: high**
   
   Run the script below (with metrics and Hive sync turned on). Without this fix, the write fails with a `NoSuchMethodError`:
   ```
   ./bin/pyspark \
     --jars /home/hadoop/hudi-spark3.2-bundle_2.12-0.13.0-SNAPSHOT.jar,/home/hadoop/hudi-aws-bundle-0.13.0-SNAPSHOT.jar \
     --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
     --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog" \
     --conf "spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
   
   sc.setLogLevel("WARN")
   dataGen = sc._jvm.org.apache.hudi.QuickstartUtils.DataGenerator()
   inserts = sc._jvm.org.apache.hudi.QuickstartUtils.convertToStringList(
       dataGen.generateInserts(10)
   )
   from pyspark.sql.functions import expr
   
   df = spark.read.json(spark.sparkContext.parallelize(inserts, 10)).withColumn(
       "part", expr("'foo'")
   )
   tableName = "test_hudi_pyspark2"
   basePath = f"/tmp/{tableName}"
   hudi_options = {
       "hoodie.table.name": tableName,
       "hoodie.datasource.write.recordkey.field": "uuid",
       "hoodie.datasource.write.partitionpath.field": "part",
       "hoodie.datasource.write.table.name": tableName,
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.datasource.write.precombine.field": "ts",
       "hoodie.upsert.shuffle.parallelism": 2,
       "hoodie.insert.shuffle.parallelism": 2,
       "hoodie.datasource.hive_sync.database": "default",
       "hoodie.datasource.hive_sync.table": tableName,
       "hoodie.datasource.hive_sync.mode": "hms",
       "hoodie.datasource.hive_sync.enable": "true",
       "hoodie.datasource.hive_sync.partition_fields": "part",
       "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
       "hoodie.metrics.on": "true",
       "hoodie.metrics.reporter.type": "CLOUDWATCH",
   }
   
   df.write.format("hudi").options(**hudi_options).mode("overwrite").save(basePath)
   ```
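   
   One quick way to confirm the rebuilt bundle actually shades `metrics-core` is to list the jar's entries and check that the Dropwizard classes appear only under the relocated package. This is a hypothetical helper, not part of this PR; the relocated prefix is an assumption based on Hudi's usual shading convention:
   
   ```python
   import zipfile
   
   def metrics_core_relocated(entries,
                              original="com/codahale/metrics/",
                              relocated="org/apache/hudi/com/codahale/metrics/"):
       """Return True if the Dropwizard metrics classes in `entries` live under
       the relocated package and none remain at the original coordinates.
   
       The `relocated` prefix is an assumption; adjust it to match the bundle's pom.
       """
       has_relocated = any(e.startswith(relocated) for e in entries)
       has_unshaded = any(e.startswith(original) for e in entries)
       return has_relocated and not has_unshaded
   
   def check_bundle(jar_path):
       # List all entries of the bundle jar and run the relocation check.
       with zipfile.ZipFile(jar_path) as jar:
           return metrics_core_relocated(jar.namelist())
   ```
   
   Running `check_bundle("/home/hadoop/hudi-aws-bundle-0.13.0-SNAPSHOT.jar")` on a bundle built with this change should return `True`.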
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
