soma17dec commented on issue #4729:
URL: https://github.com/apache/hudi/issues/4729#issuecomment-1033360986


   Hi,
   
   I am currently running my pipeline on AWS EMR with Hudi 0.7, and I submit my jobs through the spark-submit CLI on the EMR cluster.
   
   ```
   spark-submit \
   --deploy-mode client \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
   --conf spark.shuffle.service.enabled=true \
   --conf spark.default.parallelism=500 \
   --conf spark.dynamicAllocation.enabled=true \
   --conf spark.dynamicAllocation.initialExecutors=3 \
   --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=90s \
   --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
   --conf spark.app.name=TABLENAME \
   --jars /usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hive/lib/hbase-client.jar \
   /usr/lib/hudi/hudi-utilities-bundle.jar \
   --table-type MERGE_ON_READ \
   --op INSERT \
   --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://172.0.0.0:10000 \
   --source-ordering-field dms_seq_no \
   --props s3://TABLENAME/TABLENAME_full.properties \
   --hoodie-conf hoodie.datasource.hive_sync.database=default \
   --target-base-path s3://raw_bucket/TABLENAME \
   --target-table TABLENAME \
   --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
   --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://landing_bucket/TABLENAME/ \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --enable-sync
   ```
   
   The properties file:
   
   
   ```
   hoodie.datasource.write.recordkey.field=id
   hoodie.datasource.write.partitionpath.field=partition
   hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
   hoodie.datasource.hive_sync.table=tablename
   hoodie.datasource.hive_sync.enable=true
   hoodie.datasource.hive_sync.assume_date_partitioning=false
   hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
   hoodie.parquet.small.file.limit=134217728
   hoodie.parquet.max.file.size=268435456
   hoodie.cleaner.policy=KEEP_LATEST_COMMITS
   hoodie.cleaner.commits.retained=1
   hoodie.deltastreamer.transformer.sql=select CASE WHEN Op='D' THEN TRUE ELSE FALSE END AS _hoodie_is_deleted,PAR_COL as partition,* from <SRC>
   hoodie.datasource.hive_sync.support_timestamp=false
   hoodie.datasource.compaction.async.enable=true
   hoodie.index.type=BLOOM
   hoodie.compact.inline=true
   hoodie.compact.inline.max.delta.commits=5
   hoodie.metadata.compact.max.delta.commits=5
   hoodie.clean.automatic=true
   hoodie.clean.async=true
   #hoodie.deltastreamer.transformer.sql=select CASE WHEN Op='D' THEN TRUE ELSE FALSE END AS _hoodie_is_deleted,CAST(date_created AS DATE) AS date_created_part,* from <SRC>
   ```
   
   
   I am not sure how to use the above-mentioned patch in CLI mode to update records in a raw Hudi table.
   
   Please let me know which Hudi version supports these properties from the CLI?
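
   For context, my understanding is that the write operation for DeltaStreamer is selected with the `--op` flag. A minimal sketch (assuming the same paths and placeholder names from the command above, not verified values) of the same submission switched from inserts to upserts would be:
   
   ```
   # Hedged sketch: same DeltaStreamer invocation, but with --op UPSERT so that
   # incoming records whose record key (id) already exists in the raw table
   # update the existing rows instead of inserting duplicates.
   # TABLENAME and the bucket paths are the placeholders used above.
   spark-submit \
   --deploy-mode client \
   --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
   --jars /usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hive/lib/hbase-client.jar \
   /usr/lib/hudi/hudi-utilities-bundle.jar \
   --table-type MERGE_ON_READ \
   --op UPSERT \
   --source-ordering-field dms_seq_no \
   --props s3://TABLENAME/TABLENAME_full.properties \
   --target-base-path s3://raw_bucket/TABLENAME \
   --target-table TABLENAME \
   --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
   --enable-sync
   ```
   
   With UPSERT, `--source-ordering-field dms_seq_no` determines which version of a record wins when multiple versions of the same key arrive.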


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

