soma17dec commented on issue #4729: URL: https://github.com/apache/hudi/issues/4729#issuecomment-1033360986
Hi, I am currently running my pipeline on AWS EMR with Hudi 0.7, submitting jobs via `spark-submit` on the EMR cluster:

```
spark-submit \
  --deploy-mode client \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.default.parallelism=500 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=3 \
  --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=90s \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.app.name=TABLENAME \
  --jars /usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hive/lib/hbase-client.jar \
  /usr/lib/hudi/hudi-utilities-bundle.jar \
  --table-type MERGE_ON_READ \
  --op INSERT \
  --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://172.0.0.0:10000 \
  --source-ordering-field dms_seq_no \
  --props s3://TABLENAME/TABLENAME_full.properties \
  --hoodie-conf hoodie.datasource.hive_sync.database=default \
  --target-base-path s3://raw_bucket/TABLENAME \
  --target-table TABLENAME \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://landing_bucket/TABLENAME/ \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --enable-sync
```

Config file:

```
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=partition
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.SimpleKeyGenerator
hoodie.datasource.hive_sync.table=tablename
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.assume_date_partitioning=false
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
hoodie.parquet.small.file.limit=134217728
hoodie.parquet.max.file.size=268435456
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=1
hoodie.deltastreamer.transformer.sql=select CASE WHEN Op='D' THEN TRUE ELSE FALSE END AS _hoodie_is_deleted,PAR_COL as partition,* from <SRC>
hoodie.datasource.hive_sync.support_timestamp=false
hoodie.datasource.compaction.async.enable=true
hoodie.index.type=BLOOM
hoodie.compact.inline=true
hoodie.compact.inline.max.delta.commits=5
hoodie.metadata.compact.max.delta.commits=5
hoodie.clean.automatic=true
hoodie.clean.async=true
#hoodie.deltastreamer.transformer.sql=select CASE WHEN Op='D' THEN TRUE ELSE FALSE END AS _hoodie_is_deleted,CAST(date_created AS DATE) AS date_created_part,* from <SRC>
```

I am not sure how to apply the above-mentioned patch in CLI mode to update records in a raw Hudi table. Please let me know which version has the properties that support the CLI.
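For context on what the `SqlQueryBasedTransformer` query in the config does: it derives a `_hoodie_is_deleted` soft-delete flag from the DMS `Op` column and renames `PAR_COL` to `partition`. Below is a rough illustration in plain Python (standing in for the Spark SQL CASE expression; the sample rows and the `transform` helper are hypothetical, not part of the actual job):

```python
# Hypothetical sample rows mimicking DMS change records landed as Parquet.
rows = [
    {"id": 1, "Op": "I", "PAR_COL": "2021-01"},  # insert
    {"id": 2, "Op": "D", "PAR_COL": "2021-01"},  # delete
]

def transform(row):
    """Mimic the transformer SQL on a single record (illustrative only)."""
    out = dict(row)
    # CASE WHEN Op='D' THEN TRUE ELSE FALSE END AS _hoodie_is_deleted
    out["_hoodie_is_deleted"] = row["Op"] == "D"
    # PAR_COL as partition
    out["partition"] = row["PAR_COL"]
    return out

transformed = [transform(r) for r in rows]
```

Records flagged with `_hoodie_is_deleted=true` are treated as deletes by Hudi when the target table's schema carries that field.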