tooptoop4 opened a new issue #1955: URL: https://github.com/apache/hudi/issues/1955
/home/ec2-user/spark_home/bin/spark-submit \
  --conf "spark.hadoop.fs.s3a.proxy.host=redact" \
  --conf "spark.hadoop.fs.s3a.proxy.port=redact" \
  --conf "spark.driver.extraClassPath=/home/ec2-user/json-20090211.jar" \
  --conf "spark.executor.extraClassPath=/home/ec2-user/json-20090211.jar" \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --jars "/home/ec2-user/spark-avro_2.11-2.4.6.jar" \
  --master spark://redact:7077 \
  --deploy-mode client \
  /home/ec2-user/hudi-utilities-bundle_2.11-0.5.3-1.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field TimeCreated \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --enable-hive-sync \
  --hoodie-conf hoodie.datasource.hive_sync.database=redact \
  --hoodie-conf hoodie.datasource.hive_sync.table=dmstest_multpk3 \
  --hoodie-conf hoodie.datasource.hive_sync.partition_fields="sys_user" \
  --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
  --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
  --target-base-path s3a://redact/my2/multpk3 \
  --target-table dmstest_multpk3 \
  --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
  --hoodie-conf hoodie.datasource.write.recordkey.field=version_no,group_company \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=sys_user \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://redact/dbo/redact > multpk3.log

I do have https://github.com/apache/hudi/pull/1898 patched into this jar.

Instead of getting 1 row per version_no,group_company combination, I am getting multiple rows per combination. In fact, I am getting 1 row per version_no,group_company,sys_user combination.

How can I make Hudi not treat the partition field as part of the primary key?
That is, for each version_no,group_company combination, I want to keep only the latest row by TimeCreated (the source-ordering-field) and then partition on whatever sys_user that latest row has.
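
To make the expected behaviour concrete, here is a minimal Python sketch of the semantics being asked for. It is not Hudi code and does not describe how Hudi implements upserts. The sample rows, timestamps, and the helper names latest_per_key and partition_by are made up for illustration; only the field names come from the command above.

```python
from collections import defaultdict

# Hypothetical sample rows; only the field names come from the issue.
rows = [
    {"version_no": 1, "group_company": "A", "sys_user": "alice",
     "TimeCreated": "2020-08-01 10:00:00"},
    {"version_no": 1, "group_company": "A", "sys_user": "bob",
     "TimeCreated": "2020-08-02 09:30:00"},
    {"version_no": 2, "group_company": "A", "sys_user": "alice",
     "TimeCreated": "2020-08-01 11:00:00"},
]


def latest_per_key(rows, key_fields=("version_no", "group_company"),
                   ordering_field="TimeCreated"):
    """Keep one row per record key: the one with the largest ordering value."""
    latest = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        current = latest.get(key)
        # The partition field (sys_user) plays no part in the key comparison.
        if current is None or row[ordering_field] > current[ordering_field]:
            latest[key] = row
    return latest


def partition_by(latest, partition_field="sys_user"):
    """Group the surviving rows by the partition value of the latest row."""
    partitions = defaultdict(list)
    for row in latest.values():
        partitions[row[partition_field]].append(row)
    return dict(partitions)


result = partition_by(latest_per_key(rows))
for sys_user, kept in sorted(result.items()):
    print(sys_user, [(r["version_no"], r["group_company"]) for r in kept])
```

With this sample, the key (1, A) should end up only under sys_user=bob, because bob's row has the later TimeCreated. What I see instead is the equivalent of (1, A) appearing under both alice and bob.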