tooptoop4 opened a new issue #1955:
URL: https://github.com/apache/hudi/issues/1955


   /home/ec2-user/spark_home/bin/spark-submit \
     --conf "spark.hadoop.fs.s3a.proxy.host=redact" \
     --conf "spark.hadoop.fs.s3a.proxy.port=redact" \
     --conf "spark.driver.extraClassPath=/home/ec2-user/json-20090211.jar" \
     --conf "spark.executor.extraClassPath=/home/ec2-user/json-20090211.jar" \
     --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
     --jars "/home/ec2-user/spark-avro_2.11-2.4.6.jar" \
     --master spark://redact:7077 \
     --deploy-mode client \
     /home/ec2-user/hudi-utilities-bundle_2.11-0.5.3-1.jar \
     --table-type COPY_ON_WRITE \
     --source-ordering-field TimeCreated \
     --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
     --enable-hive-sync \
     --hoodie-conf hoodie.datasource.hive_sync.database=redact \
     --hoodie-conf hoodie.datasource.hive_sync.table=dmstest_multpk3 \
     --hoodie-conf hoodie.datasource.hive_sync.partition_fields="sys_user" \
     --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
     --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
     --target-base-path s3a://redact/my2/multpk3 \
     --target-table dmstest_multpk3 \
     --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
     --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
     --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
     --hoodie-conf hoodie.datasource.write.recordkey.field=version_no,group_company \
     --hoodie-conf hoodie.datasource.write.partitionpath.field=sys_user \
     --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://redact/dbo/redact \
     > multpk3.log
   
   I do have https://github.com/apache/hudi/pull/1898 patched into this jar.
   
   Instead of getting 1 row per version_no,group_company combo, I am getting 
multiple rows per version_no,group_company combo. In fact, I am getting 1 row 
per version_no,group_company,sys_user combo.
   
   How can I make it not treat the partition field as part of the primary key?
   
   That is, for each version_no,group_company combo, I want to get the latest 
row by TimeCreated (the source-ordering-field) and then partition on whatever 
sys_user that latest row has.
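
   The desired behavior can be sketched in plain Python. This is only an illustration of the semantics being asked for, not how Hudi implements upserts; the function names and the sample rows are hypothetical, and only the field names (version_no, group_company, sys_user, TimeCreated) come from the command above.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]
Key = Tuple[Any, Any]


def latest_per_key(rows: Iterable[Row]) -> Dict[Key, Row]:
    """Keep one row per (version_no, group_company): the one with the largest TimeCreated."""
    latest: Dict[Key, Row] = {}
    for row in rows:
        key = (row["version_no"], row["group_company"])
        current = latest.get(key)
        # sys_user is deliberately NOT part of the key.
        if current is None or row["TimeCreated"] >= current["TimeCreated"]:
            latest[key] = row
    return latest


def group_by_partition(latest: Dict[Key, Row]) -> Dict[Any, List[Row]]:
    """Place each surviving row in the partition named by its own sys_user."""
    partitions: Dict[Any, List[Row]] = {}
    for row in latest.values():
        partitions.setdefault(row["sys_user"], []).append(row)
    return partitions


# Hypothetical sample data: key (1, "A") moves from sys_user "alice" to "bob".
rows = [
    {"version_no": 1, "group_company": "A", "sys_user": "alice", "TimeCreated": 1},
    {"version_no": 1, "group_company": "A", "sys_user": "bob", "TimeCreated": 2},
    {"version_no": 2, "group_company": "A", "sys_user": "alice", "TimeCreated": 1},
]

latest = latest_per_key(rows)
partitions = group_by_partition(latest)
```

   With these sample rows, the key (1, "A") should end up only in the "bob" partition, whereas the behavior reported above would keep one copy under "alice" and another under "bob".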
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org