tooptoop4 opened a new issue #1954: URL: https://github.com/apache/hudi/issues/1954
I'm loading data from AWS DMS and I don't want any partitions (I did not specify `hoodie.datasource.hive_sync.partition_fields`, since the website says it can be left at its empty default).

```
/home/ec2-user/spark_home/bin/spark-submit \
  --conf "spark.hadoop.fs.s3a.proxy.host=redact" \
  --conf "spark.hadoop.fs.s3a.proxy.port=redact" \
  --conf "spark.driver.extraClassPath=/home/ec2-user/json-20090211.jar" \
  --conf "spark.executor.extraClassPath=/home/ec2-user/json-20090211.jar" \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --jars "/home/ec2-user/spark-avro_2.11-2.4.6.jar" \
  --master spark://redact:7077 \
  --deploy-mode client \
  /home/ec2-user/hudi-utilities-bundle_2.11-0.5.3-1.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field TimeCreated \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --enable-hive-sync \
  --hoodie-conf hoodie.datasource.hive_sync.database=redact \
  --hoodie-conf hoodie.datasource.hive_sync.table=dmstest_multpk4 \
  --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.MultiPartKeysValueExtractor \
  --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
  --target-base-path s3a://redact/my2/multpk4 \
  --target-table dmstest_multpk4 \
  --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
  --hoodie-conf hoodie.datasource.write.recordkey.field=version_no,group_company \
  --hoodie-conf hoodie.datasource.write.partitionpath.field=sys_user \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://redact/dbo/tblhere \
  > multpk4.log
```

```
2020-08-12 11:31:11,186 [main] INFO org.apache.hudi.client.AbstractHoodieWriteClient - Committed 20200812112840
2020-08-12 11:31:11,189 [main] INFO org.apache.hudi.utilities.deltastreamer.DeltaSync - Commit 20200812112840 successful!
2020-08-12 11:31:11,194 [main] INFO org.apache.hudi.utilities.deltastreamer.DeltaSync - Syncing target hoodie table with hive table(dmstest_multpk4). Hive metastore URL :jdbc:hive2://localhost:10000, basePath :s3a://redact/my2/multpk4
2020-08-12 11:31:11,960 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [[20200812112840__commit__COMPLETED]]
2020-08-12 11:31:14,264 [main] INFO org.apache.hudi.hive.HiveSyncTool - Trying to sync hoodie table dmstest_multpk4 with base path s3a://redact/my2/multpk4 of type COPY_ON_WRITE
2020-08-12 11:31:14,707 [main] INFO org.apache.hudi.hive.HoodieHiveClient - Reading schema from s3a://redact/my2/multpk4/mpark2/7ed7627c-6110-4d42-9df2-f3a6afe877df-0_187-25-15737_20200812112840.parquet
2020-08-12 11:31:15,330 [main] INFO org.apache.hudi.hive.HiveSyncTool - Hive table dmstest_multpk4 is not found. Creating it
2020-08-12 11:31:15,337 [main] INFO org.apache.hudi.hive.HoodieHiveClient - Creating table with CREATE EXTERNAL TABLE IF NOT EXISTS `redact`.`dmstest_multpk4`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `Op` string, `Id` int, `AuditProcessHistoryId` int, `org_id` int, `org_name` string, `org_sname` string, `org_mnem` string, `org_parent` int, `percent_holding` double, `group_company` string, `grp_ord_for_cln` string, `mkt_only` string, `pro_rate_ind` string, `show_shapes` string, `sec_code_pref` string, `alert_org_ref` string, `swift_bic` string, `exec_breakdown` string, `notes` string, `active` string, `version_no` int, `sys_date` bigint, `sys_user` string, `create_date` bigint, `cntry_of_dom` string, `client` string, `alert_acronym` string, `oneoff_client` string, `booking_domicile` string, `booking_dom_list` string, `TimeCreated` bigint, `UserCreated` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT
'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://redact/my2/multpk4'
2020-08-12 11:31:15,411 [main] INFO org.apache.hudi.hive.HoodieHiveClient - Time taken to start SessionState and create Driver: 74 ms
2020-08-12 11:31:15,444 [main] INFO hive.ql.parse.ParseDriver - Parsing command: CREATE EXTERNAL TABLE IF NOT EXISTS `redact`.`dmstest_multpk4`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `Op` string, `Id` int, `AuditProcessHistoryId` int, `org_id` int, `org_name` string, `org_sname` string, `org_mnem` string, `org_parent` int, `percent_holding` double, `group_company` string, `grp_ord_for_cln` string, `mkt_only` string, `pro_rate_ind` string, `show_shapes` string, `sec_code_pref` string, `alert_org_ref` string, `swift_bic` string, `exec_breakdown` string, `notes` string, `active` string, `version_no` int, `sys_date` bigint, `sys_user` string, `create_date` bigint, `cntry_of_dom` string, `client` string, `alert_acronym` string, `oneoff_client` string, `booking_domicile` string, `booking_dom_list` string, `TimeCreated` bigint, `UserCreated` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://redact/my2/multpk4'
2020-08-12 11:31:16,131 [main] INFO hive.ql.parse.ParseDriver - Parse Completed
2020-08-12 11:31:16,568 [main] INFO org.apache.hudi.hive.HoodieHiveClient - Time taken to execute [CREATE EXTERNAL TABLE IF NOT EXISTS `redact`.`dmstest_multpk4`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `Op` string, `Id` int, `AuditProcessHistoryId` int, `org_id` int,
`org_name` string, `org_sname` string, `org_mnem` string, `org_parent` int, `percent_holding` double, `group_company` string, `grp_ord_for_cln` string, `mkt_only` string, `pro_rate_ind` string, `show_shapes` string, `sec_code_pref` string, `alert_org_ref` string, `swift_bic` string, `exec_breakdown` string, `notes` string, `active` string, `version_no` int, `sys_date` bigint, `sys_user` string, `create_date` bigint, `cntry_of_dom` string, `client` string, `alert_acronym` string, `oneoff_client` string, `booking_domicile` string, `booking_dom_list` string, `TimeCreated` bigint, `UserCreated` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://redact/my2/multpk4']: 1157 ms
2020-08-12 11:31:16,574 [main] INFO org.apache.hudi.hive.HiveSyncTool - Schema sync complete. Syncing partitions for dmstest_multpk4
2020-08-12 11:31:16,574 [main] INFO org.apache.hudi.hive.HiveSyncTool - Last commit time synced was found to be null
2020-08-12 11:31:16,575 [main] INFO org.apache.hudi.hive.HoodieHiveClient - Last commit time synced is not known, listing all partitions in s3a://redact/my2/multpk4,FS :S3AFileSystem{uri=s3a://redact, workingDir=s3a://redact/user/ec2-user, inputPolicy=normal, partSize=104857600, enableMultiObjectsDelete=true, maxKeys=5000, readAhead=65536, blockSize=33554432, multiPartThreshold=2147483647, serverSideEncryptionAlgorithm='AES256', blockFactory=org.apache.hadoop.fs.s3a.S3ADataBlocks$DiskBlockFactory@62765aec, boundedExecutor=BlockingThreadPoolExecutorService{SemaphoredDelegatingExecutor{permitCount=2405, available=2405, waiting=0}, activeCount=0}, unboundedExecutor=java.util.concurrent.ThreadPoolExecutor@6f5bd362[Running, pool size = 6, active threads = 0, queued tasks = 0, completed tasks = 6], statistics {761530 bytes read, 320081 bytes written, 712 read ops, 0 large read ops, 31 write ops}, metrics {{Context=S3AFileSystem} {FileSystemId=db54a51b-e05e-4b3c-9140-240762a0c03d-redact } {fsURI=s3a://redact/redact/sparkevents} {files_created=5} {files_copied=0} {files_copied_bytes=0} {files_deleted=271} {fake_directories_deleted=0} {directories_created=6} {directories_deleted=0} {ignored_errors=4} {op_copy_from_local_file=0} {op_exists=53} {op_get_file_status=415} {op_glob_status=0} {op_is_directory=38} {op_is_file=0} {op_list_files=271} {op_list_located_status=0} {op_list_status=19} {op_mkdirs=5} {op_rename=0} {object_copy_requests=0} {object_delete_requests=5} {object_list_requests=680} {object_continue_list_requests=0} {object_metadata_requests=805} {object_multipart_aborted=0} {object_put_bytes=320081} {object_put_requests=10} {object_put_requests_completed=10} {stream_write_failures=0} {stream_write_block_uploads=0} {stream_write_block_uploads_committed=0} {stream_write_block_uploads_aborted=0} {stream_write_total_time=0} {stream_write_total_data=320081} {object_put_requests_active=0} {object_put_bytes_pending=0} {stream_write_block_uploads_active=0} {stream_write_block_uploads_pending=4} {stream_write_block_uploads_data_pending=0} {stream_read_fully_operations=0} {stream_opened=22} {stream_bytes_skipped_on_seek=0} {stream_closed=22} {stream_bytes_backwards_on_seek=437965} {stream_bytes_read=761530} {stream_read_operations_incomplete=107} {stream_bytes_discarded_in_abort=0} {stream_close_operations=22} {stream_read_operations=3020} {stream_aborted=0} {stream_forward_seek_operations=0} {stream_backward_seek_operations=1} {stream_seek_operations=1} {stream_bytes_read_in_close=8} {stream_read_exceptions=0} }}
2020-08-12 11:31:34,438 [main] INFO org.apache.hudi.hive.HiveSyncTool - Storage partitions scan complete.
Found 271
2020-08-12 11:31:34,476 [main] INFO org.apache.hudi.hive.HiveSyncTool - New Partitions [AAB, redactlist]
2020-08-12 11:31:34,476 [main] INFO org.apache.hudi.hive.HoodieHiveClient - Adding partitions 271 to table dmstest_multpk4
2020-08-12 11:31:34,477 [main] ERROR org.apache.hudi.hive.HiveSyncTool - Got runtime exception when hive syncing
org.apache.hudi.hive.HoodieHiveSyncException: Failed to sync partitions for table dmstest_multpk4
	at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:187)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:126)
	at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:87)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncHive(DeltaSync.java:460)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.writeToSink(DeltaSync.java:402)
	at org.apache.hudi.utilities.deltastreamer.DeltaSync.syncOnce(DeltaSync.java:235)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.sync(HoodieDeltaStreamer.java:123)
	at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:380)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: Partition key parts [] does not match with partition values [AAB]. Check partition strategy.
	at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:40)
	at org.apache.hudi.hive.HoodieHiveClient.getPartitionClause(HoodieHiveClient.java:182)
	at org.apache.hudi.hive.HoodieHiveClient.constructAddPartitions(HoodieHiveClient.java:166)
	at org.apache.hudi.hive.HoodieHiveClient.addPartitionsToTable(HoodieHiveClient.java:141)
	at org.apache.hudi.hive.HiveSyncTool.syncPartitions(HiveSyncTool.java:182)
	... 19 more
2020-08-12 11:31:34,513 [main] INFO org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer - Shut down deltastreamer
2020-08-12 11:31:34,535 [main] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Shutting down all executors
```

```
aws s3 ls s3://redact/my2/multpk4/
    PRE .hoodie/
    PRE AAB/
    PRE CC/
    PRE DD/
...etc
```

----------------------------------------------------------------

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
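To make sense of the `IllegalArgumentException` above: my write config partitions the data by `sys_user` (so storage has paths like `AAB/`), but I left `hoodie.datasource.hive_sync.partition_fields` empty, so the sync tool has zero partition key columns to pair those path values with. This is a Python paraphrase of what I understand `HoodieHiveClient.getPartitionClause` to be checking (not Hudi's actual code; the function name and splitting behavior are my reading of the stack trace and of `MultiPartKeysValueExtractor`):

```python
def get_partition_clause(partition_fields, partition_path):
    """Sketch of the check that throws above: pair configured Hive partition
    key columns with the values extracted from a storage partition path."""
    # MultiPartKeysValueExtractor yields one value per "/"-separated path segment.
    partition_values = partition_path.split("/")
    if len(partition_fields) != len(partition_values):
        raise ValueError(
            "Partition key parts %s does not match with partition values %s. "
            "Check partition strategy." % (partition_fields, partition_values))
    return ", ".join(
        "`%s`='%s'" % (f, v)
        for f, v in zip(partition_fields, partition_values))

# My setup: hive_sync.partition_fields is empty, but storage has paths
# like "AAB/" because the writer partitioned by sys_user -> [] vs ["AAB"].
try:
    get_partition_clause([], "AAB")
except ValueError as e:
    print(e)

# If the two sides had been consistent, the clause would build fine:
print(get_partition_clause(["sys_user"], "AAB"))
```

So the two configs contradict each other, and the sync fails on the first storage partition it finds.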
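For reference, if the intent really is a non-partitioned table, I believe the write side and the Hive-sync side both have to say so. This is an untested sketch of the flags I'd change in the spark-submit above (the `NonpartitionedKeyGenerator` / `NonPartitionedExtractor` class names are my assumption from the Hudi docs, and I'm not sure how they compose with a multi-field record key like `version_no,group_company`):

```shell
# Untested sketch: replace the sys_user partitioning with a non-partitioned setup.
# Assumption: these class names match the Hudi 0.5.3 docs; verify before use.
--hoodie-conf hoodie.datasource.write.partitionpath.field= \
--hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator \
--hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
```

With the current config the table is in fact partitioned on storage (the `aws s3 ls` output shows `AAB/`, `CC/`, `DD/`, ...), so leaving `hive_sync.partition_fields` empty alone cannot make the sync non-partitioned.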