tooptoop4 commented on issue #1954: URL: https://github.com/apache/hudi/issues/1954#issuecomment-672843556
Got a bit further with the command below. The Hudi/Spark job now succeeds, but the Hive DDL points at the wrong S3 location, so a SELECT from Hive/Presto gives an error. When I manually alter the S3 location in the table DDL via HiveServer2, it works (i.e. change LOCATION 's3a://redact/my2/multpk7' to LOCATION 's3a://redact/my2/multpk7/default'), so I think a code change is needed to make it create the table at the proper S3 location.

```
/home/ec2-user/spark_home/bin/spark-submit \
  --conf "spark.hadoop.fs.s3a.proxy.host=redact" \
  --conf "spark.hadoop.fs.s3a.proxy.port=redact" \
  --conf "spark.driver.extraClassPath=/home/ec2-user/json-20090211.jar" \
  --conf "spark.executor.extraClassPath=/home/ec2-user/json-20090211.jar" \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --jars "/home/ec2-user/spark-avro_2.11-2.4.6.jar" \
  --master spark://redact:7077 \
  --deploy-mode client \
  /home/ec2-user/hudi-utilities-bundle_2.11-0.5.3-1.jar \
  --table-type COPY_ON_WRITE \
  --source-ordering-field TimeCreated \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --enable-hive-sync \
  --hoodie-conf hoodie.datasource.hive_sync.database=redact \
  --hoodie-conf hoodie.datasource.hive_sync.table=dmstest_multpk7 \
  --hoodie-conf hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor \
  --hoodie-conf hoodie.datasource.hive_sync.use_jdbc=false \
  --target-base-path s3a://redact/my2/multpk7 \
  --target-table dmstest_multpk7 \
  --transformer-class org.apache.hudi.utilities.transform.AWSDmsTransformer \
  --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
  --hoodie-conf hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator \
  --hoodie-conf hoodie.datasource.write.recordkey.field=version_no,group_company \
  --hoodie-conf "hoodie.datasource.write.partitionpath.field=" \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3a://redact/dbo/tbl \
  > multpk7.log OK
```

cat multpk7.log

```
2020-08-12 12:18:15,375 [main] WARN 
org.apache.hudi.utilities.deltastreamer.SchedulerConfGenerator - Job Scheduling Configs will not be in effect as spark.scheduler.mode is not set to FAIR at instantiation time. Continuing without scheduling configs 2020-08-12 12:18:16,386 [dispatcher-event-loop-3] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Connected to Spark cluster with app ID app-20200812121816-0086 2020-08-12 12:18:17,199 [main] INFO com.amazonaws.http.AmazonHttpClient - Configuring Proxy. redact 2020-08-12 12:18:18,154 [main] INFO org.apache.spark.scheduler.EventLoggingListener - Logging events to s3a://redact/sparkevents/app-20200812121816-0086 2020-08-12 12:18:18,171 [dispatcher-event-loop-2] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200812121816-0086/0 on hostPort redact:19629 with 4 core(s), 7.9 GB RAM 2020-08-12 12:18:18,195 [main] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0 2020-08-12 12:18:18,427 [main] WARN org.apache.spark.SparkContext - Using an existing SparkContext; some configuration may not take effect. 
2020-08-12 12:18:18,526 [main] ERROR org.apache.hudi.common.util.DFSPropertiesConfiguration - Error reading in properies from dfs java.io.FileNotFoundException: File file:/home/ec2-user/http_listener/logs/src/test/resources/delta-streamer-config/dfs-source.properties does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:635) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:861) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:625) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:787) at org.apache.hudi.common.util.DFSPropertiesConfiguration.visitFile(DFSPropertiesConfiguration.java:87) at org.apache.hudi.common.util.DFSPropertiesConfiguration.<init>(DFSPropertiesConfiguration.java:60) at org.apache.hudi.common.util.DFSPropertiesConfiguration.<init>(DFSPropertiesConfiguration.java:64) at org.apache.hudi.utilities.UtilHelpers.readConfig(UtilHelpers.java:118) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:451) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:97) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:91) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:380) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) 2020-08-12 12:18:18,528 [main] WARN org.apache.hudi.utilities.UtilHelpers - Unexpected error read props file at :file:/home/ec2-user/http_listener/logs/src/test/resources/delta-streamer-config/dfs-source.properties java.lang.IllegalArgumentException: Cannot read properties from dfs at org.apache.hudi.common.util.DFSPropertiesConfiguration.visitFile(DFSPropertiesConfiguration.java:91) at org.apache.hudi.common.util.DFSPropertiesConfiguration.<init>(DFSPropertiesConfiguration.java:60) at org.apache.hudi.common.util.DFSPropertiesConfiguration.<init>(DFSPropertiesConfiguration.java:64) at org.apache.hudi.utilities.UtilHelpers.readConfig(UtilHelpers.java:118) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer$DeltaSyncService.<init>(HoodieDeltaStreamer.java:451) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:97) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.<init>(HoodieDeltaStreamer.java:91) at org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer.main(HoodieDeltaStreamer.java:380) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at 
org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:845) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161) at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:920) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:929) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: java.io.FileNotFoundException: File file:/home/ec2-user/http_listener/logs/src/test/resources/delta-streamer-config/dfs-source.properties does not exist at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:635) at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:861) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:625) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:442) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:146) at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:347) at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:787) at org.apache.hudi.common.util.DFSPropertiesConfiguration.visitFile(DFSPropertiesConfiguration.java:87) ... 19 more 2020-08-12 12:18:18,528 [main] INFO org.apache.hudi.utilities.UtilHelpers - Adding overridden properties to file properties. 
2020-08-12 12:18:18,529 [main] INFO org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer - Creating delta streamer with configs : {hoodie.datasource.hive_sync.use_jdbc=false, hoodie.datasource.write.recordkey.field=version_no,group_company, hoodie.datasource.write.partitionpath.field=, hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator, hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor, hoodie.datasource.hive_sync.table=dmstest_multpk7, hoodie.deltastreamer.source.dfs.root=s3a://redact/dbo/tbl, hoodie.datasource.hive_sync.database=redact} 2020-08-12 12:18:18,533 [main] INFO org.apache.hudi.utilities.deltastreamer.DeltaSync - Creating delta streamer with configs : {hoodie.datasource.hive_sync.use_jdbc=false, hoodie.datasource.write.recordkey.field=version_no,group_company, hoodie.datasource.write.partitionpath.field=, hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.ComplexKeyGenerator, hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor, hoodie.datasource.hive_sync.table=dmstest_multpk7, hoodie.deltastreamer.source.dfs.root=s3a://redact/dbo/tbl, hoodie.datasource.hive_sync.database=redact} 2020-08-12 12:18:19,798 [main] INFO org.apache.hudi.utilities.deltastreamer.DeltaSync - Setting up Hoodie Write Client 2020-08-12 12:18:19,799 [main] INFO org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer - Delta Streamer running only single round 2020-08-12 12:18:20,218 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [] 2020-08-12 12:18:20,222 [main] INFO org.apache.hudi.utilities.deltastreamer.DeltaSync - Checkpoint to resume from : Option{val=null} 2020-08-12 12:18:42,136 [main] INFO org.apache.hudi.utilities.deltastreamer.DeltaSync - Setting up Hoodie Write Client 2020-08-12 12:18:42,156 [main] INFO org.apache.hudi.utilities.deltastreamer.DeltaSync - Registering Schema 
:[{"type":"record","name":"hoodie_source","namespace":"hoodie.source","fields":[{"name":"Op","type":["string","null"]},{"name":"Id","type":["int","null"]},{"name":"AuditProcessHistoryId","type":["int","null"]},{"name":"org_id","type":["int","null"]},{"name":"org_name","type":["string","null"]},{"name":"org_sname","type":["string","null"]},{"name":"org_mnem","type":["string","null"]},{"name":"org_parent","type":["int","null"]},{"name":"percent_holding","type":["double","null"]},{"name":"group_company","type":["string","null"]},{"name":"grp_ord_for_cln","type":["string","null"]},{"name":"mkt_only","type":["string","null"]},{"name":"pro_rate_ind","type":["string","null"]},{"name":"show_shapes","type":["string","null"]},{"name":"sec_code_pref","type":["string","null"]},{"name":"alert_org_ref","type":["string","null"]},{"name":"swift_bic","type":["string","null"]},{"name":"exec_breakdown","type":["string","null"]},{"name":"notes","type":["string","null"]},{"name":"active","type":["string","null"]},{"name":"version_no","type":["int","null"]},{"name":"sys_date","type":[{"type":"long","logicalType":"timestamp-micros"},"null"]},{"name":"sys_user","type":["string","null"]},{"name":"create_date","type":[{"type":"long","logicalType":"timestamp-micros"},"null"]},{"name":"cntry_of_dom","type":["string","null"]},{"name":"client","type":["string","null"]},{"name":"alert_acronym","type":["string","null"]},{"name":"oneoff_client","type":["string","null"]},{"name":"booking_domicile","type":["string","null"]},{"name":"booking_dom_list","type":["string","null"]},{"name":"TimeCreated","type":[{"type":"long","logicalType":"timestamp-micros"},"null"]},{"name":"UserCreated","type":["string","null"]}]}, {"type":"record","name":"hoodie_source","namespace":"hoodie.source","fields":[{"name":"Op","type":["string","null"]},{"name":"Id","type":["int","null"]},{"name":"AuditProcessHistoryId","type":["int","null"]},{"name":"org_id","type":["int","null"]},{"name":"org_name","type":["string","null"]},{"name":"org_sname","type":["string","null"]},{"name":"org_mnem","type":["string","null"]},{"name":"org_parent","type":["int","null"]},{"name":"percent_holding","type":["double","null"]},{"name":"group_company","type":["string","null"]},{"name":"grp_ord_for_cln","type":["string","null"]},{"name":"mkt_only","type":["string","null"]},{"name":"pro_rate_ind","type":["string","null"]},{"name":"show_shapes","type":["string","null"]},{"name":"sec_code_pref","type":["string","null"]},{"name":"alert_org_ref","type":["string","null"]},{"name":"swift_bic","type":["string","null"]},{"name":"exec_breakdown","type":["string","null"]},{"name":"notes","type":["string","null"]},{"name":"active","type":["string","null"]},{"name":"version_no","type":["int","null"]},{"name":"sys_date","type":[{"type":"long","logicalType":"timestamp-micros"},"null"]},{"name":"sys_user","type":["string","null"]},{"name":"create_date","type":[{"type":"long","logicalType":"timestamp-micros"},"null"]},{"name":"cntry_of_dom","type":["string","null"]},{"name":"client","type":["string","null"]},{"name":"alert_acronym","type":["string","null"]},{"name":"oneoff_client","type":["string","null"]},{"name":"booking_domicile","type":["string","null"]},{"name":"booking_dom_list","type":["string","null"]},{"name":"TimeCreated","type":[{"type":"long","logicalType":"timestamp-micros"},"null"]},{"name":"UserCreated","type":["string","null"]}]}]
2020-08-12 12:18:50,361 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants []
2020-08-12 12:18:50,934 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants []
2020-08-12 12:18:50,937 [main] INFO org.apache.hudi.client.HoodieWriteClient - Generate a new instant time 20200812121850
2020-08-12 12:18:51,226 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants []
2020-08-12 12:18:51,234 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Creating a new instant [==>20200812121850__commit__REQUESTED] 2020-08-12 12:18:51,415 [main] INFO org.apache.hudi.utilities.deltastreamer.DeltaSync - Starting commit : 20200812121850 2020-08-12 12:18:51,699 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [[==>20200812121850__commit__REQUESTED]] 2020-08-12 12:18:51,982 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [[==>20200812121850__commit__REQUESTED]] 2020-08-12 12:19:21,501 [main] INFO org.apache.hudi.index.bloom.HoodieBloomIndex - InputParallelism: ${1500}, IndexParallelism: ${0} 2020-08-12 12:19:32,817 [main] INFO org.apache.hudi.client.HoodieWriteClient - Workload profile :WorkloadProfile {globalStat=WorkloadStat {numInserts=103, numUpdates=0}, partitionStat={default=WorkloadStat {numInserts=103, numUpdates=0}}} 2020-08-12 12:19:32,841 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Checking for file exists ?s3a://redact/my2/multpk7/.hoodie/20200812121850.commit.requested 2020-08-12 12:19:33,081 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Create new file for toInstant ?s3a://redact/my2/multpk7/.hoodie/20200812121850.inflight 2020-08-12 12:19:33,082 [main] INFO org.apache.hudi.table.HoodieCopyOnWriteTable - AvgRecordSize => 1024 2020-08-12 12:19:33,184 [main] INFO org.apache.hudi.table.HoodieCopyOnWriteTable - For partitionPath : default Small Files => [] 2020-08-12 12:19:33,184 [main] INFO org.apache.hudi.table.HoodieCopyOnWriteTable - After small file assignment: unassignedInserts => 103, totalInsertBuckets => 1, recordsPerBucket => 122880 2020-08-12 12:19:33,185 [main] INFO org.apache.hudi.table.HoodieCopyOnWriteTable - Total insert buckets for partition path default => [WorkloadStat {bucketNumber=0, weight=1.0}] 2020-08-12 12:19:33,186 [main] INFO 
org.apache.hudi.table.HoodieCopyOnWriteTable - Total Buckets :1, buckets info => {0=BucketInfo {bucketType=INSERT, fileIdPrefix=a9ab6f7a-4def-490a-aac0-49e15ee9d742}}, Partition to insert buckets => {default=[WorkloadStat {bucketNumber=0, weight=1.0}]}, UpdateLocations mapped to buckets =>{} 2020-08-12 12:19:33,206 [main] INFO org.apache.hudi.client.AbstractHoodieWriteClient - Auto commit disabled for 20200812121850 2020-08-12 12:19:41,179 [main] INFO org.apache.hudi.client.AbstractHoodieWriteClient - Commiting 20200812121850 2020-08-12 12:19:41,502 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [[==>20200812121850__commit__INFLIGHT]] 2020-08-12 12:19:41,777 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [[==>20200812121850__commit__INFLIGHT]] 2020-08-12 12:19:42,140 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [[==>20200812121850__commit__INFLIGHT]] 2020-08-12 12:19:42,479 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [[==>20200812121850__commit__INFLIGHT]] 2020-08-12 12:19:42,706 [main] INFO org.apache.hudi.table.HoodieTable - Removing marker directory=s3a://redact/my2/multpk7/.hoodie/.temp/20200812121850 2020-08-12 12:19:43,027 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Marking instant complete [==>20200812121850__commit__INFLIGHT] 2020-08-12 12:19:43,027 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Checking for file exists ?s3a://redact/my2/multpk7/.hoodie/20200812121850.inflight 2020-08-12 12:19:43,356 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Create new file for toInstant ?s3a://redact/my2/multpk7/.hoodie/20200812121850.commit 2020-08-12 12:19:43,357 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Completed [==>20200812121850__commit__INFLIGHT] 2020-08-12 12:19:43,745 [main] INFO 
org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [[20200812121850__commit__COMPLETED]] 2020-08-12 12:19:44,010 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [[20200812121850__commit__COMPLETED]] 2020-08-12 12:19:44,084 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [[==>20200812121850__commit__REQUESTED], [==>20200812121850__commit__INFLIGHT], [20200812121850__commit__COMPLETED]] 2020-08-12 12:19:44,085 [main] INFO org.apache.hudi.table.HoodieCommitArchiveLog - No Instants to archive 2020-08-12 12:19:44,086 [main] INFO org.apache.hudi.client.HoodieWriteClient - Auto cleaning is enabled. Running cleaner now 2020-08-12 12:19:44,356 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [[20200812121850__commit__COMPLETED]] 2020-08-12 12:19:44,629 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [[20200812121850__commit__COMPLETED]] 2020-08-12 12:19:44,912 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [[20200812121850__commit__COMPLETED]] 2020-08-12 12:19:45,321 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [[20200812121850__commit__COMPLETED]] 2020-08-12 12:19:45,337 [main] INFO org.apache.hudi.table.CleanHelper - No earliest commit to retain. No need to scan partitions !! 2020-08-12 12:19:45,337 [main] INFO org.apache.hudi.table.HoodieCopyOnWriteTable - Nothing to clean here. It is already clean 2020-08-12 12:19:45,374 [main] INFO org.apache.hudi.client.AbstractHoodieWriteClient - Committed 20200812121850 2020-08-12 12:19:45,374 [main] INFO org.apache.hudi.utilities.deltastreamer.DeltaSync - Commit 20200812121850 successful! 2020-08-12 12:19:45,375 [main] INFO org.apache.hudi.utilities.deltastreamer.DeltaSync - Syncing target hoodie table with hive table(dmstest_multpk7). 
Hive metastore URL :jdbc:hive2://localhost:10000, basePath :s3a://redact/my2/multpk7 2020-08-12 12:19:45,636 [main] INFO org.apache.hudi.common.table.timeline.HoodieActiveTimeline - Loaded instants [[20200812121850__commit__COMPLETED]] 2020-08-12 12:19:46,806 [main] INFO org.apache.hudi.hive.HiveSyncTool - Trying to sync hoodie table dmstest_multpk7 with base path s3a://redact/my2/multpk7 of type COPY_ON_WRITE 2020-08-12 12:19:46,864 [main] INFO org.apache.hudi.hive.HoodieHiveClient - Reading schema from s3a://redact/my2/multpk7/default/a9ab6f7a-4def-490a-aac0-49e15ee9d742-0_0-25-15010_20200812121850.parquet 2020-08-12 12:19:47,064 [main] INFO org.apache.hudi.hive.HiveSyncTool - Hive table dmstest_multpk7 is not found. Creating it 2020-08-12 12:19:47,070 [main] INFO org.apache.hudi.hive.HoodieHiveClient - Creating table with CREATE EXTERNAL TABLE IF NOT EXISTS `redact`.`dmstest_multpk7`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `Op` string, `Id` int, `AuditProcessHistoryId` int, `org_id` int, `org_name` string, `org_sname` string, `org_mnem` string, `org_parent` int, `percent_holding` double, `group_company` string, `grp_ord_for_cln` string, `mkt_only` string, `pro_rate_ind` string, `show_shapes` string, `sec_code_pref` string, `alert_org_ref` string, `swift_bic` string, `exec_breakdown` string, `notes` string, `active` string, `version_no` int, `sys_date` bigint, `sys_user` string, `create_date` bigint, `cntry_of_dom` string, `client` string, `alert_acronym` string, `oneoff_client` string, `booking_domicile` string, `booking_dom_list` string, `TimeCreated` bigint, `UserCreated` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://redact/my2/multpk7' 2020-08-12 
12:19:47,151 [main] INFO org.apache.hudi.hive.HoodieHiveClient - Time taken to start SessionState and create Driver: 81 ms
2020-08-12 12:19:47,186 [main] INFO hive.ql.parse.ParseDriver - Parsing command: CREATE EXTERNAL TABLE IF NOT EXISTS `redact`.`dmstest_multpk7`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `Op` string, `Id` int, `AuditProcessHistoryId` int, `org_id` int, `org_name` string, `org_sname` string, `org_mnem` string, `org_parent` int, `percent_holding` double, `group_company` string, `grp_ord_for_cln` string, `mkt_only` string, `pro_rate_ind` string, `show_shapes` string, `sec_code_pref` string, `alert_org_ref` string, `swift_bic` string, `exec_breakdown` string, `notes` string, `active` string, `version_no` int, `sys_date` bigint, `sys_user` string, `create_date` bigint, `cntry_of_dom` string, `client` string, `alert_acronym` string, `oneoff_client` string, `booking_domicile` string, `booking_dom_list` string, `TimeCreated` bigint, `UserCreated` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://redact/my2/multpk7'
2020-08-12 12:19:47,874 [main] INFO hive.ql.parse.ParseDriver - Parse Completed
2020-08-12 12:19:48,323 [main] INFO org.apache.hudi.hive.HoodieHiveClient - Time taken to execute [CREATE EXTERNAL TABLE IF NOT EXISTS `redact`.`dmstest_multpk7`( `_hoodie_commit_time` string, `_hoodie_commit_seqno` string, `_hoodie_record_key` string, `_hoodie_partition_path` string, `_hoodie_file_name` string, `Op` string, `Id` int, `AuditProcessHistoryId` int, `org_id` int, `org_name` string, `org_sname` string, `org_mnem` string, `org_parent` int, `percent_holding` double, `group_company` string, `grp_ord_for_cln` string, `mkt_only` string, `pro_rate_ind` string, `show_shapes` string, `sec_code_pref` string, `alert_org_ref` string, `swift_bic` string, `exec_breakdown` string, `notes` string, `active` string, `version_no` int, `sys_date` bigint, `sys_user` string, `create_date` bigint, `cntry_of_dom` string, `client` string, `alert_acronym` string, `oneoff_client` string, `booking_domicile` string, `booking_dom_list` string, `TimeCreated` bigint, `UserCreated` string) ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS INPUTFORMAT 'org.apache.hudi.hadoop.HoodieParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' LOCATION 's3a://redact/my2/multpk7']: 1171 ms
2020-08-12 12:19:48,329 [main] INFO org.apache.hudi.hive.HiveSyncTool - Schema sync complete. Syncing partitions for dmstest_multpk7
2020-08-12 12:19:48,329 [main] INFO org.apache.hudi.hive.HiveSyncTool - Last commit time synced was found to be null
2020-08-12 12:19:48,330 [main] INFO org.apache.hudi.hive.HoodieHiveClient - Last commit time synced is not known, listing all partitions in s3a://redact/my2/multpk7,FS :S3AFileSystem{uri=s3a://redact, workingDir=s3a://redact/user/ec2-user, inputPolicy=normal, partSize=104857600, enableMultiObjectsDelete=true, maxKeys=5000, readAhead=65536, blockSize=33554432, multiPartThreshold=2147483647, serverSideEncryptionAlgorithm='AES256', blockFactory=org.apache.hadoop.fs.s3a.S3ADataBlocks$DiskBlockFactory@62765aec, boundedExecutor=BlockingThreadPoolExecutorService{SemaphoredDelegatingExecutor{permitCount=2405, available=2405, waiting=0}, activeCount=0}, unboundedExecutor=java.util.concurrent.ThreadPoolExecutor@6f5bd362[Running, pool size = 6, active threads = 0, queued tasks = 0, completed tasks = 6], statistics {445890 bytes read, 4324 bytes written, 172 read ops, 0 large read ops, 31 write ops}, metrics {{Context=S3AFileSystem} {FileSystemId=aad8f6ce-2b40-4ddb-9b9b-4e82033cb193-redact} {fsURI=s3a://redact/sparkevents} {files_created=5} {files_copied=0} {files_copied_bytes=0} {files_deleted=1} {fake_directories_deleted=0} {directories_created=6} {directories_deleted=0} {ignored_errors=4} {op_copy_from_local_file=0} {op_exists=53} {op_get_file_status=145} {op_glob_status=0} {op_is_directory=38} {op_is_file=0} {op_list_files=1} {op_list_located_status=0} {op_list_status=19} {op_mkdirs=5} {op_rename=0} {object_copy_requests=0} {object_delete_requests=5} {object_list_requests=140} {object_continue_list_requests=0} {object_metadata_requests=265} {object_multipart_aborted=0} {object_put_bytes=4324} {object_put_requests=10} {object_put_requests_completed=10} {stream_write_failures=0} {stream_write_block_uploads=0} {stream_write_block_uploads_committed=0} {stream_write_block_uploads_aborted=0} {stream_write_total_time=0} {stream_write_total_data=4324} {object_put_requests_active=0} {object_put_bytes_pending=0} {stream_write_block_uploads_active=0} {stream_write_block_uploads_pending=4} {stream_write_block_uploads_data_pending=0} {stream_read_fully_operations=0} {stream_opened=22} {stream_bytes_skipped_on_seek=0} {stream_closed=22} {stream_bytes_backwards_on_seek=438082} {stream_bytes_read=445890} {stream_read_operations_incomplete=71} {stream_bytes_discarded_in_abort=0} {stream_close_operations=22} {stream_read_operations=2764} {stream_aborted=0} {stream_forward_seek_operations=0} {stream_backward_seek_operations=1} {stream_seek_operations=1} {stream_bytes_read_in_close=8} {stream_read_exceptions=0} }}
2020-08-12 12:19:48,584 [main] INFO org.apache.hudi.hive.HiveSyncTool - Storage partitions scan complete. Found 1
2020-08-12 12:19:48,613 [main] INFO org.apache.hudi.hive.HiveSyncTool - New Partitions []
2020-08-12 12:19:48,614 [main] INFO org.apache.hudi.hive.HoodieHiveClient - No partitions to add for dmstest_multpk7
2020-08-12 12:19:48,614 [main] INFO org.apache.hudi.hive.HiveSyncTool - Changed Partitions []
2020-08-12 12:19:48,614 [main] INFO org.apache.hudi.hive.HoodieHiveClient - No partitions to change for dmstest_multpk7
2020-08-12 12:19:49,002 [main] INFO org.apache.hudi.hive.HiveSyncTool - Sync complete for dmstest_multpk7
2020-08-12 12:19:49,031 [main] INFO org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer - Shut down deltastreamer
2020-08-12 12:19:49,044 [main] INFO org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Shutting down all executors
```

```
aws s3 ls s3://redact/my2/multpk7/
                           PRE .hoodie/
                           PRE default/
aws s3 ls s3://redact/my2/multpk7/default/
2020-08-12 12:19:39         93 .hoodie_partition_metadata
2020-08-12 12:19:41     452644 a9ab6f7a-4def-490a-aac0-49e15ee9d742-0_0-25-15010_20200812121850.parquet
```

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
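For reference, the manual workaround described at the top of this comment (pointing the Hive table at the `default` directory) could be scripted roughly as follows. This is only a sketch, not a proper fix: the database, table, base path, and directory name are copied from the command and listing in this comment, and the `beeline` call is left commented out because it assumes a HiveServer2 reachable at the `jdbc:hive2://localhost:10000` URL that appears in the log.

```shell
#!/bin/sh
# Values copied from the spark-submit command and the aws s3 ls listing in this comment.
DB='redact'
TABLE='dmstest_multpk7'
BASE_PATH='s3a://redact/my2/multpk7'
# Directory under the base path where the job wrote the parquet file.
PARTITION_DIR='default'

# Build the Hive statement that repoints the table at the data directory.
STMT="ALTER TABLE \`${DB}\`.\`${TABLE}\` SET LOCATION '${BASE_PATH}/${PARTITION_DIR}'"
echo "$STMT"

# Assumption: beeline is on PATH and HiveServer2 listens on localhost:10000.
# Uncomment to apply the change:
# beeline -u jdbc:hive2://localhost:10000 -e "$STMT"
```

The statement would have to be re-applied if the table is dropped and recreated by a later sync, which is why a change on the Hudi side still seems preferable.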