soumilshah1995 opened a new issue, #7451: URL: https://github.com/apache/hudi/issues/7451
Hello, I am using Glue 4.0 and I am trying to make a tutorial for the community. The title is "Move data from DynamoDB into Apache Hudi".

##### Table

![image](https://user-images.githubusercontent.com/39345855/207491290-6a23b4ce-1ed9-4e2c-8839-03afdb11b883.png)

##### Sample JSON

```
{
  "id": {"S": "bbbe96f6-f762-4750-a946-a5e0dc5d0fa5"},
  "address": {"S": "361 Terri Rapids\nDonnaberg, ID 55397"},
  "city": {"S": "361 Terri Rapids\nDonnaberg, ID 55397"},
  "first_name": {"S": "Melissa"},
  "last_name": {"S": "Frederick"},
  "state": {"S": "Miss what personal door country energy rate. Court school its indicate. Remember finally because debate role hospital appear."},
  "text": {"S": "Miss what personal door country energy rate. Court school its indicate. Remember finally because debate role hospital appear."}
}
```

#### Glue table

![image](https://user-images.githubusercontent.com/39345855/207491453-876af9d6-2be9-4cd2-a1d5-8c0e5b6e9565.png)

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node AWS Glue Data Catalog
AWSGlueDataCatalog_node1670979530578 = glueContext.create_dynamic_frame.from_catalog(
    database="dev.dynamodbdb",
    table_name="dev_users",
    transformation_ctx="AWSGlueDataCatalog_node1670979530578",
)

# Script generated for node Rename Field
RenameField_node1670981729330 = RenameField.apply(
    frame=AWSGlueDataCatalog_node1670979530578,
    old_name="id",
    new_name="pk",
    transformation_ctx="RenameField_node1670981729330",
)

# Script generated for node Change Schema (Apply Mapping)
ChangeSchemaApplyMapping_node1670981753064 = ApplyMapping.apply(
    frame=RenameField_node1670981729330,
    mappings=[
        ("address", "string", "address", "string"),
        ("city", "string", "city", "string"),
        ("last_name", "string", "last_name", "string"),
        ("text", "string", "text", "string"),
        ("pk", "string", "pk", "string"),
        ("state", "string", "state", "string"),
        ("first_name", "string", "first_name", "string"),
    ],
    transformation_ctx="ChangeSchemaApplyMapping_node1670981753064",
)

additional_options = {
    "hoodie.datasource.hive_sync.database": "hudidb",
    "hoodie.table.name": "hudi_table",
    "hoodie.datasource.hive_sync.table": "hudi_table",
    "hoodie.datasource.write.storage.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.recordkey.field": "pk",
    "hoodie.datasource.write.precombine.field": "pk",
    "hoodie.combine.before.delete": "false",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.enable": "true",
    "hoodie.datasource.hive_sync.sync_as_datasource": "false",
    "hoodie.datasource.hive_sync.use_jdbc": "false",
    "hoodie.datasource.hive_sync.partition_extractor_class": "org.apache.hudi.hive.MultiPartKeysValueExtractor",
    "hoodie.datasource.hive_sync.mode": "hms",
    "path": "s3://glue-learn-begineers/data/",
}

df = ChangeSchemaApplyMapping_node1670981753064.toDF()
df.write.format("hudi").options(**additional_options).mode("overwrite").save()

job.commit()
```

#### Error Message

An error occurred while calling o138.save. Failed to upsert for commit time 20221214022409792

##### Detailed Logs

```
2022-12-14 02:24:16,947 INFO [main] spark.SecurityManager (Logging.scala:logInfo(61)): Changing modify acls to: spark
2022-12-14 02:24:16,948 INFO [main] spark.SecurityManager (Logging.scala:logInfo(61)): Changing view acls groups to:
2022-12-14 02:24:16,949 INFO [main] spark.SecurityManager (Logging.scala:logInfo(61)): Changing modify acls groups to:
2022-12-14 02:24:16,949 INFO [main] spark.SecurityManager (Logging.scala:logInfo(61)): SecurityManager: authentication enabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
2022-12-14 02:24:17,452 INFO [netty-rpc-connection-0] client.TransportClientFactory (TransportClientFactory.java:createClient(310)): Successfully created connection to /172.35.86.171:37187 after 201 ms (126 ms spent in bootstraps)
2022-12-14 02:24:17,552 INFO [main] spark.SecurityManager (Logging.scala:logInfo(61)): Changing view acls to: spark
2022-12-14 02:24:17,552 INFO [main] spark.SecurityManager (Logging.scala:logInfo(61)): Changing modify acls to: spark
2022-12-14 02:24:17,553 INFO [main] spark.SecurityManager (Logging.scala:logInfo(61)): Changing view acls groups to:
2022-12-14 02:24:17,553 INFO [main] spark.SecurityManager (Logging.scala:logInfo(61)): Changing modify acls groups to:
2022-12-14 02:24:17,556 INFO [main] spark.SecurityManager (Logging.scala:logInfo(61)): SecurityManager: authentication enabled; ui acls disabled; users with view permissions: Set(spark); groups with view permissions: Set(); users with modify permissions: Set(spark); groups with modify permissions: Set()
2022-12-14 02:24:17,649 INFO [netty-rpc-connection-0] client.TransportClientFactory (TransportClientFactory.java:createClient(310)): Successfully created connection to /172.35.86.171:37187 after 12 ms (6 ms spent in bootstraps)
2022-12-14 02:24:17,727 INFO [main] storage.DiskBlockManager (Logging.scala:logInfo(61)): Created local directory at /tmp/blockmgr-a35a422c-50cc-41bd-a5b6-404abcb60907
2022-12-14 02:24:17,746 INFO [main] memory.MemoryStore (Logging.scala:logInfo(61)): MemoryStore started with capacity 5.8 GiB
2022-12-14 02:24:17,813 INFO [main] sink.GlueCloudwatchSink (GlueCloudwatchSink.scala:<init>(53)): GlueCloudwatchSink: get cloudwatch client using proxy: host null, port -1
2022-12-14 02:24:17,896 INFO [main] sink.GlueCloudwatchSink (GlueCloudwatchSink.scala:logInfo(22)): CloudwatchSink: Obtained credentials from the Instance Profile
2022-12-14 02:24:17,943 INFO [main] sink.GlueCloudwatchSink (GlueCloudwatchSink.scala:logInfo(22)): CloudwatchSink: jobName: hudi-glue-4 jobRunId: jr_b30b8eca9bb8e2c66af24af7fdee89dc3935906b7d44cd20791bc373bd250065
2022-12-14 02:24:18,006 INFO [main] subresultcache.SubResultCacheManager (Logging.scala:logInfo(61)): Sub-result caches are disabled.
2022-12-14 02:24:18,025 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Connecting to driver: spark://CoarseGrainedScheduler@172.35.86.171:37187
2022-12-14 02:24:18,548 INFO [dispatcher-Executor] resource.ResourceUtils (Logging.scala:logInfo(61)): ==============================================================
2022-12-14 02:24:18,549 INFO [dispatcher-Executor] resource.ResourceUtils (Logging.scala:logInfo(61)): No custom resources configured for spark.executor.
2022-12-14 02:24:18,550 INFO [dispatcher-Executor] resource.ResourceUtils (Logging.scala:logInfo(61)): ==============================================================
2022-12-14 02:24:18,604 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Successfully registered with driver
2022-12-14 02:24:18,613 INFO [dispatcher-Executor] executor.Executor (Logging.scala:logInfo(61)): Starting executor ID 1 on host 172.35.38.197
2022-12-14 02:24:18,682 INFO [dispatcher-Executor] util.Utils (Logging.scala:logInfo(61)): Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39803.
2022-12-14 02:24:18,683 INFO [dispatcher-Executor] netty.NettyBlockTransferService (NettyBlockTransferService.scala:init(82)): Server created on 172.35.38.197:39803
2022-12-14 02:24:18,685 INFO [dispatcher-Executor] storage.BlockManager (Logging.scala:logInfo(61)): Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2022-12-14 02:24:18,695 INFO [dispatcher-Executor] storage.BlockManagerMaster (Logging.scala:logInfo(61)): Registering BlockManager BlockManagerId(1, 172.35.38.197, 39803, None)
2022-12-14 02:24:18,707 INFO [dispatcher-Executor] storage.BlockManagerMaster (Logging.scala:logInfo(61)): Registered BlockManager BlockManagerId(1, 172.35.38.197, 39803, None)
2022-12-14 02:24:18,708 INFO [dispatcher-Executor] storage.BlockManager (Logging.scala:logInfo(61)): Initialized BlockManager: BlockManagerId(1, 172.35.38.197, 39803, None)
2022-12-14 02:24:18,718 INFO [dispatcher-Executor] executor.Executor (Logging.scala:logInfo(61)): Starting executor with user classpath (userClassPathFirst = false): ''
2022-12-14 02:24:18,905 INFO [pool-11-thread-3] util.PlatformInfo (PlatformInfo.java:getJobFlowId(56)): Unable to read clusterId from http://localhost:8321/configuration, trying extra instance data file: /var/lib/instance-controller/extraInstanceData.json
2022-12-14 02:24:18,906 INFO [pool-11-thread-3] util.PlatformInfo (PlatformInfo.java:getJobFlowId(63)): Unable to read clusterId from /var/lib/instance-controller/extraInstanceData.json, trying EMR job-flow data file: /var/lib/info/job-flow.json
2022-12-14 02:24:18,907 INFO [pool-11-thread-3] util.PlatformInfo (PlatformInfo.java:getJobFlowId(71)): Unable to read clusterId from /var/lib/info/job-flow.json, out of places to look
2022-12-14 02:24:20,069 INFO [pool-11-thread-3] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): eagerFSInit: Eagerly initialized FileSystem at s3://does/not/exist in 1956 ms
2022-12-14 02:24:38,121 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 0
2022-12-14 02:24:38,211 INFO [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] executor.Executor (Logging.scala:logInfo(61)): Running task 0.0 in stage 0.0 (TID 0)
2022-12-14 02:24:39,166 INFO [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Started reading broadcast variable 4 with 1 pieces (estimated total size 4.0 MiB)
2022-12-14 02:24:39,212 INFO [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] client.TransportClientFactory (TransportClientFactory.java:createClient(310)): Successfully created connection to /172.35.86.171:39523 after 7 ms (5 ms spent in bootstraps)
2022-12-14 02:24:39,267 INFO [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_4_piece0 stored as bytes in memory (estimated size 36.9 KiB, free 5.8 GiB)
2022-12-14 02:24:39,277 INFO [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Reading broadcast variable 4 took 111 ms
2022-12-14 02:24:39,321 INFO [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_4 stored as values in memory (estimated size 103.0 KiB, free 5.8 GiB)
2022-12-14 02:24:41,268 INFO [Executor task launch worker for task 0.0 in stage 0.0 (TID 0)] executor.Executor (Logging.scala:logInfo(61)): Finished task 0.0 in stage 0.0 (TID 0). 1012 bytes result sent to driver
2022-12-14 02:24:41,481 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 1
2022-12-14 02:24:41,482 INFO [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] executor.Executor (Logging.scala:logInfo(61)): Running task 0.0 in stage 1.0 (TID 1)
2022-12-14 02:24:41,538 INFO [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Started reading broadcast variable 5 with 1 pieces (estimated total size 4.0 MiB)
2022-12-14 02:24:41,544 INFO [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_5_piece0 stored as bytes in memory (estimated size 1882.0 B, free 5.8 GiB)
2022-12-14 02:24:41,550 INFO [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Reading broadcast variable 5 took 12 ms
2022-12-14 02:24:41,552 INFO [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_5 stored as values in memory (estimated size 3.1 KiB, free 5.8 GiB)
2022-12-14 02:24:41,558 INFO [Executor task launch worker for task 0.0 in stage 1.0 (TID 1)] executor.Executor (Logging.scala:logInfo(61)): Finished task 0.0 in stage 1.0 (TID 1). 803 bytes result sent to driver
2022-12-14 02:24:44,355 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 2
2022-12-14 02:24:44,408 INFO [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] executor.Executor (Logging.scala:logInfo(61)): Running task 0.0 in stage 2.0 (TID 2)
2022-12-14 02:24:44,414 INFO [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Started reading broadcast variable 6 with 1 pieces (estimated total size 4.0 MiB)
2022-12-14 02:24:44,420 INFO [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_6_piece0 stored as bytes in memory (estimated size 5.3 KiB, free 5.8 GiB)
2022-12-14 02:24:44,424 INFO [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Reading broadcast variable 6 took 10 ms
2022-12-14 02:24:44,426 INFO [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_6 stored as values in memory (estimated size 9.8 KiB, free 5.8 GiB)
2022-12-14 02:24:44,556 INFO [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] memory.MemoryStore (Logging.scala:logInfo(61)): Block rdd_22_0 stored as values in memory (estimated size 224.0 B, free 5.8 GiB)
2022-12-14 02:24:44,624 INFO [Executor task launch worker for task 0.0 in stage 2.0 (TID 2)] executor.Executor (Logging.scala:logInfo(61)): Finished task 0.0 in stage 2.0 (TID 2). 1096 bytes result sent to driver
2022-12-14 02:24:44,684 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 3
2022-12-14 02:24:44,687 INFO [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] executor.Executor (Logging.scala:logInfo(61)): Running task 0.0 in stage 3.0 (TID 3)
2022-12-14 02:24:44,687 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 4
2022-12-14 02:24:44,688 INFO [Executor task launch worker for task 1.0 in stage 3.0 (TID 4)] executor.Executor (Logging.scala:logInfo(61)): Running task 1.0 in stage 3.0 (TID 4)
2022-12-14 02:24:44,688 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 5
2022-12-14 02:24:44,691 INFO [Executor task launch worker for task 2.0 in stage 3.0 (TID 5)] executor.Executor (Logging.scala:logInfo(61)): Running task 2.0 in stage 3.0 (TID 5)
2022-12-14 02:24:44,691 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 6
2022-12-14 02:24:44,693 INFO [Executor task launch worker for task 1.0 in stage 3.0 (TID 4)] spark.MapOutputTrackerWorker (Logging.scala:logInfo(61)): Updating epoch to 1 and clearing cache
2022-12-14 02:24:44,695 INFO [Executor task launch worker for task 3.0 in stage 3.0 (TID 6)] executor.Executor (Logging.scala:logInfo(61)): Running task 3.0 in stage 3.0 (TID 6)
2022-12-14 02:24:44,696 INFO [Executor task launch worker for task 1.0 in stage 3.0 (TID 4)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Started reading broadcast variable 7 with 1 pieces (estimated total size 4.0 MiB)
2022-12-14 02:24:44,705 INFO [Executor task launch worker for task 1.0 in stage 3.0 (TID 4)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_7_piece0 stored as bytes in memory (estimated size 3.3 KiB, free 5.8 GiB)
2022-12-14 02:24:44,709 INFO [Executor task launch worker for task 1.0 in stage 3.0 (TID 4)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Reading broadcast variable 7 took 12 ms
2022-12-14 02:24:44,711 INFO [Executor task launch worker for task 1.0 in stage 3.0 (TID 4)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_7 stored as values in memory (estimated size 5.7 KiB, free 5.8 GiB)
2022-12-14 02:24:44,717 INFO [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] spark.MapOutputTrackerWorker (Logging.scala:logInfo(61)): Don't have map outputs for shuffle 0, fetching them
2022-12-14 02:24:44,717 INFO [Executor task launch worker for task 1.0 in stage 3.0 (TID 4)] spark.MapOutputTrackerWorker (Logging.scala:logInfo(61)): Don't have map outputs for shuffle 0, fetching them
2022-12-14 02:24:44,719 INFO [Executor task launch worker for task 2.0 in stage 3.0 (TID 5)] spark.MapOutputTrackerWorker (Logging.scala:logInfo(61)): Don't have map outputs for shuffle 0, fetching them
2022-12-14 02:24:44,719 INFO [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] spark.MapOutputTrackerWorker (Logging.scala:logInfo(61)): Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@172.35.86.171:37187)
2022-12-14 02:24:44,722 INFO [Executor task launch worker for task 3.0 in stage 3.0 (TID 6)] spark.MapOutputTrackerWorker (Logging.scala:logInfo(61)): Don't have map outputs for shuffle 0, fetching them
2022-12-14 02:24:44,896 INFO [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] spark.MapOutputTrackerWorker (Logging.scala:logInfo(61)): Got the map output locations
2022-12-14 02:24:44,939 INFO [Executor task launch worker for task 3.0 in stage 3.0 (TID 6)] storage.ShuffleBlockFetcherIterator (Logging.scala:logInfo(61)): Getting 0 (0.0 B) non-empty blocks including 0 (0.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
2022-12-14 02:24:44,939 INFO [Executor task launch worker for task 1.0 in stage 3.0 (TID 4)] storage.ShuffleBlockFetcherIterator (Logging.scala:logInfo(61)): Getting 0 (0.0 B) non-empty blocks including 0 (0.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
2022-12-14 02:24:44,940 INFO [Executor task launch worker for task 2.0 in stage 3.0 (TID 5)] storage.ShuffleBlockFetcherIterator (Logging.scala:logInfo(61)): Getting 0 (0.0 B) non-empty blocks including 0 (0.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
2022-12-14 02:24:44,939 INFO [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] storage.ShuffleBlockFetcherIterator (Logging.scala:logInfo(61)): Getting 1 (142.0 B) non-empty blocks including 1 (142.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
2022-12-14 02:24:44,943 INFO [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] storage.ShuffleBlockFetcherIterator (Logging.scala:logInfo(61)): Started 0 remote fetches in 20 ms
2022-12-14 02:24:44,943 INFO [Executor task launch worker for task 1.0 in stage 3.0 (TID 4)] storage.ShuffleBlockFetcherIterator (Logging.scala:logInfo(61)): Started 0 remote fetches in 15 ms
2022-12-14 02:24:44,943 INFO [Executor task launch worker for task 3.0 in stage 3.0 (TID 6)] storage.ShuffleBlockFetcherIterator (Logging.scala:logInfo(61)): Started 0 remote fetches in 16 ms
2022-12-14 02:24:44,949 INFO [Executor task launch worker for task 2.0 in stage 3.0 (TID 5)] storage.ShuffleBlockFetcherIterator (Logging.scala:logInfo(61)): Started 0 remote fetches in 24 ms
2022-12-14 02:24:44,978 INFO [Executor task launch worker for task 2.0 in stage 3.0 (TID 5)] executor.Executor (Logging.scala:logInfo(61)): Finished task 2.0 in stage 3.0 (TID 5). 1278 bytes result sent to driver
2022-12-14 02:24:44,978 INFO [Executor task launch worker for task 3.0 in stage 3.0 (TID 6)] executor.Executor (Logging.scala:logInfo(61)): Finished task 3.0 in stage 3.0 (TID 6). 1278 bytes result sent to driver
2022-12-14 02:24:44,983 INFO [Executor task launch worker for task 1.0 in stage 3.0 (TID 4)] executor.Executor (Logging.scala:logInfo(61)): Finished task 1.0 in stage 3.0 (TID 4). 1321 bytes result sent to driver
2022-12-14 02:24:44,993 INFO [Executor task launch worker for task 0.0 in stage 3.0 (TID 3)] executor.Executor (Logging.scala:logInfo(61)): Finished task 0.0 in stage 3.0 (TID 3). 1406 bytes result sent to driver
2022-12-14 02:24:45,198 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 7
2022-12-14 02:24:45,199 INFO [Executor task launch worker for task 0.0 in stage 4.0 (TID 7)] executor.Executor (Logging.scala:logInfo(61)): Running task 0.0 in stage 4.0 (TID 7)
2022-12-14 02:24:45,202 INFO [Executor task launch worker for task 0.0 in stage 4.0 (TID 7)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Started reading broadcast variable 8 with 1 pieces (estimated total size 4.0 MiB)
2022-12-14 02:24:45,209 INFO [Executor task launch worker for task 0.0 in stage 4.0 (TID 7)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_8_piece0 stored as bytes in memory (estimated size 122.9 KiB, free 5.8 GiB)
2022-12-14 02:24:45,213 INFO [Executor task launch worker for task 0.0 in stage 4.0 (TID 7)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Reading broadcast variable 8 took 10 ms
2022-12-14 02:24:45,216 INFO [Executor task launch worker for task 0.0 in stage 4.0 (TID 7)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_8 stored as values in memory (estimated size 344.0 KiB, free 5.8 GiB)
2022-12-14 02:24:45,390 INFO [Executor task launch worker for task 0.0 in stage 4.0 (TID 7)] executor.Executor (Logging.scala:logInfo(61)): Finished task 0.0 in stage 4.0 (TID 7). 848 bytes result sent to driver
2022-12-14 02:24:47,412 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 8
2022-12-14 02:24:47,413 INFO [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] executor.Executor (Logging.scala:logInfo(61)): Running task 0.0 in stage 5.0 (TID 8)
2022-12-14 02:24:47,415 INFO [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Started reading broadcast variable 9 with 1 pieces (estimated total size 4.0 MiB)
2022-12-14 02:24:47,422 INFO [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_9_piece0 stored as bytes in memory (estimated size 124.6 KiB, free 5.8 GiB)
2022-12-14 02:24:47,425 INFO [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Reading broadcast variable 9 took 9 ms
2022-12-14 02:24:47,427 INFO [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_9 stored as values in memory (estimated size 348.2 KiB, free 5.8 GiB)
2022-12-14 02:24:47,455 INFO [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] storage.BlockManager (Logging.scala:logInfo(61)): Found block rdd_22_0 locally
2022-12-14 02:24:47,481 INFO [Executor task launch worker for task 0.0 in stage 5.0 (TID 8)] executor.Executor (Logging.scala:logInfo(61)): Finished task 0.0 in stage 5.0 (TID 8). 1050 bytes result sent to driver
2022-12-14 02:24:47,533 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 9
2022-12-14 02:24:47,533 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] executor.Executor (Logging.scala:logInfo(61)): Running task 0.0 in stage 6.0 (TID 9)
2022-12-14 02:24:47,534 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] spark.MapOutputTrackerWorker (Logging.scala:logInfo(61)): Updating epoch to 2 and clearing cache
2022-12-14 02:24:47,536 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Started reading broadcast variable 10 with 1 pieces (estimated total size 4.0 MiB)
2022-12-14 02:24:47,542 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_10_piece0 stored as bytes in memory (estimated size 168.1 KiB, free 5.8 GiB)
2022-12-14 02:24:47,545 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Reading broadcast variable 10 took 8 ms
2022-12-14 02:24:47,548 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_10 stored as values in memory (estimated size 474.8 KiB, free 5.8 GiB)
2022-12-14 02:24:47,597 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] spark.MapOutputTrackerWorker (Logging.scala:logInfo(61)): Don't have map outputs for shuffle 1, fetching them
2022-12-14 02:24:47,597 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] spark.MapOutputTrackerWorker (Logging.scala:logInfo(61)): Doing the fetch; tracker endpoint = NettyRpcEndpointRef(spark://MapOutputTracker@172.35.86.171:37187)
2022-12-14 02:24:47,601 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] spark.MapOutputTrackerWorker (Logging.scala:logInfo(61)): Got the map output locations
2022-12-14 02:24:47,603 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] storage.ShuffleBlockFetcherIterator (Logging.scala:logInfo(61)): Getting 1 (276.0 B) non-empty blocks including 1 (276.0 B) local and 0 (0.0 B) host-local and 0 (0.0 B) push-merged-local and 0 (0.0 B) remote blocks
2022-12-14 02:24:47,603 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] storage.ShuffleBlockFetcherIterator (Logging.scala:logInfo(61)): Started 0 remote fetches in 1 ms
2022-12-14 02:24:47,624 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] deltacommit.BaseSparkDeltaCommitActionExecutor (BaseSparkDeltaCommitActionExecutor.java:handleUpdate(76)): Merging updates for commit 00000000000000 for file files-0000
2022-12-14 02:24:47,735 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] view.FileSystemViewManager (FileSystemViewManager.java:createViewManager(245)): Creating View Manager with storage type :MEMORY
2022-12-14 02:24:47,738 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] view.FileSystemViewManager (FileSystemViewManager.java:createViewManager(257)): Creating in-memory based Table View
2022-12-14 02:24:47,740 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] view.FileSystemViewManager (FileSystemViewManager.java:createInMemoryFileSystemView(165)): Creating InMemory based view for basePath s3://glue-learn-begineers/data/.hoodie/metadata
2022-12-14 02:24:47,746 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] view.AbstractTableFileSystemView (AbstractTableFileSystemView.java:resetFileGroupsReplaced(244)): Took 2 ms to read 0 instants, 0 replaced file groups
2022-12-14 02:24:48,232 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] util.ClusteringUtils (ClusteringUtils.java:getAllFileGroupsInPendingClusteringPlans(137)): Found 0 files in pending clustering operations
2022-12-14 02:24:48,233 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] view.AbstractTableFileSystemView (AbstractTableFileSystemView.java:lambda$ensurePartitionLoadedCorrectly$9(302)): Building file system view for partition (files)
2022-12-14 02:24:48,400 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] view.AbstractTableFileSystemView (AbstractTableFileSystemView.java:addFilesToView(151)): addFilesToView: NumFiles=1, NumFileGroups=1, FileGroupsCreationTime=6, StoreTimeTaken=0
2022-12-14 02:24:48,401 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] io.HoodieAppendHandle (HoodieAppendHandle.java:init(156)): New AppendHandle for partition :files
2022-12-14 02:24:49,944 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] s3n.MultipartUploadOutputStream (MultipartUploadOutputStream.java:close(421)): close closed:false s3://glue-learn-begineers/data/.hoodie/metadata/files/.hoodie_partition_metadata_0
2022-12-14 02:24:50,282 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] s3n.S3NativeFileSystem (S3NativeFileSystem.java:rename(966)): rename s3://glue-learn-begineers/data/.hoodie/metadata/files/.hoodie_partition_metadata_0 s3://glue-learn-begineers/data/.hoodie/metadata/files/.hoodie_partition_metadata using algorithm version 1
2022-12-14 02:24:53,020 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] marker.DirectWriteMarkers (DirectWriteMarkers.java:create(173)): Creating Marker Path=s3://glue-learn-begineers/data/.hoodie/metadata/.hoodie/.temp/00000000000000/files/files-0000_0-6-9_00000000000000.hfile.marker.APPEND
2022-12-14 02:24:53,255 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] s3n.MultipartUploadOutputStream (MultipartUploadOutputStream.java:close(421)): close closed:false s3://glue-learn-begineers/data/.hoodie/metadata/.hoodie/.temp/00000000000000/files/files-0000_0-6-9_00000000000000.hfile.marker.APPEND
2022-12-14 02:24:53,340 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] marker.DirectWriteMarkers (DirectWriteMarkers.java:create(178)): [direct] Created marker file s3://glue-learn-begineers/data/.hoodie/metadata/.hoodie/.temp/00000000000000/files/files-0000_0-6-9_00000000000000.hfile.marker.APPEND in 1963 ms
2022-12-14 02:24:53,346 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] log.HoodieLogFormat$WriterBuilder (HoodieLogFormat.java:build(208)): Building HoodieLogFormat Writer
2022-12-14 02:24:53,347 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] log.HoodieLogFormat$WriterBuilder (HoodieLogFormat.java:build(255)): HoodieLogFile on path s3://glue-learn-begineers/data/.hoodie/metadata/files/.files-0000_00000000000000.log.1_0-6-9
2022-12-14 02:24:53,757 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] log.HoodieLogFormatWriter (HoodieLogFormatWriter.java:getOutputStream(125)): HoodieLogFile{pathStr='s3://glue-learn-begineers/data/.hoodie/metadata/files/.files-0000_00000000000000.log.1_0-6-9', fileLen=0} does not exist. Create a new file
2022-12-14 02:24:54,305 WARN [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] impl.MetricsConfig (MetricsConfig.java:loadFirst(136)): Cannot locate configuration: tried hadoop-metrics2-hbase.properties,hadoop-metrics2.properties
2022-12-14 02:24:54,319 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] impl.MetricsSystemImpl (MetricsSystemImpl.java:startTimer(378)): Scheduled Metric snapshot period at 10 second(s).
2022-12-14 02:24:54,319 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] impl.MetricsSystemImpl (MetricsSystemImpl.java:start(191)): HBase metrics system started
2022-12-14 02:24:54,358 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] metrics.MetricRegistries (MetricRegistriesLoader.java:load(63)): Loaded MetricRegistries class org.apache.hudi.org.apache.hadoop.hbase.metrics.impl.MetricRegistriesImpl
2022-12-14 02:24:54,419 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] zlib.ZlibFactory (ZlibFactory.java:loadNativeZLib(59)): Successfully loaded & initialized native-zlib library
2022-12-14 02:24:54,421 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] compress.CodecPool (CodecPool.java:getCompressor(153)): Got brand-new compressor [.gz]
2022-12-14 02:24:54,422 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] compress.CodecPool (CodecPool.java:getCompressor(153)): Got brand-new compressor [.gz]
2022-12-14 02:24:54,579 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] io.HoodieAppendHandle (HoodieAppendHandle.java:processAppendResult(370)): AppendHandle for partitionPath files filePath files/.files-0000_00000000000000.log.1_0-6-9, took 6892 ms.
2022-12-14 02:24:54,580 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] s3n.MultipartUploadOutputStream (MultipartUploadOutputStream.java:close(421)): close closed:false s3://glue-learn-begineers/data/.hoodie/metadata/files/.files-0000_00000000000000.log.1_0-6-9
2022-12-14 02:24:54,829 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] memory.MemoryStore (Logging.scala:logInfo(61)): Block rdd_32_0 stored as values in memory (estimated size 279.0 B, free 5.8 GiB)
2022-12-14 02:24:54,901 INFO [Executor task launch worker for task 0.0 in stage 6.0 (TID 9)] executor.Executor (Logging.scala:logInfo(61)): Finished task 0.0 in stage 6.0 (TID 9). 1563 bytes result sent to driver
2022-12-14 02:24:55,049 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 10
2022-12-14 02:24:55,102 INFO [Executor task launch worker for task 0.0 in stage 8.0 (TID 10)] executor.Executor (Logging.scala:logInfo(61)): Running task 0.0 in stage 8.0 (TID 10)
2022-12-14 02:24:55,105 INFO [Executor task launch worker for task 0.0 in stage 8.0 (TID 10)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Started reading broadcast variable 11 with 1 pieces (estimated total size 4.0 MiB)
2022-12-14 02:24:55,112 INFO [Executor task launch worker for task 0.0 in stage 8.0 (TID 10)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_11_piece0 stored as bytes in memory (estimated size 168.1 KiB, free 5.8 GiB)
2022-12-14 02:24:55,114 INFO [Executor task launch worker for task 0.0 in stage 8.0 (TID 10)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Reading broadcast variable 11 took 9 ms
2022-12-14 02:24:55,117 INFO [Executor task launch worker for task 0.0 in stage 8.0 (TID 10)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_11 stored as values in memory (estimated size 474.8 KiB, free 5.8 GiB)
2022-12-14 02:24:55,133 INFO [Executor task launch worker for task 0.0 in stage 8.0 (TID 10)] storage.BlockManager (Logging.scala:logInfo(61)): Found block rdd_32_0 locally
2022-12-14 02:24:55,138 INFO [Executor task launch worker for task 0.0 in stage 8.0 (TID 10)] executor.Executor (Logging.scala:logInfo(61)): Finished task 0.0 in stage 8.0 (TID 10). 1176 bytes result sent to driver
2022-12-14 02:24:55,749 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 11
2022-12-14 02:24:55,750 INFO [Executor task launch worker for task 0.0 in stage 9.0 (TID 11)] executor.Executor (Logging.scala:logInfo(61)): Running task 0.0 in stage 9.0 (TID 11)
2022-12-14 02:24:55,752 INFO [Executor task launch worker for task 0.0 in stage 9.0 (TID 11)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Started reading broadcast variable 12 with 1 pieces (estimated total size 4.0 MiB)
2022-12-14 02:24:55,758 INFO [Executor task launch worker for task 0.0 in stage 9.0 (TID 11)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_12_piece0 stored as bytes in memory (estimated size 42.5 KiB, free 5.8 GiB)
2022-12-14 02:24:55,760 INFO [Executor task launch worker for task 0.0 in stage 9.0 (TID 11)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Reading broadcast variable 12 took 7 ms
2022-12-14 02:24:55,762 INFO [Executor task launch worker for task 0.0 in stage 9.0 (TID 11)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_12 stored as values in memory (estimated size 125.6 KiB, free 5.8 GiB)
2022-12-14 02:24:55,933 INFO [Executor task launch worker for task 0.0 in stage 9.0 (TID 11)] executor.Executor (Logging.scala:logInfo(61)): Finished task 0.0 in stage 9.0 (TID 11).
804 bytes result sent to driver 2022-12-14 02:24:56,769 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 12 2022-12-14 02:24:56,770 INFO [Executor task launch worker for task 0.0 in stage 10.0 (TID 12)] executor.Executor (Logging.scala:logInfo(61)): Running task 0.0 in stage 10.0 (TID 12) 2022-12-14 02:24:56,774 INFO [Executor task launch worker for task 0.0 in stage 10.0 (TID 12)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Started reading broadcast variable 13 with 1 pieces (estimated total size 4.0 MiB) 2022-12-14 02:24:56,779 INFO [Executor task launch worker for task 0.0 in stage 10.0 (TID 12)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_13_piece0 stored as bytes in memory (estimated size 42.5 KiB, free 5.8 GiB) 2022-12-14 02:24:56,781 INFO [Executor task launch worker for task 0.0 in stage 10.0 (TID 12)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Reading broadcast variable 13 took 6 ms 2022-12-14 02:24:56,782 INFO [Executor task launch worker for task 0.0 in stage 10.0 (TID 12)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_13 stored as values in memory (estimated size 125.8 KiB, free 5.8 GiB) 2022-12-14 02:24:57,352 INFO [Executor task launch worker for task 0.0 in stage 10.0 (TID 12)] executor.Executor (Logging.scala:logInfo(61)): Finished task 0.0 in stage 10.0 (TID 12). 
895 bytes result sent to driver 2022-12-14 02:24:58,483 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 13 2022-12-14 02:24:58,484 INFO [Executor task launch worker for task 0.0 in stage 12.0 (TID 13)] executor.Executor (Logging.scala:logInfo(61)): Running task 0.0 in stage 12.0 (TID 13) 2022-12-14 02:24:58,486 INFO [Executor task launch worker for task 0.0 in stage 12.0 (TID 13)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Started reading broadcast variable 14 with 1 pieces (estimated total size 4.0 MiB) 2022-12-14 02:24:58,491 INFO [Executor task launch worker for task 0.0 in stage 12.0 (TID 13)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_14_piece0 stored as bytes in memory (estimated size 168.1 KiB, free 5.8 GiB) 2022-12-14 02:24:58,494 INFO [Executor task launch worker for task 0.0 in stage 12.0 (TID 13)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Reading broadcast variable 14 took 7 ms 2022-12-14 02:24:58,496 INFO [Executor task launch worker for task 0.0 in stage 12.0 (TID 13)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_14 stored as values in memory (estimated size 474.3 KiB, free 5.8 GiB) 2022-12-14 02:24:58,511 INFO [Executor task launch worker for task 0.0 in stage 12.0 (TID 13)] storage.BlockManager (Logging.scala:logInfo(61)): Found block rdd_32_0 locally 2022-12-14 02:24:58,513 INFO [Executor task launch worker for task 0.0 in stage 12.0 (TID 13)] executor.Executor (Logging.scala:logInfo(61)): Finished task 0.0 in stage 12.0 (TID 13). 
1256 bytes result sent to driver 2022-12-14 02:25:03,187 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 14 2022-12-14 02:25:03,247 INFO [Executor task launch worker for task 0.0 in stage 13.0 (TID 14)] executor.Executor (Logging.scala:logInfo(61)): Running task 0.0 in stage 13.0 (TID 14) 2022-12-14 02:25:03,295 INFO [Executor task launch worker for task 0.0 in stage 13.0 (TID 14)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Started reading broadcast variable 15 with 1 pieces (estimated total size 4.0 MiB) 2022-12-14 02:25:03,300 INFO [Executor task launch worker for task 0.0 in stage 13.0 (TID 14)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_15_piece0 stored as bytes in memory (estimated size 3.9 KiB, free 5.8 GiB) 2022-12-14 02:25:03,303 INFO [Executor task launch worker for task 0.0 in stage 13.0 (TID 14)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Reading broadcast variable 15 took 7 ms 2022-12-14 02:25:03,305 INFO [Executor task launch worker for task 0.0 in stage 13.0 (TID 14)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_15 stored as values in memory (estimated size 6.8 KiB, free 5.8 GiB) 2022-12-14 02:25:03,359 INFO [Executor task launch worker for task 0.0 in stage 13.0 (TID 14)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Started reading broadcast variable 3 with 1 pieces (estimated total size 4.0 MiB) 2022-12-14 02:25:03,366 INFO [Executor task launch worker for task 0.0 in stage 13.0 (TID 14)] memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_3_piece0 stored as bytes in memory (estimated size 105.0 B, free 5.8 GiB) 2022-12-14 02:25:03,368 INFO [Executor task launch worker for task 0.0 in stage 13.0 (TID 14)] broadcast.TorrentBroadcast (Logging.scala:logInfo(61)): Reading broadcast variable 3 took 8 ms 2022-12-14 02:25:03,373 INFO [Executor task launch worker for task 0.0 in stage 13.0 (TID 14)] 
memory.MemoryStore (Logging.scala:logInfo(61)): Block broadcast_3 stored as values in memory (estimated size 40.0 B, free 5.8 GiB) 2022-12-14 02:25:03,377 ERROR [Executor task launch worker for task 0.0 in stage 13.0 (TID 14)] executor.Executor (Logging.scala:logError(98)): Exception in task 0.0 in stage 13.0 (TID 14) java.lang.NullPointerException: null at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842) ~[hadoop-client-api-3.3.3-amzn-0.jar:?] at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:458) ~[hadoop-client-api-3.3.3-amzn-0.jar:?] at com.amazonaws.services.glue.connections.DynamoConnection.getJobConf(DynamoConnection.scala:61) ~[AWSGlueDataplane-1.0.jar:?] at com.amazonaws.services.glue.connections.DynamoConnection.getReader(DynamoConnection.scala:136) ~[AWSGlueDataplane-1.0.jar:?] at com.amazonaws.services.glue.DynamicRecordRDD.compute(DataSource.scala:644) ~[AWSGlueDataplane-1.0.jar:?] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.0-amzn-1.jar:?] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.0-amzn-1.jar:?] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_352] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_352] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_352] 2022-12-14 02:25:03,413 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 15 2022-12-14 02:25:03,414 INFO [Executor task launch worker for task 0.1 in stage 13.0 (TID 15)] executor.Executor (Logging.scala:logInfo(61)): Running task 0.1 in stage 13.0 (TID 15) 2022-12-14 02:25:03,417 ERROR [Executor task launch worker for task 0.1 in stage 13.0 (TID 15)] executor.Executor (Logging.scala:logError(98)): Exception in task 0.1 in stage 13.0 (TID 15) java.lang.NullPointerException: null at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842) ~[hadoop-client-api-3.3.3-amzn-0.jar:?] at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:458) ~[hadoop-client-api-3.3.3-amzn-0.jar:?] at com.amazonaws.services.glue.connections.DynamoConnection.getJobConf(DynamoConnection.scala:61) ~[AWSGlueDataplane-1.0.jar:?] at com.amazonaws.services.glue.connections.DynamoConnection.getReader(DynamoConnection.scala:136) ~[AWSGlueDataplane-1.0.jar:?] at com.amazonaws.services.glue.DynamicRecordRDD.compute(DataSource.scala:644) ~[AWSGlueDataplane-1.0.jar:?] 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.0-amzn-1.jar:?] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.0-amzn-1.jar:?] 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_352] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_352] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_352] 2022-12-14 02:25:03,426 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 16 2022-12-14 02:25:03,426 INFO [Executor task launch worker for task 0.2 in stage 13.0 (TID 16)] executor.Executor (Logging.scala:logInfo(61)): Running task 0.2 in stage 13.0 (TID 16) 2022-12-14 02:25:03,429 ERROR [Executor task launch worker for task 0.2 in stage 13.0 (TID 16)] executor.Executor (Logging.scala:logError(98)): Exception in task 0.2 in stage 13.0 (TID 16) java.lang.NullPointerException: null at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842) ~[hadoop-client-api-3.3.3-amzn-0.jar:?] at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:458) ~[hadoop-client-api-3.3.3-amzn-0.jar:?] at com.amazonaws.services.glue.connections.DynamoConnection.getJobConf(DynamoConnection.scala:61) ~[AWSGlueDataplane-1.0.jar:?] at com.amazonaws.services.glue.connections.DynamoConnection.getReader(DynamoConnection.scala:136) ~[AWSGlueDataplane-1.0.jar:?] at com.amazonaws.services.glue.DynamicRecordRDD.compute(DataSource.scala:644) ~[AWSGlueDataplane-1.0.jar:?] 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.0-amzn-1.jar:?] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.0-amzn-1.jar:?] 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_352] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_352] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_352] 2022-12-14 02:25:03,436 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Got assigned task 17 2022-12-14 02:25:03,437 INFO [Executor task launch worker for task 0.3 in stage 13.0 (TID 17)] executor.Executor (Logging.scala:logInfo(61)): Running task 0.3 in stage 13.0 (TID 17) 2022-12-14 02:25:03,440 ERROR [Executor task launch worker for task 0.3 in stage 13.0 (TID 17)] executor.Executor (Logging.scala:logError(98)): Exception in task 0.3 in stage 13.0 (TID 17) java.lang.NullPointerException: null at org.apache.hadoop.conf.Configuration.<init>(Configuration.java:842) ~[hadoop-client-api-3.3.3-amzn-0.jar:?] at org.apache.hadoop.mapred.JobConf.<init>(JobConf.java:458) ~[hadoop-client-api-3.3.3-amzn-0.jar:?] at com.amazonaws.services.glue.connections.DynamoConnection.getJobConf(DynamoConnection.scala:61) ~[AWSGlueDataplane-1.0.jar:?] at com.amazonaws.services.glue.connections.DynamoConnection.getReader(DynamoConnection.scala:136) ~[AWSGlueDataplane-1.0.jar:?] at com.amazonaws.services.glue.DynamicRecordRDD.compute(DataSource.scala:644) ~[AWSGlueDataplane-1.0.jar:?] 
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.scheduler.Task.run(Task.scala:138) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.0-amzn-1.jar:?] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) ~[spark-core_2.12-3.3.0-amzn-1.jar:3.3.0-amzn-1] at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.0-amzn-1.jar:?] 
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_352] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_352] at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_352] 2022-12-14 02:25:03,558 INFO [block-manager-storage-async-thread-pool-39] storage.BlockManager (Logging.scala:logInfo(61)): Removing RDD 41 2022-12-14 02:25:03,570 INFO [block-manager-storage-async-thread-pool-46] storage.BlockManager (Logging.scala:logInfo(61)): Removing RDD 22 2022-12-14 02:25:03,574 INFO [block-manager-storage-async-thread-pool-49] storage.BlockManager (Logging.scala:logInfo(61)): Removing RDD 32 2022-12-14 02:25:03,878 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Driver commanded a shutdown 2022-12-14 02:25:03,880 INFO [CoarseGrainedExecutorBackend-stop-executor] sink.GlueCloudwatchSink (GlueCloudwatchSink.scala:logInfo(22)): CloudwatchSink: SparkContext stopped - not reporting metrics now. 2022-12-14 02:25:03,900 INFO [CoarseGrainedExecutorBackend-stop-executor] memory.MemoryStore (Logging.scala:logInfo(61)): MemoryStore cleared 2022-12-14 02:25:03,901 INFO [CoarseGrainedExecutorBackend-stop-executor] storage.BlockManager (Logging.scala:logInfo(61)): BlockManager stopped Continuous Logging: Shutting down cloudwatch appender. Continuous Logging: Shutting down cloudwatch appender. 2022-12-14 02:25:03,956 INFO [shutdown-hook-0] util.ShutdownHookManager (Logging.scala:logInfo(61)): Shutdown hook called ``` ##### Note i am not running parallel jobs. only single job is running. please let me know if any settings you see in my code is off i am happy to correct and give a try again. ``` "hoodie.datasource.write.recordkey.field": "pk", "hoodie.datasource.write.precombine.field": "pk", ``` can i use same for recordkey and precombine field ? 
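On the last question: a minimal sketch of the two options under discussion, for reference. `updated_at` is a hypothetical column (not in the table above) used only to illustrate the usual pattern; to my understanding, pointing both options at `pk` is accepted by Hudi, but the precombine field is what Hudi compares to pick a winner when two incoming records share the same record key, so a timestamp-like column is the common choice.

```python
# Sketch of the Hudi write options in question (not the full Glue job).
# "updated_at" is a hypothetical column: the precombine field only matters
# when two incoming records share a record key, so reusing "pk" works but
# leaves Hudi no way to order duplicates of the same key.
hudi_options = {
    "hoodie.table.name": "dev_users",  # illustrative table name
    "hoodie.datasource.write.recordkey.field": "pk",
    "hoodie.datasource.write.precombine.field": "updated_at",  # "pk" also accepted
    "hoodie.datasource.write.operation": "upsert",
}

# Reusing the record key as the precombine field is a small override:
same_field_options = {
    **hudi_options,
    "hoodie.datasource.write.precombine.field": "pk",
}
```

In a Glue job these options would typically be passed to `spark.write.format("hudi").options(**hudi_options)` or the equivalent `write_dynamic_frame` call.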
The steps and code were taken from the AWS documentation: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-hudi.html

If I am doing something wrong, I am happy to make changes based on your suggestions and try again. Looking forward to your help.