[GitHub] [hudi] wosow opened a new issue #2676: [SUPPORT] When I used 100,000 data to update 100 million data, The program is stuck

GitBox Mon, 15 Mar 2021 07:21:09 -0700


wosow opened a new issue #2676:
URL: https://github.com/apache/hudi/issues/2676



   **Environment Description**
   
   * Hudi version : 0.7.0/0.6.0
   
   * Spark version : 2.4.4 
   
   * Hive version :2.3.1
   
   * Hadoop version : 2.7.5
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   When I used 100,000 data to update 100 million data, the program was stuck 
and could not execute further. The table type used was MOR. The program 
execution diagram is as follows:
   
![image](https://user-images.githubusercontent.com/34565079/111167633-48772800-85dc-11eb-9072-1f4f7a3a2c54.png)
   
   hudi parameters as follow:
       TABLE_TYPE_OPT_KEY -> MOR_TABLE_TYPE_OPT_VAL, 
   //      OPERATION_OPT_KEY -> WriteOperationType.UPSERT.value, 
         OPERATION_OPT_KEY -> "upsert",  
         RECORDKEY_FIELD_OPT_KEY -> pkCol,  
         PRECOMBINE_FIELD_OPT_KEY -> preCombineCol,  
         "hoodie.embed.timeline.server" -> "false",
         "hoodie.cleaner.commits.retained" -> "1",
         "hoodie.cleaner.fileversions.retained" -> "1",
         "hoodie.cleaner.policy" -> 
HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name(),
         "hoodie.keep.min.commits" -> "3",
         "hoodie.keep.max.commits" -> "4",
         "hoodie.compact.inline" -> "true",
         "hoodie.compact.inline.max.delta.commits" -> "1",
         //      "hoodie.copyonwrite.record.size.estimate" -> 
String.valueOf(500),
         PARTITIONPATH_FIELD_OPT_KEY -> "dt", 
         HIVE_PARTITION_FIELDS_OPT_KEY -> "dt",
         HIVE_URL_OPT_KEY -> "jdbc:hive2:/0.0.0.0:10000",
         HIVE_USER_OPT_KEY -> "",
         HIVE_PASS_OPT_KEY -> "",
         HIVE_DATABASE_OPT_KEY -> hiveDatabaseName,
         HIVE_TABLE_OPT_KEY -> hiveTableName,
         HIVE_SYNC_ENABLED_OPT_KEY -> "true",
         HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH -> "true",
         HoodieWriteConfig.TABLE_NAME -> hiveTableName,  
         HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> 
classOf[MultiPartKeysValueExtractor].getName,
         HoodieIndexConfig.INDEX_TYPE_PROP -> 
HoodieIndex.IndexType.GLOBAL_BLOOM.name(),
         "hoodie.insert.shuffle.parallelism" -> parallelism,
         "hoodie.upsert.shuffle.parallelism" -> parallelism
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [hudi] wosow opened a new issue #2676: [SUPPORT] When I used 100,000 data to update 100 million data, The program is stuck

Reply via email to