wosow opened a new issue #2676:
URL: https://github.com/apache/hudi/issues/2676


   **Environment Description**
   
   * Hudi version : 0.7.0/0.6.0
   
   * Spark version : 2.4.4 
   
   * Hive version :2.3.1
   
   * Hadoop version : 2.7.5
   
   * Storage (HDFS/S3/GCS..) : HDFS
   
   * Running on Docker? (yes/no) : no
   
   When I used 100,000 data to update 100 million data, the program was stuck 
and could not execute further. The table type used was MOR. The program 
execution diagram is as follows:
   
![image](https://user-images.githubusercontent.com/34565079/111167633-48772800-85dc-11eb-9072-1f4f7a3a2c54.png)
   
   hudi parameters as follow:
       TABLE_TYPE_OPT_KEY -> MOR_TABLE_TYPE_OPT_VAL, 
   //      OPERATION_OPT_KEY -> WriteOperationType.UPSERT.value, 
         OPERATION_OPT_KEY -> "upsert",  
         RECORDKEY_FIELD_OPT_KEY -> pkCol,  
         PRECOMBINE_FIELD_OPT_KEY -> preCombineCol,  
         "hoodie.embed.timeline.server" -> "false",
         "hoodie.cleaner.commits.retained" -> "1",
         "hoodie.cleaner.fileversions.retained" -> "1",
         "hoodie.cleaner.policy" -> 
HoodieCleaningPolicy.KEEP_LATEST_FILE_VERSIONS.name(),
         "hoodie.keep.min.commits" -> "3",
         "hoodie.keep.max.commits" -> "4",
         "hoodie.compact.inline" -> "true",
         "hoodie.compact.inline.max.delta.commits" -> "1",
         //      "hoodie.copyonwrite.record.size.estimate" -> 
String.valueOf(500),
         PARTITIONPATH_FIELD_OPT_KEY -> "dt", 
         HIVE_PARTITION_FIELDS_OPT_KEY -> "dt",
         HIVE_URL_OPT_KEY -> "jdbc:hive2:/0.0.0.0:10000",
         HIVE_USER_OPT_KEY -> "",
         HIVE_PASS_OPT_KEY -> "",
         HIVE_DATABASE_OPT_KEY -> hiveDatabaseName,
         HIVE_TABLE_OPT_KEY -> hiveTableName,
         HIVE_SYNC_ENABLED_OPT_KEY -> "true",
         HoodieIndexConfig.BLOOM_INDEX_UPDATE_PARTITION_PATH -> "true",
         HoodieWriteConfig.TABLE_NAME -> hiveTableName,  
         HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY -> 
classOf[MultiPartKeysValueExtractor].getName,
         HoodieIndexConfig.INDEX_TYPE_PROP -> 
HoodieIndex.IndexType.GLOBAL_BLOOM.name(),
         "hoodie.insert.shuffle.parallelism" -> parallelism,
         "hoodie.upsert.shuffle.parallelism" -> parallelism
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to