stackfun opened a new issue #2771:
URL: https://github.com/apache/hudi/issues/2771


   **Describe the problem you faced**
   
   Sometimes, log files with only upserts are not compacted in MOR table. 
    The first image shows the compacted parquet files, note that it was created 
March 30th.
   
![image](https://user-images.githubusercontent.com/68627128/113636890-3bbc8180-9628-11eb-8570-31b47355f1f6.png)
   
   The second image shows the log files, which were created after the 30th but are never compacted. In our use case we have many small, random upserts.
   
![image](https://user-images.githubusercontent.com/68627128/113636936-52fb6f00-9628-11eb-8396-c7bf2a968272.png)
   
   Here's our Hudi configuration:
   ```
   options = {
       # Configs that we shouldn't change
       "hoodie.table.name": table_name,
       "hoodie.datasource.write.hive_style_partitioning": True,
       "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
       "hoodie.datasource.write.operation": "upsert",
       "hoodie.datasource.write.partitionpath.field": "field_1,field2",
       "hoodie.datasource.write.recordkey.field": "sha256",
       "hoodie.datasource.write.table.name": table_name,
       "hoodie.datasource.write.table.type": "MERGE_ON_READ",
       "hoodie.datasource.compaction.async.enable": True,
       "hoodie.index.type": "SIMPLE",
       "hoodie.compact.inline": True,
       "hoodie.clean.async": True,
       "hoodie.clean.automatic": True,
       "hoodie.simple.index.input.storage.level": "DISK_ONLY",
       "hoodie.write.status.storage.level": "DISK_ONLY",
       "hoodie.cleaner.commits.retained": 2,
       "hoodie.compact.inline.max.delta.commits": "16",
       "hoodie.logfile.data.block.max.size": 1024 * 1024 * 8,  # Workaround for https://github.com/apache/hudi/issues/2692
       "hoodie.logfile.max.size": 1024 * 1024 * 8,  # Workaround for https://github.com/apache/hudi/issues/2692
       "hoodie.memory.merge.fraction": "0.75",  # default is 0.6, allocate more memory for merging
   }
   ```
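   For context: with inline compaction and the default commit-count trigger, Hudi only schedules a compaction once the number of delta commits since the last compaction reaches `hoodie.compact.inline.max.delta.commits` (16 in the config above). A minimal Python sketch of that check, to illustrate why many small upsert commits can sit uncompacted for a while (the function name is hypothetical, not Hudi's actual API):
   
   ```python
   def should_schedule_compaction(delta_commits_since_last_compaction: int,
                                  max_delta_commits: int = 16) -> bool:
       # Mirrors the commit-count trigger: a compaction is scheduled only
       # after max_delta_commits delta commits have accumulated on the table.
       return delta_commits_since_last_compaction >= max_delta_commits
   
   # With the configuration above, 15 small upsert commits are not enough
   # to trigger compaction; the 16th commit crosses the threshold.
   print(should_schedule_compaction(15))  # False
   print(should_schedule_compaction(16))  # True
   ```
   
   In our case the delta-commit count has long since exceeded 16, yet this file group is still not picked up.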
   
   **To Reproduce**
   
   We are currently trying to reproduce this with a small example, but have not succeeded yet.
   
   **Expected behavior**
   
   Compaction should run on this file group.
   
   **Environment Description**
   
   * Hudi version : 0.7.0 (with #2500 merged)
   
   * Spark version : 2.4.7
   
   * Hive version : 2.3.7
   
   * Hadoop version : 2.9.2
   
   * Storage (HDFS/S3/GCS..) : GCS
   
   * Running on Docker? (yes/no) : no
   
   * GCP Dataproc: 1.4
   

