abhijeetkushe commented on issue #1737: URL: https://github.com/apache/hudi/issues/1737#issuecomment-696926015
@n3nash Apologies for the delayed response. I tried a number of heuristics from the available config options for both COW and MOR, and I think I now have an idea of how file creation happens. I am using emr-5.30.1, which ships Hudi 0.5.2-incubating and Presto 0.232. I observed a few things and have a few questions about them.

For the COW table, I am writing 100 MB of data multiple times using the options below:

```python
{
    'hoodie.table.name': 'click',
    'hoodie.datasource.write.recordkey.field': 'campaign_activity_id,contact_id,created_on',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.partitionpath.field': 'bucket',
    'hoodie.datasource.write.hive_style_partitioning': True,
    'hoodie.datasource.write.table.name': 'click',
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.precombine.field': 'created_on',
    'hoodie.parquet.small.file.limit': 25000000,     # 25 MB (50 bytes x 0.5 million records = 25 MB)
    'hoodie.copyonwrite.insert.split.size': 500000,  # 0.5 million records
    'hoodie.copyonwrite.record.size.estimate': 50,   # 50 bytes per record
    'hoodie.parquet.max.file.size': 50000000,        # 50 MB
    'hoodie.parquet.block.size': 50000000,
    'hoodie.copyonwrite.insert.auto.split': False,
    # 'hoodie.embed.timeline.server': False,
    'hoodie.clean.automatic': True,
    'hoodie.clean.async': False,
    'hoodie.cleaner.commits.retained': 1,
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.datasource.hive_sync.partition_fields': 'bucket',
    'hoodie.datasource.hive_sync.enable': True,
    'hoodie.datasource.hive_sync.table': 'click',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
}
```

I ran into the Invalid Parquet issue after the 3rd write (https://github.com/prestodb/presto/issues/13457), which will be fixed in a later version of Presto. However, I noticed that files larger than 50 MB were being created, which differs from the max file size specified above (snapshots below). I observed the same behavior for MOR, where I believe the cause ought to be the same, since these are also Parquet files.
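For reference, here is a minimal sketch of the sizing arithmetic implied by the commented config values above. The constants mirror the Hudi settings, but the variable names are illustrative only, not Hudi API:

```python
# Sizing arithmetic implied by the Hudi write config above.
# Constant names are illustrative; the values come from the config comments.

RECORD_SIZE_ESTIMATE = 50       # hoodie.copyonwrite.record.size.estimate (bytes/record)
INSERT_SPLIT_SIZE = 500_000     # hoodie.copyonwrite.insert.split.size (records)
SMALL_FILE_LIMIT = 25_000_000   # hoodie.parquet.small.file.limit (bytes)
MAX_FILE_SIZE = 50_000_000      # hoodie.parquet.max.file.size (bytes)

# Estimated bytes per insert bucket: 50 bytes x 0.5 million records = 25 MB,
# which is exactly the small-file limit chosen above.
estimated_bucket_bytes = RECORD_SIZE_ESTIMATE * INSERT_SPLIT_SIZE
print(estimated_bucket_bytes == SMALL_FILE_LIMIT)  # True

# If the 50-byte estimate were exact, a file would hold at most
# 50 MB / 50 bytes = 1 million records before hitting the max file size.
max_records_per_file = MAX_FILE_SIZE // RECORD_SIZE_ESTIMATE
print(max_records_per_file)  # 1000000
```

This only shows what the configuration *intends*; the observation above is that the actual Parquet files on disk exceeded the 50 MB ceiling anyway, presumably because the per-record size estimate does not match the real encoded/compressed size.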