abhijeetkushe commented on issue #1737: URL: https://github.com/apache/hudi/issues/1737#issuecomment-696926015
@n3nash Apologies for the delayed response. I tried a number of heuristics from the available config options for both COW and MOR, and I think I now have an idea of how file creation happens. I am using emr-5.30.1, which ships Hudi 0.5.2-incubating and Presto 0.232. I observed a few things and have a few questions about them.

For the COW table, I am writing 100 MB of data multiple times using the options below:

```python
{
    'hoodie.table.name': 'click',
    'hoodie.datasource.write.recordkey.field': 'campaign_activity_id,contact_id,created_on',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator',
    'hoodie.datasource.write.partitionpath.field': 'bucket',
    'hoodie.datasource.write.hive_style_partitioning': True,
    'hoodie.datasource.write.table.name': 'click',
    'hoodie.datasource.write.operation': 'insert',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.precombine.field': 'created_on',
    'hoodie.parquet.small.file.limit': 25000000,     # 25 MB (50 bytes x 0.5 million records = 25 MB)
    'hoodie.copyonwrite.insert.split.size': 500000,  # 0.5 million records
    'hoodie.copyonwrite.record.size.estimate': 50,   # 50 bytes per record
    'hoodie.parquet.max.file.size': 50000000,        # 50 MB
    'hoodie.parquet.block.size': 50000000,
    'hoodie.copyonwrite.insert.auto.split': False,
    # 'hoodie.embed.timeline.server': False,
    'hoodie.clean.automatic': True,
    'hoodie.clean.async': False,
    'hoodie.cleaner.commits.retained': 1,
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.datasource.hive_sync.partition_fields': 'bucket',
    'hoodie.datasource.hive_sync.enable': True,
    'hoodie.datasource.hive_sync.table': 'click',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
}
```

I ran into the Invalid Parquet issue after the 3rd write (https://github.com/prestodb/presto/issues/13457), which will be fixed in a later version of Presto. However, I noticed that files larger than 50 MB were being created, which differs from the max file size specified above (snapshots below). I observed the same behavior for MOR, where I believe the cause ought to be the same, since these are also Parquet files.
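For reference, here is a minimal sketch of the sizing arithmetic implied by the commented config values above. The constants mirror the Hudi settings, but the variable names are illustrative only, not Hudi API:

```python
# Sizing arithmetic implied by the Hudi write config above.
# Constant names are illustrative; the values come from the config comments.

RECORD_SIZE_ESTIMATE = 50       # hoodie.copyonwrite.record.size.estimate (bytes/record)
INSERT_SPLIT_SIZE = 500_000     # hoodie.copyonwrite.insert.split.size (records)
SMALL_FILE_LIMIT = 25_000_000   # hoodie.parquet.small.file.limit (bytes)
MAX_FILE_SIZE = 50_000_000      # hoodie.parquet.max.file.size (bytes)

# Estimated bytes per insert bucket: 50 bytes x 0.5 million records = 25 MB,
# which is exactly the small-file limit chosen above.
estimated_bucket_bytes = RECORD_SIZE_ESTIMATE * INSERT_SPLIT_SIZE
print(estimated_bucket_bytes == SMALL_FILE_LIMIT)  # True

# If the 50-byte estimate were exact, a file would hold at most
# 50 MB / 50 bytes = 1 million records before hitting the max file size.
max_records_per_file = MAX_FILE_SIZE // RECORD_SIZE_ESTIMATE
print(max_records_per_file)  # 1000000
```

This only shows what the configuration *intends*; the observation above is that the actual Parquet files on disk exceeded the 50 MB ceiling anyway, presumably because the per-record size estimate does not match the real encoded/compressed size.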