[ https://issues.apache.org/jira/browse/HUDI-860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinoth Chandar updated HUDI-860:
--------------------------------
    Priority: Blocker  (was: Major)

> Ability to do small file handling without need for caching
> ----------------------------------------------------------
>
>                 Key: HUDI-860
>                 URL: https://issues.apache.org/jira/browse/HUDI-860
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: Writer Core
>            Reporter: Vinoth Chandar
>            Assignee: sivabalan narayanan
>            Priority: Blocker
>             Fix For: 0.7.0
>
>
> As of now, in the upsert path:
>  * Hudi builds a workloadProfile to determine the total inserts and updates
> (with location info).
>  * Small-file info is then populated.
>  * Buckets are then populated from the above info.
>  * These buckets are later used when getPartition(Object key) is invoked in
> UpsertPartitioner.
> In step 1, building the global workload profile requires running an action
> on the entire JavaRDD<HoodieRecord> in the driver, and Hudi also saves the
> workload profile.
> For large, write-intensive batch jobs (COW table type), caching this incurs
> additional overhead. This effort explores whether that caching can be
> avoided by some means.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
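
The upsert steps listed in the issue can be illustrated with a small, self-contained Java sketch. This is a toy model under stated assumptions, not Hudi's implementation: `UpsertPlanSketch`, `Rec`, `Plan`, `buildPlan`, `smallFileRoom`, and `recordsPerNewFile` are hypothetical names, records live in a plain `List` instead of a `JavaRDD<HoodieRecord>`, and the rule used to spread inserts across buckets (key hash modulo total inserts) is a simplification chosen for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the upsert planning steps. All names here are hypothetical, not Hudi classes. */
public class UpsertPlanSketch {

    /** A record with a known file location is an update; a null fileId means insert. */
    static final class Rec {
        final String key;
        final String fileId;
        Rec(String key, String fileId) { this.key = key; this.fileId = fileId; }
    }

    /** Result of planning: the bucket list plus the lookup data used by getPartition. */
    static final class Plan {
        final List<String> buckets = new ArrayList<>();            // bucket index -> label
        final Map<String, Integer> updateBucket = new HashMap<>(); // fileId -> bucket index
        final List<Integer> insertBuckets = new ArrayList<>();     // buckets that accept inserts
        final List<Long> insertUpperBounds = new ArrayList<>();    // cumulative insert capacity
        long totalInserts = 0;

        /** Step 4: route one record to a bucket index. */
        int getPartition(Rec r) {
            if (r.fileId != null) {
                return updateBucket.get(r.fileId); // updates go to the bucket of their file
            }
            // Assumption for this sketch: spread inserts by key hash over the insert slots.
            long slot = Math.floorMod((long) r.key.hashCode(), totalInserts);
            for (int i = 0; i < insertBuckets.size(); i++) {
                if (slot < insertUpperBounds.get(i)) {
                    return insertBuckets.get(i);
                }
            }
            throw new IllegalStateException("no insert bucket for key " + r.key);
        }
    }

    static Plan buildPlan(List<Rec> records, Map<String, Long> smallFileRoom, long recordsPerNewFile) {
        if (recordsPerNewFile <= 0) {
            throw new IllegalArgumentException("recordsPerNewFile must be positive");
        }
        Plan plan = new Plan();

        // Step 1: workload profile. This needs a full pass over the input.
        Map<String, Long> updatesPerFile = new LinkedHashMap<>();
        for (Rec r : records) {
            if (r.fileId == null) {
                plan.totalInserts++;
            } else {
                updatesPerFile.merge(r.fileId, 1L, Long::sum);
            }
        }
        for (String fileId : updatesPerFile.keySet()) {
            plan.updateBucket.put(fileId, plan.buckets.size());
            plan.buckets.add("UPDATE:" + fileId);
        }

        // Steps 2-3: fill the remaining room in small files first, then open new files.
        long assigned = 0;
        for (Map.Entry<String, Long> e : smallFileRoom.entrySet()) {
            long take = Math.min(e.getValue(), plan.totalInserts - assigned);
            if (take <= 0) {
                continue;
            }
            assigned += take;
            Integer idx = plan.updateBucket.get(e.getKey()); // reuse the bucket if the file has one
            if (idx == null) {
                idx = plan.buckets.size();
                plan.buckets.add("INSERT:" + e.getKey());
            }
            plan.insertBuckets.add(idx);
            plan.insertUpperBounds.add(assigned);
        }
        int newFiles = 0;
        while (assigned < plan.totalInserts) {
            assigned += Math.min(recordsPerNewFile, plan.totalInserts - assigned);
            plan.insertBuckets.add(plan.buckets.size());
            plan.buckets.add("INSERT:new-" + (newFiles++));
            plan.insertUpperBounds.add(assigned);
        }
        return plan;
    }

    public static void main(String[] args) {
        List<Rec> batch = new ArrayList<>();
        batch.add(new Rec("k1", "f1"));
        batch.add(new Rec("k2", "f1"));
        batch.add(new Rec("k3", null));
        batch.add(new Rec("k4", null));
        batch.add(new Rec("k5", null));
        Map<String, Long> smallFileRoom = new LinkedHashMap<>();
        smallFileRoom.put("f2", 2L); // hypothetical small file with room for 2 more records

        Plan plan = buildPlan(batch, smallFileRoom, 10L);
        System.out.println(plan.buckets);
        for (Rec r : batch) {
            System.out.println(r.key + " -> bucket " + plan.getPartition(r)); // second pass
        }
    }
}
```

In this model the input is traversed twice: once in `buildPlan` to compute the profile and once more when each record is routed through `getPartition`. With a Spark RDD, a second traversal means either recomputing the lineage or persisting the data, which is the caching overhead the issue wants to avoid.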