[ https://issues.apache.org/jira/browse/HUDI-1054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bhavani Sudha resolved HUDI-1054.
---------------------------------
    Resolution: Fixed

> Address performance issues with finalizing writes on S3
> -------------------------------------------------------
>
>                 Key: HUDI-1054
>                 URL: https://issues.apache.org/jira/browse/HUDI-1054
>             Project: Apache Hudi
>          Issue Type: Sub-task
>          Components: bootstrap, Common Core, Performance
>            Reporter: Udit Mehrotra
>            Assignee: Udit Mehrotra
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 0.6.0
>
>
> I have identified three performance bottlenecks in the
> [finalizeWrite|https://github.com/apache/hudi/blob/master/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L378]
> function that become more prominent with the new bootstrap mechanism on S3:
> * [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L425]
> is a serial operation performed at the driver, and it can take a long time
> when there are many partitions and a large number of files.
> * The invalid data paths are stored in a List instead of a Set. As a result,
> the following operation becomes O(N^2) and takes significant time to compute
> at the driver:
> [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L429]
> * [https://github.com/apache/hudi/blob/5e476733417c3f92ea97d3e5f9a5c8bc48246e99/hudi-client/src/main/java/org/apache/hudi/table/HoodieTable.java#L473]
> does a recursive delete of the marker directory at the driver. This is again
> extremely expensive when there are a large number of partitions and files.
>
> When tested with a 1 TB data set having 8000 partitions and approximately
> 190000 files, this whole process takes *35 minutes*. There is scope to
> address these performance issues with Spark parallelization and by using
> appropriate data structures.
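
To illustrate the second bottleneck (List versus Set), here is a minimal standalone sketch in plain Java. It is not the Hudi code: the class `InvalidPathFilter`, both method names, and the sample paths are hypothetical and exist only to show why membership checks against a List grow quadratically while the same checks against a HashSet stay roughly linear.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration only; names do not come from the Hudi code base.
public class InvalidPathFilter {

    // Quadratic variant: List.contains scans the whole list on every call,
    // so filtering N candidates against N valid paths costs O(N^2).
    static List<String> filterWithList(List<String> candidates, List<String> validPaths) {
        return candidates.stream()
                .filter(p -> !validPaths.contains(p))
                .collect(Collectors.toList());
    }

    // Linear variant: copy the valid paths into a HashSet once, then each
    // contains call is expected constant time.
    static List<String> filterWithSet(List<String> candidates, List<String> validPaths) {
        Set<String> valid = new HashSet<>(validPaths);
        return candidates.stream()
                .filter(p -> !valid.contains(p))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> all = new ArrayList<>();
        List<String> valid = new ArrayList<>();
        // Made-up paths: every tenth file is treated as invalid.
        for (int i = 0; i < 20_000; i++) {
            String path = "partition=" + (i % 100) + "/file-" + i + ".parquet";
            all.add(path);
            if (i % 10 != 0) {
                valid.add(path);
            }
        }

        long t0 = System.nanoTime();
        List<String> viaList = filterWithList(all, valid);
        long t1 = System.nanoTime();
        List<String> viaSet = filterWithSet(all, valid);
        long t2 = System.nanoTime();

        System.out.println("same result: " + viaList.equals(viaSet));
        System.out.println("list-based ms: " + (t1 - t0) / 1_000_000);
        System.out.println("set-based ms:  " + (t2 - t1) / 1_000_000);
    }
}
```

Both methods return the same paths in the same order; only the lookup cost differs. The sketch covers the data-structure point only. The serial listing and the recursive marker-directory delete would need to be distributed across executors, which is outside what this example shows.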
-- This message was sent by Atlassian Jira (v8.3.4#803005)