[ https://issues.apache.org/jira/browse/HUDI-4363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raymond Xu reassigned HUDI-4363:
--------------------------------

    Assignee: Hui An

> Support Clustering row writer to improve performance
> ----------------------------------------------------
>
>                 Key: HUDI-4363
>                 URL: https://issues.apache.org/jira/browse/HUDI-4363
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: performance, writer-core
>            Reporter: Hui An
>            Assignee: Hui An
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 0.13.0
>
>         Attachments: Screen Shot 2022-07-05 at 17.25.13.png
>
>
> 1. Integrate clustering with the datasource read and write API. In this way:
> - Clustering can use the Dataset API
> - Read and write operations are unified, so any improvement to the read/write logic (e.g. vectorized reads) also benefits clustering
> 2. Use {{hoodie.datasource.read.paths}} to pass the input paths for each clustering operation.
> 3. Introduce {{HoodieInternalWriteStatusCoordinator}} to persist the {{InternalWriteStatus}} of a clustering action, since it cannot be obtained directly when writing through the Spark datasource.
> 4. Add new configurations to control this behavior.
> h4. Test performance
> The test table has 21 columns and 710,716 rows; raw data size is 929 GB (in Spark memory), 38.3 GB after compression.
> Executor memory: 50 GB, 20 instances, with global_sort enabled.
> Without clustering as row: 32 min 12 s
> Using clustering as row: 9 min 51 s
> The performance improvement can also be seen in the tests
> {{TestHoodieSparkMergeOnReadTableClustering}} and
> {{testLayoutOptimizationFunctional}}.
> !Screen Shot 2022-07-05 at 17.25.13.png!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
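(Editor's illustration of point 2 above: a minimal sketch of how a row-writer clustering execution might read one clustering operation's input file groups through the Spark datasource by passing their paths. Only the option key `hoodie.datasource.read.paths` comes from the issue description; the method name `readClusteringInput` and its parameters are hypothetical, and running this requires a Spark session with the Hudi bundle on the classpath.)

```scala
// Hypothetical sketch, not the actual Hudi implementation.
import org.apache.spark.sql.{DataFrame, SparkSession}

def readClusteringInput(spark: SparkSession, inputPaths: Seq[String]): DataFrame = {
  spark.read
    .format("hudi")
    // Pass this clustering operation's file paths directly to the
    // datasource reader, as described in point 2 of the issue.
    .option("hoodie.datasource.read.paths", inputPaths.mkString(","))
    .load()
}
```

Because the rows come back as a `DataFrame`, the subsequent sort and write can reuse the regular datasource write path, which is how improvements such as vectorized reads would carry over to clustering.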