[
https://issues.apache.org/jira/browse/HUDI-5496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
voon updated HUDI-5496:
---
Description:
Suppose a partition is no longer being written to or updated, i.e. there will
be no further changes to the partition, so the sizes of its parquet files will
always stay the same.
If a parquet file in the partition (even after prior clustering) is smaller
than {*}hoodie.clustering.plan.strategy.small.file.limit{*}, its fileSlice will
always be returned as a candidate by
{_}getFileSlicesEligibleForClustering(){_}.
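The behaviour described above can be illustrated with a minimal sketch. It is
not Hudi's implementation: the {{Slice}} class, the {{eligibleForClustering}}
helper and the 300 MB limit are hypothetical stand-ins, and the only assumption
is that eligibility is decided by comparing the base file size against the
small-file limit.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SmallFileEligibilitySketch {

    /** Hypothetical stand-in for a file slice: just a file ID and its base file size. */
    static final class Slice {
        final String fileId;
        final long sizeBytes;

        Slice(String fileId, long sizeBytes) {
            this.fileId = fileId;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Keeps every slice whose base file is smaller than the small-file limit. */
    static List<Slice> eligibleForClustering(List<Slice> slices, long smallFileLimitBytes) {
        return slices.stream()
                .filter(s -> s.sizeBytes < smallFileLimitBytes)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        long limit = 300L * 1024 * 1024; // assumed limit, for illustration only
        List<Slice> partition = Arrays.asList(new Slice("file-a", 130L * 1024 * 1024));

        // The partition is frozen, so every planning round sees the same file size
        // and returns the same slice as a candidate again.
        for (int round = 1; round <= 3; round++) {
            List<Slice> candidates = eligibleForClustering(partition, limit);
            System.out.println("round " + round + ": " + candidates.size() + " candidate(s)");
        }
    }
}
```

Because nothing in the partition changes between rounds, the size check alone
can never stop the same file from being picked up again.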
This may cause inputGroups with only 1 fileSlice to be selected as candidates
for clustering. An example of a clusteringPlan demonstrating such a case, in
JSON format, is shown below; the first inputGroup (dt=2023-01-03) contains a
single fileSlice.
{code:java}
{
  "inputGroups": [
    {
      "slices": [
        {
          "dataFilePath": "/path/clustering_test_table/dt=2023-01-03/cf2929a7-78dc-4e99-be0c-926e9487187d-0_0-2-0_20230104102201656.parquet",
          "deltaFilePaths": [],
          "fileId": "cf2929a7-78dc-4e99-be0c-926e9487187d-0",
          "partitionPath": "dt=2023-01-03",
          "bootstrapFilePath": "",
          "version": 1
        }
      ],
      "metrics": {
        "TOTAL_LOG_FILES": 0.0,
        "TOTAL_IO_MB": 260.0,
        "TOTAL_IO_READ_MB": 130.0,
        "TOTAL_LOG_FILES_SIZE": 0.0,
        "TOTAL_IO_WRITE_MB": 130.0
      },
      "numOutputFileGroups": 1,
      "extraMetadata": null,
      "version": 1
    },
    {
      "slices": [
        {
          "dataFilePath": "/path/clustering_test_table/dt=2023-01-04/b101162e-4813-4de6-9881-4ee0ff918f32-0_0-2-0_20230104103401458.parquet",
          "deltaFilePaths": [],
          "fileId": "b101162e-4813-4de6-9881-4ee0ff918f32-0",
          "partitionPath": "dt=2023-01-04",
          "bootstrapFilePath": "",
          "version": 1
        },
        {
          "dataFilePath": "/path/clustering_test_table/dt=2023-01-04/9b1c1494-2a58-43f1-890d-4b52070937b1-0_0-2-0_20230104102201656.parquet",
          "deltaFilePaths": [],
          "fileId": "9b1c1494-2a58-43f1-890d-4b52070937b1-0",
          "partitionPath": "dt=2023-01-04",
          "bootstrapFilePath": "",
          "version": 1
        }
      ],
      "metrics": {
        "TOTAL_LOG_FILES": 0.0,
        "TOTAL_IO_MB": 418.0,
        "TOTAL_IO_READ_MB": 209.0,
        "TOTAL_LOG_FILES_SIZE": 0.0,
        "TOTAL_IO_WRITE_MB": 209.0
      },
      "numOutputFileGroups": 1,
      "extraMetadata": null,
      "version": 1
    }
  ],
  "strategy": {
    "strategyClassName": "org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
    "strategyParams": {},
    "version": 1
  },
  "extraMetadata": {},
  "version": 1,
  "preserveHoodieMetadata": true
}{code}
Clustering such a group hurts performance because a single parquet file is
rewritten unnecessarily (write amplification).
The fix is to only select inputGroups with more than 1 fileSlice as candidates
for clustering.
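The proposed filter can be sketched as follows. This is an illustration under
stated assumptions, not the actual patch: input groups are modelled as plain
lists of file IDs, and the class and method names are hypothetical. The file
IDs are taken from the example plan above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class SingleSliceGroupFilterSketch {

    /** Keeps only groups with more than one slice; a lone slice would just be rewritten as-is. */
    static <T> List<List<T>> dropSingleSliceGroups(List<List<T>> inputGroups) {
        return inputGroups.stream()
                .filter(group -> group.size() > 1)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Group 1 (dt=2023-01-03) has one slice; group 2 (dt=2023-01-04) has two.
        List<List<String>> inputGroups = Arrays.asList(
                Arrays.asList("cf2929a7-78dc-4e99-be0c-926e9487187d-0"),
                Arrays.asList(
                        "b101162e-4813-4de6-9881-4ee0ff918f32-0",
                        "9b1c1494-2a58-43f1-890d-4b52070937b1-0"));

        List<List<String>> kept = dropSingleSliceGroups(inputGroups);
        System.out.println(kept.size() + " group(s) kept: " + kept);
    }
}
```

With this filter, the dt=2023-01-03 group is dropped from the plan and only the
two-slice dt=2023-01-04 group remains a clustering candidate.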
was:
Suppose a partition is no longer being written to or updated, i.e. there will
be no further changes to the partition, so the sizes of its parquet files will
always stay the same.
If a parquet file in the partition (even after prior clustering) is smaller
than {*}hoodie.clustering.plan.strategy.small.file.limit{*}, ** its fileSlice
will always be returned as a candidate by
{_}getFileSlicesEligibleForClustering(){_}.
This may cause inputGroups with only 1 fileSlice to be selected as candidates
for clustering. An example of a clusteringPlan demonstrating such a case, in
JSON format, is shown below; the first inputGroup (dt=2023-01-03) contains a
single fileSlice.
{code:java}
{
"inputGroups": [
{
"slices": [
{
"dataFilePath":
"/path/clustering_test_table/dt=2023-01-03/cf2929a7-78dc-4e99-be0c-926e9487187d-0_0-2-0_20230104102201656.parquet",
"deltaFilePaths": [],
"fileId": "cf2929a7-78dc-4e99-be0c-926e9487187d-0",
"partitionPath": "dt=2023-01-03",
"bootstrapFilePath": "",
"version": 1
}
],
"metrics": {
"TOTAL_LOG_FILES": 0.0,
"TOTAL_IO_MB": 260.0,
"TOTAL_IO_READ_MB": 130.0,
"TOTAL_LOG_FILES_SIZE": 0.0,
"TOTAL_IO_WRITE_MB": 130.0
},
"numOutputFileGroups": 1,
"extraMetadata": null,
"version": 1
},
{
"slices": [
{
"dataFilePath":
"/path/clustering_test_table/dt=2023-01-04/b101162e-4813-4de6-9881-4ee0ff918f32-0_0-2-0_20230104103401458.parquet",
"deltaFilePaths": [],
"fileId": "b101162e-4813-4de6-9881-4ee0ff918f32-0",
"partitionPath": "dt=2023-01-04",
"bootstrapFilePath": "",
"version": 1
},
{
"dataFilePath":
"/path/clustering_test_table/dt=2023-01-04/9b1c1494-2a58-43f1-890d-4b52070937b1-0_0-2-0_20230104102201656.parquet",
"deltaFilePaths": [],
"fileId": "9b1c1494-2a58-43f1-890d-4b52070937b1-0",
"partitionPath":