[ 
https://issues.apache.org/jira/browse/HUDI-5496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

voon updated HUDI-5496:
-----------------------
    Description: 
Suppose a partition is no longer being written to or updated, i.e. there will be 
no further changes to the partition, so the sizes of its parquet files will 
never change. 

 

If the parquet files in the partition (even after prior clustering) are smaller 
than {*}hoodie.clustering.plan.strategy.small.file.limit{*}, the fileSlices will 
always be returned as candidates by 
{_}getFileSlicesEligibleForClustering(){_}.
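As a rough illustration (not Hudi's actual code), the sketch below shows why an undersized slice is re-selected on every run: the eligibility check only compares the base file size against the small-file limit, so a cold partition whose single file never grows stays eligible forever. `SMALL_FILE_LIMIT_BYTES` and `eligibleSlices` are assumed stand-ins for the config value and for the real selection logic.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the eligibility behaviour described above.
// SMALL_FILE_LIMIT_BYTES stands in for
// hoodie.clustering.plan.strategy.small.file.limit; the real check lives
// in getFileSlicesEligibleForClustering().
public class EligibilitySketch {
    static final long SMALL_FILE_LIMIT_BYTES = 300L * 1024 * 1024; // assumed 300 MB

    // A slice stays eligible as long as its base file is under the limit,
    // regardless of whether the partition still receives writes.
    static List<Long> eligibleSlices(List<Long> baseFileSizeBytes) {
        return baseFileSizeBytes.stream()
                .filter(size -> size < SMALL_FILE_LIMIT_BYTES)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // One 130 MB file (under the limit) and one 500 MB file (over it).
        List<Long> sizes = Arrays.asList(130L * 1024 * 1024, 500L * 1024 * 1024);
        System.out.println(eligibleSlices(sizes).size()); // prints 1
    }
}
```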

 

This may cause inputGroups with only 1 fileSlice to be selected as candidates 
for clustering. An example of a clusteringPlan demonstrating such a case, in 
JSON format, is shown below.

 

 
{code:java}
{
  "inputGroups": [
    {
      "slices": [
        {
          "dataFilePath": 
"/path/clustering_test_table/dt=2023-01-03/cf2929a7-78dc-4e99-be0c-926e9487187d-0_0-2-0_20230104102201656.parquet",
          "deltaFilePaths": [],
          "fileId": "cf2929a7-78dc-4e99-be0c-926e9487187d-0",
          "partitionPath": "dt=2023-01-03",
          "bootstrapFilePath": "",
          "version": 1
        }
      ],
      "metrics": {
        "TOTAL_LOG_FILES": 0.0,
        "TOTAL_IO_MB": 260.0,
        "TOTAL_IO_READ_MB": 130.0,
        "TOTAL_LOG_FILES_SIZE": 0.0,
        "TOTAL_IO_WRITE_MB": 130.0
      },
      "numOutputFileGroups": 1,
      "extraMetadata": null,
      "version": 1
    },
    {
      "slices": [
        {
          "dataFilePath": 
"/path/clustering_test_table/dt=2023-01-04/b101162e-4813-4de6-9881-4ee0ff918f32-0_0-2-0_20230104103401458.parquet",
          "deltaFilePaths": [],
          "fileId": "b101162e-4813-4de6-9881-4ee0ff918f32-0",
          "partitionPath": "dt=2023-01-04",
          "bootstrapFilePath": "",
          "version": 1
        },
        {
          "dataFilePath": 
"/path/clustering_test_table/dt=2023-01-04/9b1c1494-2a58-43f1-890d-4b52070937b1-0_0-2-0_20230104102201656.parquet",
          "deltaFilePaths": [],
          "fileId": "9b1c1494-2a58-43f1-890d-4b52070937b1-0",
          "partitionPath": "dt=2023-01-04",
          "bootstrapFilePath": "",
          "version": 1
        }
      ],
      "metrics": {
        "TOTAL_LOG_FILES": 0.0,
        "TOTAL_IO_MB": 418.0,
        "TOTAL_IO_READ_MB": 209.0,
        "TOTAL_LOG_FILES_SIZE": 0.0,
        "TOTAL_IO_WRITE_MB": 209.0
      },
      "numOutputFileGroups": 1,
      "extraMetadata": null,
      "version": 1
    }
  ],
  "strategy": {
    "strategyClassName": 
"org.apache.hudi.client.clustering.run.strategy.SparkSortAndSizeExecutionStrategy",
    "strategyParams": {},
    "version": 1
  },
  "extraMetadata": {},
  "version": 1,
  "preserveHoodieMetadata": true
}{code}
 

Such a case causes performance issues, as the parquet file is re-written 
unnecessarily on every clustering run (write amplification). 

 

The fix is to select only inputGroups with more than 1 fileSlice as candidates 
for clustering.
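A minimal sketch of the proposed filter, assuming a simplified `InputGroup` as a stand-in for Hudi's actual clustering-group class: groups with a single slice are dropped from the plan, since rewriting one small file by itself achieves nothing.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the fix; InputGroup is a stand-in for the real
// clustering group type, not Hudi's actual API.
public class SingleSliceFilterSketch {
    static class InputGroup {
        final List<String> sliceIds;
        InputGroup(List<String> sliceIds) { this.sliceIds = sliceIds; }
    }

    // Keep only groups that would actually merge files, i.e. groups
    // containing more than one file slice.
    static List<InputGroup> dropSingleSliceGroups(List<InputGroup> groups) {
        return groups.stream()
                .filter(g -> g.sliceIds.size() > 1)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        InputGroup single = new InputGroup(Arrays.asList("slice-a"));
        InputGroup pair = new InputGroup(Arrays.asList("slice-b", "slice-c"));
        List<InputGroup> kept = dropSingleSliceGroups(Arrays.asList(single, pair));
        System.out.println(kept.size()); // prints 1
    }
}
```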

 

> Prevent Hudi from generating clustering plans with filegroups consisting of 
> only 1 fileSlice
> --------------------------------------------------------------------------------------------
>
>                 Key: HUDI-5496
>                 URL: https://issues.apache.org/jira/browse/HUDI-5496
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: voon
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
