[PR] fix(spark): support consistent hashing clustering on non-partitioned tables [hudi]

via GitHub Wed, 10 Jun 2026 08:39:16 -0700


ad1happy2go opened a new pull request, #18968:
URL: https://github.com/apache/hudi/pull/18968


   ### Describe the issue this Pull Request addresses
   
   Consistent hashing bucket-index clustering (bucket resizing) fails on 
**non-partitioned** tables.
   
   For a non-partitioned table the partition path stored in the clustering 
group metadata is an empty string (`""`). 
`SingleSparkJobConsistentHashingExecutionStrategy` validated the partition with 
a "not null **or empty**" guard and threw `IllegalArgumentException: Partition 
should not be null or empty` before any clustering work could run, so 
split/merge resizing was impossible on non-partitioned tables.
   
   JIRA: HUDI-18161
   
   ### Summary and Changelog
   
   - Relax the partition guard in 
`SingleSparkJobConsistentHashingExecutionStrategy` (both the merge and split 
paths) from `!StringUtils.isNullOrEmpty(partition)` to `partition != null`. An 
empty partition path is valid for non-partitioned tables; a genuinely absent 
metadata key (`null`) is still rejected.
   - Add a parameterized test `testResizingNonPartitioned` in 
`TestSparkConsistentBucketClustering`, mirroring the existing `testResizing`, 
covering split and merge resizing on a non-partitioned table across the 
single-job and multi-job execution strategies and the row-writer on/off paths. 
A `setup(..., boolean nonPartitioned)` overload configures the non-partition 
key generator and an empty partition-path field.
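   
The difference between the two guards can be illustrated with a minimal, standalone sketch. It is not the actual Hudi code: the class `PartitionGuardSketch` and its methods are hypothetical, and the old check is modeled with plain `String` methods rather than `StringUtils.isNullOrEmpty`. Only the error message is taken from the description above.

```java
public class PartitionGuardSketch {

    // Models the old guard: rejects both null and the empty string.
    static boolean oldGuardAccepts(String partition) {
        return partition != null && !partition.isEmpty();
    }

    // Models the relaxed guard: rejects only a missing (null) partition.
    static boolean newGuardAccepts(String partition) {
        return partition != null;
    }

    // Hypothetical helper that throws the way a failed argument check would.
    static void requireValid(boolean accepted) {
        if (!accepted) {
            throw new IllegalArgumentException("Partition should not be null or empty");
        }
    }

    public static void main(String[] args) {
        // A partitioned path, the non-partitioned path (""), and a missing key (null).
        String[] samples = {"2026/06/10", "", null};
        for (String partition : samples) {
            System.out.println("partition=" + (partition == null ? "null" : "\"" + partition + "\"")
                + " old=" + oldGuardAccepts(partition)
                + " new=" + newGuardAccepts(partition));
        }
    }
}
```

The two guards disagree only for the empty string, which is why partitioned tables are unaffected.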
   
   ### Impact
   
   Consistent hashing clustering now works on non-partitioned tables. Behavior 
for partitioned tables is unchanged: their partition path is always non-empty, 
so the old and new guards evaluate identically and the code path is the same. 
No public API or config changes.
   
   ### Risk Level
   
   low
   
   Partitioned-table behavior is byte-for-byte identical (the relaxed guard 
only differs when the partition path is the empty string). Verified via the new 
parameterized test and an end-to-end `spark-shell` run on Spark 4.0.2 against a 
non-partitioned MOR table with consistent-hashing bucket index: a follow-up 
write triggered inline split clustering through 
`SingleSparkJobConsistentHashingExecutionStrategy`, increasing the bucket count 
from 2 to 4 with all records preserved.
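   
For readers who want to try a similar setup, the sketch below collects write options that would plausibly describe such a table. The exact options used in the run above are not stated in this PR, so every value here is an assumption: the class name is hypothetical, the maximum bucket count of 8 is arbitrary, and the clustering plan and execution strategy class settings are intentionally left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NonPartitionedBucketOptionsSketch {

    // Builds an illustrative option map; values are assumptions, not the PR author's settings.
    static Map<String, String> writeOptions() {
        Map<String, String> opts = new LinkedHashMap<>();
        // Non-partitioned MOR table: empty partition-path field and the non-partitioned key generator.
        opts.put("hoodie.datasource.write.table.type", "MERGE_ON_READ");
        opts.put("hoodie.datasource.write.keygenerator.class",
            "org.apache.hudi.keygen.NonpartitionedKeyGenerator");
        opts.put("hoodie.datasource.write.partitionpath.field", "");
        // Bucket index with the consistent hashing engine.
        opts.put("hoodie.index.type", "BUCKET");
        opts.put("hoodie.index.bucket.engine", "CONSISTENT_HASHING");
        // Start at 2 buckets; the upper bound of 8 is an arbitrary assumption.
        opts.put("hoodie.bucket.index.num.buckets", "2");
        opts.put("hoodie.bucket.index.max.num.buckets", "8");
        // Run clustering inline with writes.
        opts.put("hoodie.clustering.inline", "true");
        return opts;
    }

    public static void main(String[] args) {
        // Print the options as key=value pairs.
        writeOptions().forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

Such a map could be passed to a Spark DataFrame writer via `options(...)`, together with the table name, record key, and strategy settings that the sketch leaves out.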
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [x] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [x] Enough context is provided in the sections above
   - [x] Adequate tests were added if applicable
   


