[
https://issues.apache.org/jira/browse/HUDI-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Davis Zhang updated HUDI-9543:
------------------------------
Description:
If secondary key column contains null, the SI does not track those records.
Plan: Escape some char and use the unescaped version as the note for null str
Use Null character (ASCII 0) to represent null
for normal string, escape Null character (ASCII 0) to '\' + Null character
(ASCII 0)
Make sure reader and writer path works for this
h1. Hash value computation
As of today, the hash value computation is based on its unescaped value (raw
value).
{code:java}
public static int mapRecordKeyToFileGroupIndex(
String recordKey, int numFileGroups, String partitionName,
HoodieIndexVersion version) {
if (MetadataPartitionType.SECONDARY_INDEX.isPartitionType(partitionName)
&& version.greaterThanOrEquals(HoodieIndexVersion.SECONDARY_INDEX_TWO)
&& recordKey.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR)) {
return
mapRecordKeyToFileGroupIndex(SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(recordKey),
numFileGroups);
}
return mapRecordKeyToFileGroupIndex(recordKey, numFileGroups);
}
// change to configurable larger group
private static int mapRecordKeyToFileGroupIndex(String recordKey, int
numFileGroups) {
int h = 0;
for (int i = 0; i < recordKey.length(); ++i) {
h = 31 * h + recordKey.charAt(i);
}
return Math.abs(Math.abs(h) % numFileGroups);
} {code}
When calculating the hash value, if the key is null, we hit NPE. We need to fix
this.
h3. Solution 1 - hard coded value
So it is proposed to give a hard coded hash value "0" for null. If it is null,
return 0.
Pros:
* does not come with the over head of escaping. Even after escape, it just
gives another different fixed coded hash value of string "."
* easy to change, scope is well under control.
h3. Solution 2 - escape first then hash
null value is first escaped as ".", then calculate the hash value.
Cons:
* comes with the overhead of escaping whenever we want to get the hash value,
for all strings.
* Does not give much difference as we still end up with some fixed hash value.
was:
If secondary key column contains null, the SI does not track those records.
Plan: Escape some char and use the unescaped version as the note for null str
Use Null character (ASCII 0) to represent null
for normal string, escape Null character (ASCII 0) to '\' + Null character
(ASCII 0)
Make sure reader and writer path works for this
> Secondary index readers and writers do not handle null char properly
> --------------------------------------------------------------------
>
> Key: HUDI-9543
> URL: https://issues.apache.org/jira/browse/HUDI-9543
> Project: Apache Hudi
> Issue Type: Bug
> Components: index
> Affects Versions: 1.1.0
> Reporter: Davis Zhang
> Assignee: Davis Zhang
> Priority: Critical
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> If secondary key column contains null, the SI does not track those records.
>
> Plan: Escape some char and use the unescaped version as the note for null str
> Use Null character (ASCII 0) to represent null
> for normal string, escape Null character (ASCII 0) to '\' + Null character
> (ASCII 0)
> Make sure reader and writer path works for this
>
>
> h1. Hash value computation
> As of today, the hash value computation is based on its unescaped value (raw
> value).
>
> {code:java}
> public static int mapRecordKeyToFileGroupIndex(
> String recordKey, int numFileGroups, String partitionName,
> HoodieIndexVersion version) {
> if (MetadataPartitionType.SECONDARY_INDEX.isPartitionType(partitionName)
> && version.greaterThanOrEquals(HoodieIndexVersion.SECONDARY_INDEX_TWO)
> && recordKey.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR)) {
> return
> mapRecordKeyToFileGroupIndex(SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(recordKey),
> numFileGroups);
> }
> return mapRecordKeyToFileGroupIndex(recordKey, numFileGroups);
> }
> // change to configurable larger group
> private static int mapRecordKeyToFileGroupIndex(String recordKey, int
> numFileGroups) {
> int h = 0;
> for (int i = 0; i < recordKey.length(); ++i) {
> h = 31 * h + recordKey.charAt(i);
> }
> return Math.abs(Math.abs(h) % numFileGroups);
> } {code}
>
> When calculating the hash value, if the key is null, we hit NPE. We need to
> fix this.
>
> h3. Solution 1 - hard coded value
> So it is proposed to give a hard coded hash value "0" for null. If it is
> null, return 0.
>
> Pros:
> * does not come with the over head of escaping. Even after escape, it just
> gives another different fixed coded hash value of string "."
> * easy to change, scope is well under control.
>
> h3. Solution 2 - escape first then hash
> null value is first escaped as ".", then calculate the hash value.
> Cons:
> * comes with the overhead of escaping whenever we want to get the hash
> value, for all strings.
> * Does not give much difference as we still end up with some fixed hash
> value.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)