[
https://issues.apache.org/jira/browse/HUDI-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Davis Zhang updated HUDI-9543:
------------------------------
Description:
Attempted fix and spot more issue
https://github.com/Davis-Zhang-Onehouse/hudi-oss/pull/2
I wrote test exposing all SI bugs when null values kicks in. Some code involves
MDT write path which require deep dive. It will take some unknown amount of
time to get clarity on how it can be fixed e2e.
If secondary key column contains null, the SI does not track those records.
Plan: Escape some char and use the unescaped version as the note for null str
Use Null character (ASCII 0) to represent null
for normal string, escape Null character (ASCII 0) to '\' + Null character
(ASCII 0)
Make sure reader and writer path works for this
h1. Hash value computation
As of today, the hash value computation is based on its unescaped value (raw
value).
{code:java}
public static int mapRecordKeyToFileGroupIndex(
String recordKey, int numFileGroups, String partitionName,
HoodieIndexVersion version) {
if (MetadataPartitionType.SECONDARY_INDEX.isPartitionType(partitionName)
&& version.greaterThanOrEquals(HoodieIndexVersion.SECONDARY_INDEX_TWO)
&& recordKey.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR)) {
return
mapRecordKeyToFileGroupIndex(SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(recordKey),
numFileGroups);
}
return mapRecordKeyToFileGroupIndex(recordKey, numFileGroups);
}
// change to configurable larger group
private static int mapRecordKeyToFileGroupIndex(String recordKey, int
numFileGroups) {
int h = 0;
for (int i = 0; i < recordKey.length(); ++i) {
h = 31 * h + recordKey.charAt(i);
}
return Math.abs(Math.abs(h) % numFileGroups);
} {code}
When calculating the hash value, if the key is null, we hit NPE. We need to fix
this.
h3. Solution 1 - hard coded value
So it is proposed to give a hard coded hash value "0" for null. If it is null,
return 0.
Pros:
* does not come with the over head of escaping. Even after escape, it just
gives another different fixed coded hash value of string "."
* easy to change, scope is well under control.
h3. Solution 2 - escape first then hash
null value is first escaped as ".", then calculate the hash value.
Cons:
* comes with the overhead of escaping whenever we want to get the hash value,
for all strings.
* Does not give much difference as we still end up with some fixed hash value.
was:
Attempted fix and spot more issue
[https://github.com/apache/hudi/pull/13489/files/38cb553c2fdeb50a34e5db3b37807a06c2783132..133510ba81b32b69ff1387c7172923127266586a]
I wrote test exposing all SI bugs when null values kicks in. Some code involves
MDT write path which require deep dive. It will take some unknown amount of
time to get clarity on how it can be fixed e2e.
If secondary key column contains null, the SI does not track those records.
Plan: Escape some char and use the unescaped version as the note for null str
Use Null character (ASCII 0) to represent null
for normal string, escape Null character (ASCII 0) to '\' + Null character
(ASCII 0)
Make sure reader and writer path works for this
h1. Hash value computation
As of today, the hash value computation is based on its unescaped value (raw
value).
{code:java}
public static int mapRecordKeyToFileGroupIndex(
String recordKey, int numFileGroups, String partitionName,
HoodieIndexVersion version) {
if (MetadataPartitionType.SECONDARY_INDEX.isPartitionType(partitionName)
&& version.greaterThanOrEquals(HoodieIndexVersion.SECONDARY_INDEX_TWO)
&& recordKey.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR)) {
return
mapRecordKeyToFileGroupIndex(SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(recordKey),
numFileGroups);
}
return mapRecordKeyToFileGroupIndex(recordKey, numFileGroups);
}
// change to configurable larger group
private static int mapRecordKeyToFileGroupIndex(String recordKey, int
numFileGroups) {
int h = 0;
for (int i = 0; i < recordKey.length(); ++i) {
h = 31 * h + recordKey.charAt(i);
}
return Math.abs(Math.abs(h) % numFileGroups);
} {code}
When calculating the hash value, if the key is null, we hit NPE. We need to fix
this.
h3. Solution 1 - hard coded value
So it is proposed to give a hard coded hash value "0" for null. If it is null,
return 0.
Pros:
* does not come with the over head of escaping. Even after escape, it just
gives another different fixed coded hash value of string "."
* easy to change, scope is well under control.
h3. Solution 2 - escape first then hash
null value is first escaped as ".", then calculate the hash value.
Cons:
* comes with the overhead of escaping whenever we want to get the hash value,
for all strings.
* Does not give much difference as we still end up with some fixed hash value.
> Secondary index readers and writers do not handle null char properly
> --------------------------------------------------------------------
>
> Key: HUDI-9543
> URL: https://issues.apache.org/jira/browse/HUDI-9543
> Project: Apache Hudi
> Issue Type: Bug
> Components: index
> Affects Versions: 1.1.0
> Reporter: Davis Zhang
> Assignee: Davis Zhang
> Priority: Critical
> Labels: pull-request-available
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Attempted fix and spot more issue
> https://github.com/Davis-Zhang-Onehouse/hudi-oss/pull/2
> I wrote test exposing all SI bugs when null values kicks in. Some code
> involves MDT write path which require deep dive. It will take some unknown
> amount of time to get clarity on how it can be fixed e2e.
>
>
> If secondary key column contains null, the SI does not track those records.
>
> Plan: Escape some char and use the unescaped version as the note for null str
> Use Null character (ASCII 0) to represent null
> for normal string, escape Null character (ASCII 0) to '\' + Null character
> (ASCII 0)
> Make sure reader and writer path works for this
>
>
> h1. Hash value computation
> As of today, the hash value computation is based on its unescaped value (raw
> value).
>
> {code:java}
> public static int mapRecordKeyToFileGroupIndex(
> String recordKey, int numFileGroups, String partitionName,
> HoodieIndexVersion version) {
> if (MetadataPartitionType.SECONDARY_INDEX.isPartitionType(partitionName)
> && version.greaterThanOrEquals(HoodieIndexVersion.SECONDARY_INDEX_TWO)
> && recordKey.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR)) {
> return
> mapRecordKeyToFileGroupIndex(SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(recordKey),
> numFileGroups);
> }
> return mapRecordKeyToFileGroupIndex(recordKey, numFileGroups);
> }
> // change to configurable larger group
> private static int mapRecordKeyToFileGroupIndex(String recordKey, int
> numFileGroups) {
> int h = 0;
> for (int i = 0; i < recordKey.length(); ++i) {
> h = 31 * h + recordKey.charAt(i);
> }
> return Math.abs(Math.abs(h) % numFileGroups);
> } {code}
>
> When calculating the hash value, if the key is null, we hit NPE. We need to
> fix this.
>
> h3. Solution 1 - hard coded value
> So it is proposed to give a hard coded hash value "0" for null. If it is
> null, return 0.
>
> Pros:
> * does not come with the over head of escaping. Even after escape, it just
> gives another different fixed coded hash value of string "."
> * easy to change, scope is well under control.
>
> h3. Solution 2 - escape first then hash
> null value is first escaped as ".", then calculate the hash value.
> Cons:
> * comes with the overhead of escaping whenever we want to get the hash
> value, for all strings.
> * Does not give much difference as we still end up with some fixed hash
> value.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)