[jira] [Updated] (HUDI-9543) Secondary index readers and writers do not handle null char properly

Davis Zhang (Jira) Fri, 27 Jun 2025 16:20:25 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-9543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Davis Zhang updated HUDI-9543:
------------------------------
    Description: 
Attempted fix and spot more issue

https://github.com/Davis-Zhang-Onehouse/hudi-oss/pull/2

I wrote test exposing all SI bugs when null values kicks in. Some code involves 
MDT write path which require deep dive. It will take some unknown amount of 
time to get clarity on how it can be fixed e2e.

 

 

If secondary key column contains null, the SI does not track those records.

 

Plan: Escape some char and use the unescaped version as the note for null str

Use Null character (ASCII 0) to represent null

for normal string, escape Null character (ASCII 0) to '\' + Null character 
(ASCII 0)

Make sure reader and writer path works for this

 

 
h1. Hash value computation

As of today, the hash value computation is based on its unescaped value (raw 
value). 

 
{code:java}
public static int mapRecordKeyToFileGroupIndex(
    String recordKey, int numFileGroups, String partitionName, 
HoodieIndexVersion version) {
  if (MetadataPartitionType.SECONDARY_INDEX.isPartitionType(partitionName)
      && version.greaterThanOrEquals(HoodieIndexVersion.SECONDARY_INDEX_TWO)
      && recordKey.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR)) {
    return 
mapRecordKeyToFileGroupIndex(SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(recordKey),
 numFileGroups);
  }
  return mapRecordKeyToFileGroupIndex(recordKey, numFileGroups);
}

// change to configurable larger group
private static int mapRecordKeyToFileGroupIndex(String recordKey, int 
numFileGroups) {
  int h = 0;
  for (int i = 0; i < recordKey.length(); ++i) {
    h = 31 * h + recordKey.charAt(i);
  }

  return Math.abs(Math.abs(h) % numFileGroups);
} {code}
 

When calculating the hash value, if the key is null, we hit NPE. We need to fix 
this.

 
h3. Solution 1 - hard coded value

So it is proposed to give a hard coded hash value "0" for null. If it is null, 
return 0.

 

Pros:
 * does not come with the over head of escaping. Even after escape, it just 
gives another different fixed coded hash value of string "."
 * easy to change, scope is well under control.

 
h3. Solution 2 - escape first then hash

null value is first escaped as ".", then calculate the hash value.

Cons:
 * comes with the overhead of escaping whenever we want to get the hash value, 
for all strings.
 * Does not give much difference as we still end up with some fixed hash value.

  was:
Attempted fix and spot more issue

[https://github.com/apache/hudi/pull/13489/files/38cb553c2fdeb50a34e5db3b37807a06c2783132..133510ba81b32b69ff1387c7172923127266586a]

I wrote test exposing all SI bugs when null values kicks in. Some code involves 
MDT write path which require deep dive. It will take some unknown amount of 
time to get clarity on how it can be fixed e2e.

 

 

If secondary key column contains null, the SI does not track those records.

 

Plan: Escape some char and use the unescaped version as the note for null str

Use Null character (ASCII 0) to represent null

for normal string, escape Null character (ASCII 0) to '\' + Null character 
(ASCII 0)

Make sure reader and writer path works for this

 

 
h1. Hash value computation

As of today, the hash value computation is based on its unescaped value (raw 
value). 

 
{code:java}
public static int mapRecordKeyToFileGroupIndex(
    String recordKey, int numFileGroups, String partitionName, 
HoodieIndexVersion version) {
  if (MetadataPartitionType.SECONDARY_INDEX.isPartitionType(partitionName)
      && version.greaterThanOrEquals(HoodieIndexVersion.SECONDARY_INDEX_TWO)
      && recordKey.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR)) {
    return 
mapRecordKeyToFileGroupIndex(SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(recordKey),
 numFileGroups);
  }
  return mapRecordKeyToFileGroupIndex(recordKey, numFileGroups);
}

// change to configurable larger group
private static int mapRecordKeyToFileGroupIndex(String recordKey, int 
numFileGroups) {
  int h = 0;
  for (int i = 0; i < recordKey.length(); ++i) {
    h = 31 * h + recordKey.charAt(i);
  }

  return Math.abs(Math.abs(h) % numFileGroups);
} {code}
 

When calculating the hash value, if the key is null, we hit NPE. We need to fix 
this.

 
h3. Solution 1 - hard coded value

So it is proposed to give a hard coded hash value "0" for null. If it is null, 
return 0.

 

Pros:
 * does not come with the over head of escaping. Even after escape, it just 
gives another different fixed coded hash value of string "."
 * easy to change, scope is well under control.

 
h3. Solution 2 - escape first then hash

null value is first escaped as ".", then calculate the hash value.

Cons:
 * comes with the overhead of escaping whenever we want to get the hash value, 
for all strings.
 * Does not give much difference as we still end up with some fixed hash value.


> Secondary index readers and writers do not handle null char properly
> --------------------------------------------------------------------
>
>                 Key: HUDI-9543
>                 URL: https://issues.apache.org/jira/browse/HUDI-9543
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>    Affects Versions: 1.1.0
>            Reporter: Davis Zhang
>            Assignee: Davis Zhang
>            Priority: Critical
>              Labels: pull-request-available
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Attempted fix and spot more issue
> https://github.com/Davis-Zhang-Onehouse/hudi-oss/pull/2
> I wrote test exposing all SI bugs when null values kicks in. Some code 
> involves MDT write path which require deep dive. It will take some unknown 
> amount of time to get clarity on how it can be fixed e2e.
>  
>  
> If secondary key column contains null, the SI does not track those records.
>  
> Plan: Escape some char and use the unescaped version as the note for null str
> Use Null character (ASCII 0) to represent null
> for normal string, escape Null character (ASCII 0) to '\' + Null character 
> (ASCII 0)
> Make sure reader and writer path works for this
>  
>  
> h1. Hash value computation
> As of today, the hash value computation is based on its unescaped value (raw 
> value). 
>  
> {code:java}
> public static int mapRecordKeyToFileGroupIndex(
>     String recordKey, int numFileGroups, String partitionName, 
> HoodieIndexVersion version) {
>   if (MetadataPartitionType.SECONDARY_INDEX.isPartitionType(partitionName)
>       && version.greaterThanOrEquals(HoodieIndexVersion.SECONDARY_INDEX_TWO)
>       && recordKey.contains(SECONDARY_INDEX_RECORD_KEY_SEPARATOR)) {
>     return 
> mapRecordKeyToFileGroupIndex(SecondaryIndexKeyUtils.getSecondaryKeyFromSecondaryIndexKey(recordKey),
>  numFileGroups);
>   }
>   return mapRecordKeyToFileGroupIndex(recordKey, numFileGroups);
> }
> // change to configurable larger group
> private static int mapRecordKeyToFileGroupIndex(String recordKey, int 
> numFileGroups) {
>   int h = 0;
>   for (int i = 0; i < recordKey.length(); ++i) {
>     h = 31 * h + recordKey.charAt(i);
>   }
>   return Math.abs(Math.abs(h) % numFileGroups);
> } {code}
>  
> When calculating the hash value, if the key is null, we hit NPE. We need to 
> fix this.
>  
> h3. Solution 1 - hard coded value
> So it is proposed to give a hard coded hash value "0" for null. If it is 
> null, return 0.
>  
> Pros:
>  * does not come with the over head of escaping. Even after escape, it just 
> gives another different fixed coded hash value of string "."
>  * easy to change, scope is well under control.
>  
> h3. Solution 2 - escape first then hash
> null value is first escaped as ".", then calculate the hash value.
> Cons:
>  * comes with the overhead of escaping whenever we want to get the hash 
> value, for all strings.
>  * Does not give much difference as we still end up with some fixed hash 
> value.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HUDI-9543) Secondary index readers and writers do not handle null char properly

Reply via email to