[ 
https://issues.apache.org/jira/browse/HUDI-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davis Zhang updated HUDI-8489:
------------------------------
    Description: 
The secondary index key is a combination of secondaryKey and recordKey: the 
payload key has the format "secondaryKey$primaryKey". There are two ways to 
encode the two parts so that they can be joined with a delimiter ($):
 # Base64-encode each part: `Base64.encode(secondaryKey) + DELIMITER + 
Base64.encode(recordKey)`. The Base64 alphabet does not contain $, so the 
delimiter is unambiguous. This is a neat, standard scheme, though it might 
not be very efficient for long strings.
 # Escape special characters: `escapeSpecialChars(secondaryKey) + DELIMITER + 
escapeSpecialChars(recordKey)`. The keys stay readable and their order is 
preserved, but this is a custom scheme not used in other systems.
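
To make the two options concrete, here is a minimal sketch of both. It is not 
the actual Hudi implementation: the class name, the choice of backslash as the 
escape character, and the decode helpers are assumptions made for 
illustration. Only `escapeSpecialChars` and the `$` delimiter come from the 
description above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical sketch of the two encodings discussed above; not Hudi code.
class SecondaryIndexKeySketch {
    static final char DELIMITER = '$';
    static final char ESCAPE = '\\'; // assumed escape character

    // Option 1: Base64-encode each part, then join with the delimiter.
    static String encodeBase64(String secondaryKey, String recordKey) {
        Base64.Encoder enc = Base64.getEncoder();
        return enc.encodeToString(secondaryKey.getBytes(StandardCharsets.UTF_8))
            + DELIMITER
            + enc.encodeToString(recordKey.getBytes(StandardCharsets.UTF_8));
    }

    // '$' is not in the Base64 alphabet, so the first '$' is the delimiter.
    static String[] decodeBase64(String key) {
        int i = key.indexOf(DELIMITER);
        Base64.Decoder dec = Base64.getDecoder();
        return new String[] {
            new String(dec.decode(key.substring(0, i)), StandardCharsets.UTF_8),
            new String(dec.decode(key.substring(i + 1)), StandardCharsets.UTF_8)
        };
    }

    // Option 2: prefix every delimiter or escape character with the escape character.
    static String escapeSpecialChars(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == ESCAPE || c == DELIMITER) {
                sb.append(ESCAPE);
            }
            sb.append(c);
        }
        return sb.toString();
    }

    static String encodeEscaped(String secondaryKey, String recordKey) {
        return escapeSpecialChars(secondaryKey) + DELIMITER + escapeSpecialChars(recordKey);
    }

    // Split on the first unescaped delimiter and drop the escape characters.
    static String[] decodeEscaped(String key) {
        StringBuilder current = new StringBuilder();
        String first = null;
        for (int i = 0; i < key.length(); i++) {
            char c = key.charAt(i);
            if (c == ESCAPE && i + 1 < key.length()) {
                current.append(key.charAt(++i)); // keep the escaped character as-is
            } else if (c == DELIMITER && first == null) {
                first = current.toString();
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        return new String[] {first, current.toString()};
    }

    public static void main(String[] args) {
        String secondaryKey = "a$b";
        String recordKey = "row\\1";
        System.out.println(encodeBase64(secondaryKey, recordKey));
        System.out.println(encodeEscaped(secondaryKey, recordKey));
    }
}
```

Both variants round-trip keys that themselves contain `$`. Note that the 
standard Base64 alphabet is not in ASCII order, so the Base64 form does not in 
general keep the lexicographic order of the original keys.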

I ran a benchmark comparing the encoding/decoding time of the two options and 
did not find much difference: 
[https://gist.github.com/codope/b1c73abed748d77c0b4db974d835f9da]

  was:
Secondary index key is a combination of secondaryKey and recordKey. There are 
two ways to encode with a delimiter ($):
 # Run base64 encoding: `Base64.encode(secondaryKey) + DELIMITER + 
Base64.encode(recordKey)`.  Base64 does not map to $. So, this gives us a neat 
and standard way to encode. Might not be very efficient for long strings? But, 
base64 is a standard scheme.
 # Escape special characters:  `escapeSpecialChars(secondaryKey) + DELIMITER + 
escapeSpecialChars(recordKey)`. The keys are readable and preserves the order. 
This is a custom scheme not used in other systems.

Ran a benchmark to compare encoding/decoding time and did not find much 
difference - https://gist.github.com/codope/b1c73abed748d77c0b4db974d835f9da


> Fix encoding of secondary index key
> -----------------------------------
>
>                 Key: HUDI-8489
>                 URL: https://issues.apache.org/jira/browse/HUDI-8489
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Assignee: Sagar Sumit
>            Priority: Blocker
>             Fix For: 1.1.0



