lzaeh opened a new pull request, #2269:
URL: https://github.com/apache/fury/pull/2269

   <!--
   **Thanks for contributing to Fory.**
   
   **If this is your first time opening a PR on fory, you can refer to 
[CONTRIBUTING.md](https://github.com/apache/fory/blob/main/CONTRIBUTING.md).**
   
   Contribution Checklist
   
       - The **Apache Fory (incubating)** community has restrictions on the 
naming of pr titles. You can also find instructions in 
[CONTRIBUTING.md](https://github.com/apache/fory/blob/main/CONTRIBUTING.md).
   
       - Fory has a strong focus on performance. If the PR you submit will have 
an impact on performance, please benchmark it first and provide the benchmark 
result here.
   -->
   
   ## What does this PR do?
   **Summary:**
    Fixes inconsistencies between the Go implementation of `MetaStringResolver` 
and the reference Cython implementation. Specifically, it corrects encoding 
logic, hash computation, and ensures buffer output matches Python’s behavior. 
Also adds comprehensive test coverage for empty strings, long strings, Chinese, 
and Japanese strings in both Go and Python.
   <!-- Describe the purpose of this PR. -->
   
   **Details:**
   
   1. Adjusted `SmallStringThreshold` in `resolver.go` to match the Cython 
implementation (from 8 to 16), ensuring proper classification of short vs long 
strings.
   2. Corrected encoding logic for both small and large strings. Updated hash 
computation to use 64-bit MurmurHash3 with seed 47, aligning with the Cython 
version.
   3. Added an explicit check for empty strings in `meta_string_encoder.go` to 
default their encoding to `UTF_8`, avoiding incorrect assignment of 
`LOWER_SPECIAL` as seen previously.
   4. Added new test cases in `test_metastring_resolver.py`, and created a Go 
test file `metastring_resolver_test.go` covering edge cases such as empty 
strings, multilingual strings, and long strings.
   
   **How did I discover the issue**
   
   I began by studying the Cython implementation of `MetaStringResolver` to 
understand the intended encoding and serialization logic. After becoming 
familiar with the Python logic, I reviewed the Go implementation and began 
debugging.
   
   - **First**, I noticed the `SmallStringThreshold` was inconsistent: Python 
uses 16, but Go had it set to 8. This caused strings like `"hello, world"` (12 
bytes) to be treated as *long* strings in Go (leading to additional hashcode 
bytes written), while Python treated them as *short* strings — leading to 
buffer mismatches.
   
![image](https://github.com/user-attachments/assets/a95037a5-41d4-453d-9583-d8b5e75ef4d7)
   
   
     *Fix: Updated `SmallStringThreshold` to 16 in Go.*
   
   - **Second**, I found mismatches in the encoding byte. After investigating 
the source of the encoding logic in Go, I discovered the computation did not 
match the Python logic, and it assigned incorrect encoding values.
   <img width="1499" alt="image" 
src="https://github.com/user-attachments/assets/59fb3c18-07e6-4e72-abf0-b903b37a75e1";
 />
   
     *Fix: Rewrote the encoding logic to match Python's criteria.*
   
   - **Third**, during testing of long strings, I found the hashcode between Go 
and Python differed. Upon inspection:
     - Go used a 128-bit Murmur3 hash with no seed.
     - Python used a 64-bit Murmur3 hash with seed 47.
   <img width="1268" alt="image" 
src="https://github.com/user-attachments/assets/43ca18c7-4a7f-4452-a066-f05c93f23987";
 />
   
     *Fix: Replaced Go's hash function with a 64-bit Murmur3 with seed 47, and 
replicated the same bit-shifting and encoding masking.*
   
   - **Fourth**, while testing empty strings, I found that Go assigned them the 
encoding `0x01` (`LOWER_SPECIAL`), while Python correctly assigned `0x00` 
(`UTF_8`). Tracing through the encoding logic, I saw that Go's encoding 
decision relied on flags computed from a `for` loop — which was skipped for 
empty strings, causing all flags to remain `true` and defaulting to 
`LOWER_SPECIAL`.
   <img width="1264" alt="image" 
src="https://github.com/user-attachments/assets/6d40c0ef-2768-4a1d-b755-3d0d14abc004";
 />
   
     *Fix: Added an explicit empty string check in `ComputeEncoding`, returning 
`UTF_8` early.*
   
   - All these changes were validated by inspecting serialized buffers, and 
outputs were compared against Python’s output step-by-step.
   ## Related issues
   #close 2268
   <!--
   Is there any related issue? Please attach here.
   
   - #xxxx0
   - #xxxx1
   - #xxxx2
   -->
   
   ## Does this PR introduce any user-facing change?
   
   <!--
   If any user-facing interface changes, please [open an 
issue](https://github.com/apache/fory/issues/new/choose) describing the need to 
do so and update the document if necessary.
   -->
   
   - [x] Does this PR introduce any public API change?
   - [x] Does this PR introduce any binary protocol compatibility change?
   
   ## Benchmark
   
   <!--
   When the PR has an impact on performance (if you don't know whether the PR 
will have an impact on performance, you can submit the PR first, and if it will 
have impact on performance, the code reviewer will explain it), be sure to 
attach a benchmark data here.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to