The GitHub Actions job "Fory CI" on fury.git/main has failed. Run started by GitHub user pandalee99 (triggered by pandalee99).
Head commit for run: 57004e6f0b403b33041de945eeb34d1d4a36c1e1 / lzaeh <[email protected]> fix(go): fix metastringbytes inconsistency issue (#2269) <!-- **Thanks for contributing to Fory.** **If this is your first time opening a PR on fory, you can refer to [CONTRIBUTING.md](https://github.com/apache/fory/blob/main/CONTRIBUTING.md).** Contribution Checklist - The **Apache Fory (incubating)** community has restrictions on the naming of pr titles. You can also find instructions in [CONTRIBUTING.md](https://github.com/apache/fory/blob/main/CONTRIBUTING.md). - Fory has a strong focus on performance. If the PR you submit will have an impact on performance, please benchmark it first and provide the benchmark result here. --> ## What does this PR do? **Summary:** Fixes inconsistencies between the Go implementation of `MetaStringResolver` and the reference Cython implementation. Specifically, it corrects encoding logic, hash computation, and ensures buffer output matches Python’s behavior. Also adds comprehensive test coverage for empty strings, long strings, Chinese, and Japanese strings in both Go and Python. <!-- Describe the purpose of this PR. --> **Details:** 1. Adjusted `SmallStringThreshold` in `resolver.go` to match the Cython implementation (from 8 to 16), ensuring proper classification of short vs long strings. 2. Corrected encoding logic for both small and large strings. Updated hash computation to use 64-bit MurmurHash3 with seed 47, aligning with the Cython version. 3. Added an explicit check for empty strings in `meta_string_encoder.go` to default their encoding to `UTF_8`, avoiding incorrect assignment of `LOWER_SPECIAL` as seen previously. 4. Added new test cases in `test_metastring_resolver.py`, and created a Go test file `metastring_resolver_test.go` covering edge cases such as empty strings, multilingual strings, and long strings. **How did I discover the issue** I began by studying the Cython implementation of `MetaStringResolver` to understand the intended encoding and serialization logic. After becoming familiar with the Python logic, I reviewed the Go implementation and began debugging. - **First**, I noticed the `SmallStringThreshold` was inconsistent: Python uses 16, but Go had it set to 8. This caused strings like `"hello, world"` (12 bytes) to be treated as *long* strings in Go (leading to additional hashcode bytes written), while Python treated them as *short* strings — leading to buffer mismatches.  *Fix: Updated `SmallStringThreshold` to 16 in Go.* - **Second**, I found mismatches in the encoding byte. After investigating the source of the encoding logic in Go, I discovered the computation did not match the Python logic, and it assigned incorrect encoding values. <img width="1499" alt="image" src="https://github.com/user-attachments/assets/59fb3c18-07e6-4e72-abf0-b903b37a75e1" /> *Fix: Rewrote the encoding logic to match Python's criteria.* - **Third**, during testing of long strings, I found the hashcode between Go and Python differed. Upon inspection: - Go used a 128-bit Murmur3 hash with no seed. - Python used a 64-bit Murmur3 hash with seed 47. <img width="1268" alt="image" src="https://github.com/user-attachments/assets/43ca18c7-4a7f-4452-a066-f05c93f23987" /> *Fix: Replaced Go's hash function with a 64-bit Murmur3 with seed 47, and replicated the same bit-shifting and encoding masking.* - **Fourth**, while testing empty strings, I found that Go assigned them the encoding `0x01` (`LOWER_SPECIAL`), while Python correctly assigned `0x00` (`UTF_8`). Tracing through the encoding logic, I saw that Go's encoding decision relied on flags computed from a `for` loop — which was skipped for empty strings, causing all flags to remain `true` and defaulting to `LOWER_SPECIAL`. <img width="1264" alt="image" src="https://github.com/user-attachments/assets/6d40c0ef-2768-4a1d-b755-3d0d14abc004" /> The last image actually shows the result after a successful debug. Sorry for the confusion it may have caused. *Fix: Added an explicit empty string check in `ComputeEncoding`, returning `UTF_8` early.* - All these changes were validated by inspecting serialized buffers, and outputs were compared against Python’s output step-by-step. ## Related issues close #2268 <!-- Is there any related issue? Please attach here. - #xxxx0 - #xxxx1 - #xxxx2 --> ## Does this PR introduce any user-facing change? <!-- If any user-facing interface changes, please [open an issue](https://github.com/apache/fory/issues/new/choose) describing the need to do so and update the document if necessary. --> - [ ] Does this PR introduce any public API change? - [ ] Does this PR introduce any binary protocol compatibility change? ## Benchmark <!-- When the PR has an impact on performance (if you don't know whether the PR will have an impact on performance, you can submit the PR first, and if it will have impact on performance, the code reviewer will explain it), be sure to attach a benchmark data here. --> Report URL: https://github.com/apache/fury/actions/runs/15349581923 With regards, GitHub Actions via GitBox --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
