[GH] (fury/main): Workflow run "Fory CI" failed!

GitBox Fri, 30 May 2025 11:15:35 -0700


The GitHub Actions job "Fory CI" on fury.git/main has failed.
Run started by GitHub user pandalee99 (triggered by pandalee99).


Head commit for run:
57004e6f0b403b33041de945eeb34d1d4a36c1e1 / lzaeh 
<[email protected]>
fix(go): fix metastringbytes inconsistency issue (#2269)

<!--
**Thanks for contributing to Fory.**

**If this is your first time opening a PR on fory, you can refer to
[CONTRIBUTING.md](https://github.com/apache/fory/blob/main/CONTRIBUTING.md).**

Contribution Checklist

- The **Apache Fory (incubating)** community has restrictions on the
naming of pr titles. You can also find instructions in
[CONTRIBUTING.md](https://github.com/apache/fory/blob/main/CONTRIBUTING.md).

- Fory has a strong focus on performance. If the PR you submit will have
an impact on performance, please benchmark it first and provide the
benchmark result here.
-->

## What does this PR do?
**Summary:**
Fixes inconsistencies between the Go implementation of
`MetaStringResolver` and the reference Cython implementation.
Specifically, it corrects encoding logic, hash computation, and ensures
buffer output matches Python’s behavior. Also adds comprehensive test
coverage for empty strings, long strings, Chinese, and Japanese strings
in both Go and Python.
<!-- Describe the purpose of this PR. -->

**Details:**

1. Adjusted `SmallStringThreshold` in `resolver.go` to match the Cython
implementation (from 8 to 16), ensuring proper classification of short
vs long strings.
2. Corrected encoding logic for both small and large strings. Updated
hash computation to use 64-bit MurmurHash3 with seed 47, aligning with
the Cython version.
3. Added an explicit check for empty strings in `meta_string_encoder.go`
to default their encoding to `UTF_8`, avoiding incorrect assignment of
`LOWER_SPECIAL` as seen previously.
4. Added new test cases in `test_metastring_resolver.py`, and created a
Go test file `metastring_resolver_test.go` covering edge cases such as
empty strings, multilingual strings, and long strings.

**How did I discover the issue**

I began by studying the Cython implementation of `MetaStringResolver` to
understand the intended encoding and serialization logic. After becoming
familiar with the Python logic, I reviewed the Go implementation and
began debugging.

- **First**, I noticed the `SmallStringThreshold` was inconsistent:
Python uses 16, but Go had it set to 8. This caused strings like
`"hello, world"` (12 bytes) to be treated as *long* strings in Go
(leading to additional hashcode bytes written), while Python treated
them as *short* strings — leading to buffer mismatches.

![image](https://github.com/user-attachments/assets/a95037a5-41d4-453d-9583-d8b5e75ef4d7)


  *Fix: Updated `SmallStringThreshold` to 16 in Go.*

- **Second**, I found mismatches in the encoding byte. After
investigating the source of the encoding logic in Go, I discovered the
computation did not match the Python logic, and it assigned incorrect
encoding values.
<img width="1499" alt="image"
src="https://github.com/user-attachments/assets/59fb3c18-07e6-4e72-abf0-b903b37a75e1";
/>

  *Fix: Rewrote the encoding logic to match Python's criteria.*

- **Third**, during testing of long strings, I found the hashcode
between Go and Python differed. Upon inspection:
  - Go used a 128-bit Murmur3 hash with no seed.
  - Python used a 64-bit Murmur3 hash with seed 47.
<img width="1268" alt="image"
src="https://github.com/user-attachments/assets/43ca18c7-4a7f-4452-a066-f05c93f23987";
/>

*Fix: Replaced Go's hash function with a 64-bit Murmur3 with seed 47,
and replicated the same bit-shifting and encoding masking.*

- **Fourth**, while testing empty strings, I found that Go assigned them
the encoding `0x01` (`LOWER_SPECIAL`), while Python correctly assigned
`0x00` (`UTF_8`). Tracing through the encoding logic, I saw that Go's
encoding decision relied on flags computed from a `for` loop — which was
skipped for empty strings, causing all flags to remain `true` and
defaulting to `LOWER_SPECIAL`.
<img width="1264" alt="image"
src="https://github.com/user-attachments/assets/6d40c0ef-2768-4a1d-b755-3d0d14abc004";
/>
The last image actually shows the result after a successful debug. Sorry
for the confusion it may have caused.

*Fix: Added an explicit empty string check in `ComputeEncoding`,
returning `UTF_8` early.*

- All these changes were validated by inspecting serialized buffers, and
outputs were compared against Python’s output step-by-step.
## Related issues
close #2268
<!--
Is there any related issue? Please attach here.

- #xxxx0
- #xxxx1
- #xxxx2
-->

## Does this PR introduce any user-facing change?

<!--
If any user-facing interface changes, please [open an
issue](https://github.com/apache/fory/issues/new/choose) describing the
need to do so and update the document if necessary.
-->

- [ ] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?

## Benchmark

<!--
When the PR has an impact on performance (if you don't know whether the
PR will have an impact on performance, you can submit the PR first, and
if it will have impact on performance, the code reviewer will explain
it), be sure to attach a benchmark data here.
-->

Report URL: https://github.com/apache/fury/actions/runs/15349581923

With regards,
GitHub Actions via GitBox


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GH] (fury/main): Workflow run "Fory CI" failed!

Reply via email to