This is an automated email from the ASF dual-hosted git repository.
pandalee pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fury.git
The following commit(s) were added to refs/heads/main by this push:
new 57004e6f fix(go): fix metastringbytes inconsistency issue (#2269)
57004e6f is described below
commit 57004e6f0b403b33041de945eeb34d1d4a36c1e1
Author: lzaeh <[email protected]>
AuthorDate: Fri May 30 22:58:25 2025 +0800
fix(go): fix metastringbytes inconsistency issue (#2269)
<!--
**Thanks for contributing to Fory.**
**If this is your first time opening a PR on fory, you can refer to
[CONTRIBUTING.md](https://github.com/apache/fory/blob/main/CONTRIBUTING.md).**
Contribution Checklist
- The **Apache Fory (incubating)** community has restrictions on the
naming of pr titles. You can also find instructions in
[CONTRIBUTING.md](https://github.com/apache/fory/blob/main/CONTRIBUTING.md).
- Fory has a strong focus on performance. If the PR you submit will have
an impact on performance, please benchmark it first and provide the
benchmark result here.
-->
## What does this PR do?
**Summary:**
Fixes inconsistencies between the Go implementation of
`MetaStringResolver` and the reference Cython implementation.
Specifically, it corrects encoding logic, hash computation, and ensures
buffer output matches Python’s behavior. Also adds comprehensive test
coverage for empty strings, long strings, Chinese, and Japanese strings
in both Go and Python.
<!-- Describe the purpose of this PR. -->
**Details:**
1. Adjusted `SmallStringThreshold` in `resolver.go` to match the Cython
implementation (from 8 to 16), ensuring proper classification of short
vs long strings.
2. Corrected encoding logic for both small and large strings. Updated
hash computation to use 64-bit MurmurHash3 with seed 47, aligning with
the Cython version.
3. Added an explicit check for empty strings in `meta_string_encoder.go`
to default their encoding to `UTF_8`, avoiding incorrect assignment of
`LOWER_SPECIAL` as seen previously.
4. Added new test cases in `test_metastring_resolver.py`, and created a
Go test file `metastring_resolver_test.go` covering edge cases such as
empty strings, multilingual strings, and long strings.
**How did I discover the issue**
I began by studying the Cython implementation of `MetaStringResolver` to
understand the intended encoding and serialization logic. After becoming
familiar with the Python logic, I reviewed the Go implementation and
began debugging.
- **First**, I noticed the `SmallStringThreshold` was inconsistent:
Python uses 16, but Go had it set to 8. This caused strings like
`"hello, world"` (12 bytes) to be treated as *long* strings in Go
(leading to additional hashcode bytes written), while Python treated
them as *short* strings — leading to buffer mismatches.

*Fix: Updated `SmallStringThreshold` to 16 in Go.*
- **Second**, I found mismatches in the encoding byte. After
investigating the source of the encoding logic in Go, I discovered the
computation did not match the Python logic, and it assigned incorrect
encoding values.
<img width="1499" alt="image"
src="https://github.com/user-attachments/assets/59fb3c18-07e6-4e72-abf0-b903b37a75e1"
/>
*Fix: Rewrote the encoding logic to match Python's criteria.*
- **Third**, during testing of long strings, I found the hashcode
between Go and Python differed. Upon inspection:
- Go used a 128-bit Murmur3 hash with no seed.
- Python used a 64-bit Murmur3 hash with seed 47.
<img width="1268" alt="image"
src="https://github.com/user-attachments/assets/43ca18c7-4a7f-4452-a066-f05c93f23987"
/>
*Fix: Replaced Go's hash function with a 64-bit Murmur3 with seed 47,
and replicated the same bit-shifting and encoding masking.*
- **Fourth**, while testing empty strings, I found that Go assigned them
the encoding `0x01` (`LOWER_SPECIAL`), while Python correctly assigned
`0x00` (`UTF_8`). Tracing through the encoding logic, I saw that Go's
encoding decision relied on flags computed from a `for` loop — which was
skipped for empty strings, causing all flags to remain `true` and
defaulting to `LOWER_SPECIAL`.
<img width="1264" alt="image"
src="https://github.com/user-attachments/assets/6d40c0ef-2768-4a1d-b755-3d0d14abc004"
/>
The last image actually shows the result after a successful debug. Sorry
for the confusion it may have caused.
*Fix: Added an explicit empty string check in `ComputeEncoding`,
returning `UTF_8` early.*
- All these changes were validated by inspecting serialized buffers, and
outputs were compared against Python’s output step-by-step.
## Related issues
close #2268
<!--
Is there any related issue? Please attach here.
- #xxxx0
- #xxxx1
- #xxxx2
-->
## Does this PR introduce any user-facing change?
<!--
If any user-facing interface changes, please [open an
issue](https://github.com/apache/fory/issues/new/choose) describing the
need to do so and update the document if necessary.
-->
- [ ] Does this PR introduce any public API change?
- [ ] Does this PR introduce any binary protocol compatibility change?
## Benchmark
<!--
When the PR has an impact on performance (if you don't know whether the
PR will have an impact on performance, you can submit the PR first, and
if it will have impact on performance, the code reviewer will explain
it), be sure to attach a benchmark data here.
-->
---
go/fory/meta/meta_string_encoder.go | 4 +
go/fory/resolver.go | 12 ++-
go/fory/test/metastring_resolver_test.go | 99 +++++++++++++++++++++++++
python/pyfory/tests/test_metastring_resolver.py | 29 ++++++++
4 files changed, 137 insertions(+), 7 deletions(-)
diff --git a/go/fory/meta/meta_string_encoder.go
b/go/fory/meta/meta_string_encoder.go
index 419e41e2..171eff42 100644
--- a/go/fory/meta/meta_string_encoder.go
+++ b/go/fory/meta/meta_string_encoder.go
@@ -157,6 +157,10 @@ func (e *Encoder) EncodeGeneric(chars []byte, bitsPerChar
int) (result []byte, e
}
func (e *Encoder) ComputeEncoding(input string) Encoding {
+ // Special case for empty string: default to UTF_8 encoding
+ if len(input) == 0 {
+ return UTF_8
+ }
statistics := e.computeStringStatistics(input)
if statistics.canLowerSpecialEncoded {
return LOWER_SPECIAL
diff --git a/go/fory/resolver.go b/go/fory/resolver.go
index b8d5c3e1..f47938c0 100644
--- a/go/fory/resolver.go
+++ b/go/fory/resolver.go
@@ -27,7 +27,7 @@ import (
// Constants for string handling
const (
- SmallStringThreshold = 8 // Maximum length for "small" strings
+ SmallStringThreshold = 16 // Maximum length for "small" strings
DefaultDynamicWriteMetaStrID = -1 // Default ID for dynamic strings
)
@@ -223,14 +223,12 @@ func (r *MetaStringResolver) GetMetaStrBytes(metastr
*meta.MetaString) *MetaStri
binary.Read(bytes.NewReader(data[:8]),
binary.LittleEndian, &v1)
v2 = bytesToInt64(data[8:])
}
- hashcode = ((v1*31 + v2) >> 8 << 8) |
int64(metastr.GetEncodedBytes()[0])
+ hashcode = ((v1*31 + v2) >> 8 << 8) |
int64(metastr.GetEncoding())
} else {
// Large string: use MurmurHash3
- hash := murmur3.New128()
- hash.Write(data)
- h1, h2 := hash.Sum128()
- hashcode = (int64(h1)<<32 | int64(h2)) >> 8 << 8
- hashcode |= int64(metastr.GetEncodedBytes()[0])
+ h64 := murmur3.Sum64WithSeed(data, 47)
+ hashcode = int64((h64 >> 8) << 8)
+ hashcode |= int64(metastr.GetEncoding())
}
// Create and cache new instance
diff --git a/go/fory/test/metastring_resolver_test.go
b/go/fory/test/metastring_resolver_test.go
new file mode 100644
index 00000000..58dc0943
--- /dev/null
+++ b/go/fory/test/metastring_resolver_test.go
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+// http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package fory
+
+import (
+ "bytes"
+ "strings"
+ "testing"
+
+ "github.com/apache/fory/go/fory/meta"
+)
+
+func TestMetaStringResolver(t *testing.T) {
+ resolver := NewMetaStringResolver()
+ encoder := meta.NewEncoder('$', '_')
+ buffer := NewByteBuffer(make([]byte, 512)) // Allocate enough space
+
+ // Test 1: Regular English string
+ metaStr1, _ := encoder.Encode("hello, world")
+ metaBytes1 := resolver.GetMetaStrBytes(&metaStr1)
+ if err := resolver.WriteMetaStringBytes(buffer, metaBytes1); err != nil
{
+ t.Fatalf("write failed: %v", err)
+ }
+ got1, _ := resolver.ReadMetaStringBytes(buffer)
+ if got1.Hashcode != metaBytes1.Hashcode || !bytes.Equal(got1.Data,
metaBytes1.Data) {
+ t.Errorf("Mismatch in English string")
+ }
+
+ // Test 2: Manually constructed MetaStringBytes
+ data2 := []byte{0xBF, 0x05, 0xA4, 0x71, 0xA9, 0x92, 0x53, 0x96, 0xA6,
0x49, 0x4F, 0x72, 0x9C, 0x68, 0x29, 0x80}
+ metaBytes2 := NewMetaStringBytes(data2, int64(-5456063526933366015))
+ if err := resolver.WriteMetaStringBytes(buffer, metaBytes2); err != nil
{
+ t.Fatalf("write failed: %v", err)
+ }
+ got2, _ := resolver.ReadMetaStringBytes(buffer)
+ if got2.Hashcode != metaBytes2.Hashcode || !bytes.Equal(got2.Data,
metaBytes2.Data) {
+ t.Errorf("Mismatch in custom data")
+ }
+
+ // Test 3: Empty string
+ metaStr3, _ := encoder.Encode("")
+ metaBytes3 := resolver.GetMetaStrBytes(&metaStr3)
+ if err := resolver.WriteMetaStringBytes(buffer, metaBytes3); err != nil
{
+ t.Fatalf("write failed: %v", err)
+ }
+ got3, _ := resolver.ReadMetaStringBytes(buffer)
+ if got3.Hashcode != metaBytes3.Hashcode || !bytes.Equal(got3.Data,
metaBytes3.Data) {
+ t.Errorf("Mismatch in empty string")
+ }
+
+ // Test 4: Chinese string
+ metaStr4, _ := encoder.Encode("你好,世界")
+ metaBytes4 := resolver.GetMetaStrBytes(&metaStr4)
+ if err := resolver.WriteMetaStringBytes(buffer, metaBytes4); err != nil
{
+ t.Fatalf("write failed: %v", err)
+ }
+ got4, _ := resolver.ReadMetaStringBytes(buffer)
+ if got4.Hashcode != metaBytes4.Hashcode || !bytes.Equal(got4.Data,
metaBytes4.Data) {
+ t.Errorf("Mismatch in Chinese string")
+ }
+
+ // Test 5: Japanese string (more than 16 bytes, triggers hash-based
encoding)
+ metaStr5, _ := encoder.Encode("こんにちは世界")
+ metaBytes5 := resolver.GetMetaStrBytes(&metaStr5)
+ if err := resolver.WriteMetaStringBytes(buffer, metaBytes5); err != nil
{
+ t.Fatalf("write failed: %v", err)
+ }
+ got5, _ := resolver.ReadMetaStringBytes(buffer)
+ if got5.Hashcode != metaBytes5.Hashcode || !bytes.Equal(got5.Data,
metaBytes5.Data) {
+ t.Errorf("Mismatch in Japanese string")
+ }
+
+ // Test 6: Long string (more than 16 bytes, triggers hash-based
encoding)
+ longStr := strings.Repeat("hello, world", 10)
+ metaStr6, _ := encoder.Encode(longStr)
+ metaBytes6 := resolver.GetMetaStrBytes(&metaStr6)
+ if err := resolver.WriteMetaStringBytes(buffer, metaBytes6); err != nil
{
+ t.Fatalf("write failed: %v", err)
+ }
+ got6, _ := resolver.ReadMetaStringBytes(buffer)
+ if got6.Hashcode != metaBytes6.Hashcode || !bytes.Equal(got6.Data,
metaBytes6.Data) {
+ t.Errorf("Mismatch in long string")
+ }
+}
diff --git a/python/pyfory/tests/test_metastring_resolver.py
b/python/pyfory/tests/test_metastring_resolver.py
index 22846afb..a2ecd5af 100644
--- a/python/pyfory/tests/test_metastring_resolver.py
+++ b/python/pyfory/tests/test_metastring_resolver.py
@@ -23,14 +23,43 @@ from pyfory.meta.metastring import MetaStringEncoder
def test_metastring_resolver():
resolver = MetaStringResolver()
encoder = MetaStringEncoder("$", "_")
+
+ # Test 1: Regular English string
metastr1 = encoder.encode("hello, world")
metabytes1 = resolver.get_metastr_bytes(metastr1)
buffer = Buffer.allocate(32)
resolver.write_meta_string_bytes(buffer, metabytes1)
assert resolver.read_meta_string_bytes(buffer) == metabytes1
+
+ # Test 2: Manually constructed MetaStringBytes
metabytes2 = MetaStringBytes(
data=b"\xbf\x05\xa4q\xa9\x92S\x96\xa6IOr\x9ch)\x80",
hashcode=-5456063526933366015,
)
resolver.write_meta_string_bytes(buffer, metabytes2)
assert resolver.read_meta_string_bytes(buffer) == metabytes2
+
+ # Test 3: Empty string
+ metastr_null = encoder.encode("")
+ metabytes_null = resolver.get_metastr_bytes(metastr_null)
+ resolver.write_meta_string_bytes(buffer, metabytes_null)
+ assert resolver.read_meta_string_bytes(buffer) == metabytes_null
+
+ # Test 4: Chinese string
+ metastr_cn = encoder.encode("你好,世界")
+ metabytes_cn = resolver.get_metastr_bytes(metastr_cn)
+ resolver.write_meta_string_bytes(buffer, metabytes_cn)
+ assert resolver.read_meta_string_bytes(buffer) == metabytes_cn
+
+ # Test 5: Japanese string (more than 16 bytes, triggers hash-based
encoding)
+ metastr_jp = encoder.encode("こんにちは世界")
+ metabytes_jp = resolver.get_metastr_bytes(metastr_jp)
+ resolver.write_meta_string_bytes(buffer, metabytes_jp)
+ assert resolver.read_meta_string_bytes(buffer) == metabytes_jp
+
+ # Test 6: Long string (more than 16 bytes, triggers hash-based encoding)
+ long_str = "hello, world" * 10
+ metastr_long = encoder.encode(long_str)
+ metabytes_long = resolver.get_metastr_bytes(metastr_long)
+ resolver.write_meta_string_bytes(buffer, metabytes_long)
+ assert resolver.read_meta_string_bytes(buffer) == metabytes_long
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]