This is an automated email from the ASF dual-hosted git repository.

pandalee pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fury.git


The following commit(s) were added to refs/heads/main by this push:
     new 57004e6f fix(go): fix metastringbytes inconsistency issue (#2269)
57004e6f is described below

commit 57004e6f0b403b33041de945eeb34d1d4a36c1e1
Author: lzaeh <[email protected]>
AuthorDate: Fri May 30 22:58:25 2025 +0800

    fix(go): fix metastringbytes inconsistency issue (#2269)
    
    <!--
    **Thanks for contributing to Fory.**
    
    **If this is your first time opening a PR on fory, you can refer to
    
[CONTRIBUTING.md](https://github.com/apache/fory/blob/main/CONTRIBUTING.md).**
    
    Contribution Checklist
    
    - The **Apache Fory (incubating)** community has restrictions on the
    naming of pr titles. You can also find instructions in
    [CONTRIBUTING.md](https://github.com/apache/fory/blob/main/CONTRIBUTING.md).
    
    - Fory has a strong focus on performance. If the PR you submit will have
    an impact on performance, please benchmark it first and provide the
    benchmark result here.
    -->
    
    ## What does this PR do?
    **Summary:**
    Fixes inconsistencies between the Go implementation of
    `MetaStringResolver` and the reference Cython implementation.
    Specifically, it corrects encoding logic, hash computation, and ensures
    buffer output matches Python’s behavior. Also adds comprehensive test
    coverage for empty strings, long strings, Chinese, and Japanese strings
    in both Go and Python.
    <!-- Describe the purpose of this PR. -->
    
    **Details:**
    
    1. Adjusted `SmallStringThreshold` in `resolver.go` to match the Cython
    implementation (from 8 to 16), ensuring proper classification of short
    vs long strings.
    2. Corrected encoding logic for both small and large strings. Updated
    hash computation to use 64-bit MurmurHash3 with seed 47, aligning with
    the Cython version.
    3. Added an explicit check for empty strings in `meta_string_encoder.go`
    to default their encoding to `UTF_8`, avoiding incorrect assignment of
    `LOWER_SPECIAL` as seen previously.
    4. Added new test cases in `test_metastring_resolver.py`, and created a
    Go test file `metastring_resolver_test.go` covering edge cases such as
    empty strings, multilingual strings, and long strings.
    
    **How did I discover the issue**
    
    I began by studying the Cython implementation of `MetaStringResolver` to
    understand the intended encoding and serialization logic. After becoming
    familiar with the Python logic, I reviewed the Go implementation and
    began debugging.
    
    - **First**, I noticed the `SmallStringThreshold` was inconsistent:
    Python uses 16, but Go had it set to 8. This caused strings like
    `"hello, world"` (12 bytes) to be treated as *long* strings in Go
    (leading to additional hashcode bytes written), while Python treated
    them as *short* strings — leading to buffer mismatches.
    
    
![image](https://github.com/user-attachments/assets/a95037a5-41d4-453d-9583-d8b5e75ef4d7)
    
    
      *Fix: Updated `SmallStringThreshold` to 16 in Go.*
    
    - **Second**, I found mismatches in the encoding byte. After
    investigating the source of the encoding logic in Go, I discovered the
    computation did not match the Python logic, and it assigned incorrect
    encoding values.
    <img width="1499" alt="image"
    
src="https://github.com/user-attachments/assets/59fb3c18-07e6-4e72-abf0-b903b37a75e1";
    />
    
      *Fix: Rewrote the encoding logic to match Python's criteria.*
    
    - **Third**, during testing of long strings, I found the hashcode
    between Go and Python differed. Upon inspection:
      - Go used a 128-bit Murmur3 hash with no seed.
      - Python used a 64-bit Murmur3 hash with seed 47.
    <img width="1268" alt="image"
    
src="https://github.com/user-attachments/assets/43ca18c7-4a7f-4452-a066-f05c93f23987";
    />
    
    *Fix: Replaced Go's hash function with a 64-bit Murmur3 with seed 47,
    and replicated the same bit-shifting and encoding masking.*
    
    - **Fourth**, while testing empty strings, I found that Go assigned them
    the encoding `0x01` (`LOWER_SPECIAL`), while Python correctly assigned
    `0x00` (`UTF_8`). Tracing through the encoding logic, I saw that Go's
    encoding decision relied on flags computed from a `for` loop — which was
    skipped for empty strings, causing all flags to remain `true` and
    defaulting to `LOWER_SPECIAL`.
    <img width="1264" alt="image"
    
src="https://github.com/user-attachments/assets/6d40c0ef-2768-4a1d-b755-3d0d14abc004";
    />
    The last image actually shows the result after a successful debug. Sorry
    for the confusion it may have caused.
    
    *Fix: Added an explicit empty string check in `ComputeEncoding`,
    returning `UTF_8` early.*
    
    - All these changes were validated by inspecting serialized buffers, and
    outputs were compared against Python’s output step-by-step.
    ## Related issues
    close #2268
    <!--
    Is there any related issue? Please attach here.
    
    - #xxxx0
    - #xxxx1
    - #xxxx2
    -->
    
    ## Does this PR introduce any user-facing change?
    
    <!--
    If any user-facing interface changes, please [open an
    issue](https://github.com/apache/fory/issues/new/choose) describing the
    need to do so and update the document if necessary.
    -->
    
    - [ ] Does this PR introduce any public API change?
    - [ ] Does this PR introduce any binary protocol compatibility change?
    
    ## Benchmark
    
    <!--
    When the PR has an impact on performance (if you don't know whether the
    PR will have an impact on performance, you can submit the PR first, and
    if it will have impact on performance, the code reviewer will explain
    it), be sure to attach a benchmark data here.
    -->
---
 go/fory/meta/meta_string_encoder.go             |  4 +
 go/fory/resolver.go                             | 12 ++-
 go/fory/test/metastring_resolver_test.go        | 99 +++++++++++++++++++++++++
 python/pyfory/tests/test_metastring_resolver.py | 29 ++++++++
 4 files changed, 137 insertions(+), 7 deletions(-)

diff --git a/go/fory/meta/meta_string_encoder.go 
b/go/fory/meta/meta_string_encoder.go
index 419e41e2..171eff42 100644
--- a/go/fory/meta/meta_string_encoder.go
+++ b/go/fory/meta/meta_string_encoder.go
@@ -157,6 +157,10 @@ func (e *Encoder) EncodeGeneric(chars []byte, bitsPerChar 
int) (result []byte, e
 }
 
 func (e *Encoder) ComputeEncoding(input string) Encoding {
+       // Special case for empty string: default to UTF_8 encoding
+       if len(input) == 0 {
+               return UTF_8
+       }
        statistics := e.computeStringStatistics(input)
        if statistics.canLowerSpecialEncoded {
                return LOWER_SPECIAL
diff --git a/go/fory/resolver.go b/go/fory/resolver.go
index b8d5c3e1..f47938c0 100644
--- a/go/fory/resolver.go
+++ b/go/fory/resolver.go
@@ -27,7 +27,7 @@ import (
 
 // Constants for string handling
 const (
-       SmallStringThreshold         = 8  // Maximum length for "small" strings
+       SmallStringThreshold         = 16 // Maximum length for "small" strings
        DefaultDynamicWriteMetaStrID = -1 // Default ID for dynamic strings
 )
 
@@ -223,14 +223,12 @@ func (r *MetaStringResolver) GetMetaStrBytes(metastr 
*meta.MetaString) *MetaStri
                        binary.Read(bytes.NewReader(data[:8]), 
binary.LittleEndian, &v1)
                        v2 = bytesToInt64(data[8:])
                }
-               hashcode = ((v1*31 + v2) >> 8 << 8) | 
int64(metastr.GetEncodedBytes()[0])
+               hashcode = ((v1*31 + v2) >> 8 << 8) | 
int64(metastr.GetEncoding())
        } else {
                // Large string: use MurmurHash3
-               hash := murmur3.New128()
-               hash.Write(data)
-               h1, h2 := hash.Sum128()
-               hashcode = (int64(h1)<<32 | int64(h2)) >> 8 << 8
-               hashcode |= int64(metastr.GetEncodedBytes()[0])
+               h64 := murmur3.Sum64WithSeed(data, 47)
+               hashcode = int64((h64 >> 8) << 8)
+               hashcode |= int64(metastr.GetEncoding())
        }
 
        // Create and cache new instance
diff --git a/go/fory/test/metastring_resolver_test.go 
b/go/fory/test/metastring_resolver_test.go
new file mode 100644
index 00000000..58dc0943
--- /dev/null
+++ b/go/fory/test/metastring_resolver_test.go
@@ -0,0 +1,99 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+package fory
+
+import (
+       "bytes"
+       "strings"
+       "testing"
+
+       "github.com/apache/fory/go/fory/meta"
+)
+
+func TestMetaStringResolver(t *testing.T) {
+       resolver := NewMetaStringResolver()
+       encoder := meta.NewEncoder('$', '_')
+       buffer := NewByteBuffer(make([]byte, 512)) // Allocate enough space
+
+       // Test 1: Regular English string
+       metaStr1, _ := encoder.Encode("hello, world")
+       metaBytes1 := resolver.GetMetaStrBytes(&metaStr1)
+       if err := resolver.WriteMetaStringBytes(buffer, metaBytes1); err != nil 
{
+               t.Fatalf("write failed: %v", err)
+       }
+       got1, _ := resolver.ReadMetaStringBytes(buffer)
+       if got1.Hashcode != metaBytes1.Hashcode || !bytes.Equal(got1.Data, 
metaBytes1.Data) {
+               t.Errorf("Mismatch in English string")
+       }
+
+       // Test 2: Manually constructed MetaStringBytes
+       data2 := []byte{0xBF, 0x05, 0xA4, 0x71, 0xA9, 0x92, 0x53, 0x96, 0xA6, 
0x49, 0x4F, 0x72, 0x9C, 0x68, 0x29, 0x80}
+       metaBytes2 := NewMetaStringBytes(data2, int64(-5456063526933366015))
+       if err := resolver.WriteMetaStringBytes(buffer, metaBytes2); err != nil 
{
+               t.Fatalf("write failed: %v", err)
+       }
+       got2, _ := resolver.ReadMetaStringBytes(buffer)
+       if got2.Hashcode != metaBytes2.Hashcode || !bytes.Equal(got2.Data, 
metaBytes2.Data) {
+               t.Errorf("Mismatch in custom data")
+       }
+
+       // Test 3: Empty string
+       metaStr3, _ := encoder.Encode("")
+       metaBytes3 := resolver.GetMetaStrBytes(&metaStr3)
+       if err := resolver.WriteMetaStringBytes(buffer, metaBytes3); err != nil 
{
+               t.Fatalf("write failed: %v", err)
+       }
+       got3, _ := resolver.ReadMetaStringBytes(buffer)
+       if got3.Hashcode != metaBytes3.Hashcode || !bytes.Equal(got3.Data, 
metaBytes3.Data) {
+               t.Errorf("Mismatch in empty string")
+       }
+
+       // Test 4: Chinese string
+       metaStr4, _ := encoder.Encode("你好,世界")
+       metaBytes4 := resolver.GetMetaStrBytes(&metaStr4)
+       if err := resolver.WriteMetaStringBytes(buffer, metaBytes4); err != nil 
{
+               t.Fatalf("write failed: %v", err)
+       }
+       got4, _ := resolver.ReadMetaStringBytes(buffer)
+       if got4.Hashcode != metaBytes4.Hashcode || !bytes.Equal(got4.Data, 
metaBytes4.Data) {
+               t.Errorf("Mismatch in Chinese string")
+       }
+
+       // Test 5: Japanese string (more than 16 bytes, triggers hash-based 
encoding)
+       metaStr5, _ := encoder.Encode("こんにちは世界")
+       metaBytes5 := resolver.GetMetaStrBytes(&metaStr5)
+       if err := resolver.WriteMetaStringBytes(buffer, metaBytes5); err != nil 
{
+               t.Fatalf("write failed: %v", err)
+       }
+       got5, _ := resolver.ReadMetaStringBytes(buffer)
+       if got5.Hashcode != metaBytes5.Hashcode || !bytes.Equal(got5.Data, 
metaBytes5.Data) {
+               t.Errorf("Mismatch in Japanese string")
+       }
+
+       // Test 6: Long string (more than 16 bytes, triggers hash-based 
encoding)
+       longStr := strings.Repeat("hello, world", 10)
+       metaStr6, _ := encoder.Encode(longStr)
+       metaBytes6 := resolver.GetMetaStrBytes(&metaStr6)
+       if err := resolver.WriteMetaStringBytes(buffer, metaBytes6); err != nil 
{
+               t.Fatalf("write failed: %v", err)
+       }
+       got6, _ := resolver.ReadMetaStringBytes(buffer)
+       if got6.Hashcode != metaBytes6.Hashcode || !bytes.Equal(got6.Data, 
metaBytes6.Data) {
+               t.Errorf("Mismatch in long string")
+       }
+}
diff --git a/python/pyfory/tests/test_metastring_resolver.py 
b/python/pyfory/tests/test_metastring_resolver.py
index 22846afb..a2ecd5af 100644
--- a/python/pyfory/tests/test_metastring_resolver.py
+++ b/python/pyfory/tests/test_metastring_resolver.py
@@ -23,14 +23,43 @@ from pyfory.meta.metastring import MetaStringEncoder
 def test_metastring_resolver():
     resolver = MetaStringResolver()
     encoder = MetaStringEncoder("$", "_")
+
+    # Test 1: Regular English string
     metastr1 = encoder.encode("hello, world")
     metabytes1 = resolver.get_metastr_bytes(metastr1)
     buffer = Buffer.allocate(32)
     resolver.write_meta_string_bytes(buffer, metabytes1)
     assert resolver.read_meta_string_bytes(buffer) == metabytes1
+
+    # Test 2: Manually constructed MetaStringBytes
     metabytes2 = MetaStringBytes(
         data=b"\xbf\x05\xa4q\xa9\x92S\x96\xa6IOr\x9ch)\x80",
         hashcode=-5456063526933366015,
     )
     resolver.write_meta_string_bytes(buffer, metabytes2)
     assert resolver.read_meta_string_bytes(buffer) == metabytes2
+
+    # Test 3: Empty string
+    metastr_null = encoder.encode("")
+    metabytes_null = resolver.get_metastr_bytes(metastr_null)
+    resolver.write_meta_string_bytes(buffer, metabytes_null)
+    assert resolver.read_meta_string_bytes(buffer) == metabytes_null
+
+    # Test 4: Chinese string
+    metastr_cn = encoder.encode("你好,世界")
+    metabytes_cn = resolver.get_metastr_bytes(metastr_cn)
+    resolver.write_meta_string_bytes(buffer, metabytes_cn)
+    assert resolver.read_meta_string_bytes(buffer) == metabytes_cn
+
+    # Test 5: Japanese string (more than 16 bytes, triggers hash-based 
encoding)
+    metastr_jp = encoder.encode("こんにちは世界")
+    metabytes_jp = resolver.get_metastr_bytes(metastr_jp)
+    resolver.write_meta_string_bytes(buffer, metabytes_jp)
+    assert resolver.read_meta_string_bytes(buffer) == metabytes_jp
+
+    # Test 6: Long string (more than 16 bytes, triggers hash-based encoding)
+    long_str = "hello, world" * 10
+    metastr_long = encoder.encode(long_str)
+    metabytes_long = resolver.get_metastr_bytes(metastr_long)
+    resolver.write_meta_string_bytes(buffer, metabytes_long)
+    assert resolver.read_meta_string_bytes(buffer) == metabytes_long


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to