Re: [DISCUSS] UTF-8 validation for string SerDe across sketches

tison Sat, 14 Feb 2026 01:47:46 -0800

This PR [1] of datasketches-rust demonstrates how the Rust impl
deserializes String values.


[1] https://github.com/apache/datasketches-rust/pull/82

If the target type is std::string::String, the bytes must be valid UTF-8,
so we check the encoding on deserialization.
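
As a rough illustration of that check (a minimal sketch, not the code from
the PR; the helper name deserialize_string is hypothetical), the standard
library's String::from_utf8 validates the bytes and returns an error when
they are not well-formed UTF-8:

```rust
use std::string::FromUtf8Error;

/// Hypothetical helper for illustration only; not the datasketches-rust API.
/// Turns raw serialized bytes into a String, failing on invalid UTF-8.
fn deserialize_string(bytes: &[u8]) -> Result<String, FromUtf8Error> {
    // String::from_utf8 validates the whole buffer before building the String.
    String::from_utf8(bytes.to_vec())
}

fn main() {
    // Well-formed UTF-8 (including multi-byte characters) is accepted.
    let good: &[u8] = "héllo".as_bytes();
    assert_eq!(deserialize_string(good).unwrap(), "héllo");

    // The byte 0xFF never appears in well-formed UTF-8, so this is rejected.
    let bad: [u8; 3] = [0x66, 0x6f, 0xff];
    assert!(deserialize_string(&bad).is_err());
}
```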

However, the Rust ecosystem also supports "strings" that do not use UTF-8,
such as BStr.

So, my opinions are:

1. It's reasonable to assume that serialized string data is valid UTF-8.
2. Even when it isn't, datasketches-rust users should be able to
deserialize the bytes into a type that doesn't require UTF-8 encoding.
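
To make point 2 concrete, here is a minimal sketch under stated assumptions:
the trait FromSerializedBytes is hypothetical and is not the
datasketches-rust API, and Vec<u8> stands in for a byte-string type such as
BStr so that the example needs only the standard library. The caller picks
the target type, and only the String implementation validates UTF-8:

```rust
use std::convert::Infallible;
use std::string::FromUtf8Error;

/// Hypothetical trait for illustration only; not the datasketches-rust API.
trait FromSerializedBytes: Sized {
    type Error;
    fn from_serialized(bytes: &[u8]) -> Result<Self, Self::Error>;
}

/// String requires valid UTF-8, so deserialization can fail.
impl FromSerializedBytes for String {
    type Error = FromUtf8Error;
    fn from_serialized(bytes: &[u8]) -> Result<Self, Self::Error> {
        String::from_utf8(bytes.to_vec())
    }
}

/// Vec<u8> accepts arbitrary bytes, so deserialization cannot fail.
impl FromSerializedBytes for Vec<u8> {
    type Error = Infallible;
    fn from_serialized(bytes: &[u8]) -> Result<Self, Self::Error> {
        Ok(bytes.to_vec())
    }
}

fn main() {
    // "fo" followed by a byte that is invalid in UTF-8.
    let raw: [u8; 3] = [0x66, 0x6f, 0xff];

    // The String target rejects the data ...
    assert!(String::from_serialized(&raw).is_err());
    // ... while the byte-oriented target keeps it unchanged.
    assert_eq!(Vec::<u8>::from_serialized(&raw).unwrap(), raw.to_vec());
}
```

With this shape, the UTF-8 requirement follows from the type the user asks
for instead of being applied to every sketch unconditionally.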

Best,
tison.


Hyeonho Kim <[email protected]> wrote on Sat, Feb 14, 2026 at 17:24:

> Hi all,
>
> While working on UTF-8 validation for the AoS tuple sketch in C++ (ref:
> https://github.com/apache/datasketches-cpp/pull/476), a broader design
> question came up that may affect multiple sketches.
>
> Based on my current understanding:
>
> - In datasketches-java, string serialization already produces valid UTF-8
> bytes via getBytes(StandardCharsets.UTF_8). So Java-generated artifacts
> already assume valid UTF-8 string encoding.
> - Rust and Python string types represent Unicode text and can be encoded
> to UTF-8. Please correct me if I am mistaken. (I don't know Rust and Python
> well)
> - In Go, string is a byte sequence and may contain invalid UTF-8 unless
> explicitly validated. So during serialization, it may produce invalid UTF-8
> sequences.
> - In C++, std::string is also a byte container and does not enforce UTF-8
> validity. So during serialization, it may produce invalid UTF-8 sequences.
>
> If I am mistaken on any of these points, I would appreciate corrections.
>
> If we want to maintain cross-language portability for serialized
> artifacts, one possible approach would be to ensure that any serialized
> string data is valid UTF-8. This could potentially apply to any sketches
> that serialize or deserialize string data.
>
> There seem to be several possible approaches:
> - Validate UTF-8 at serialization boundaries
> - Document that input strings must be valid UTF-8 and rely on caller
> discipline
>
> At this point I am not proposing a specific solution. I would like to hear
> opinions from the community on whether we want to require serialized string
> data to be valid UTF-8 for cross-language portability.
>
> Thanks,
>
> Hyeonho
>
