This issue, raised by Hyeonho Kim, concerns sketches that a user can update
with strings and that retain a sample of those input strings internally.
When such a sketch is serialized, there is an implicit assumption that
another user, possibly working in a different language, can successfully
deserialize the resulting sketch image. These sketches include KLL, REQ,
Classic Quantiles, Sampling, FrequentItems, and Tuple. We informally call
these "container" sketches, because they contain actual samples from the
input stream. HLL, Theta, CPC, BloomFilter, etc., are not container
sketches.

In the DS-Java library, all container sketches that allow strings always
use UTF_8. So the sketch images produced will contain proper UTF_8
sequences.

In the DS-CPP library, all the various data types are abstracted via
templates. The serialization operation is declared similar to

    sketch<T>::serialize(std::ostream& os, const SerDe& sd)

where *T* is the item type, *os* is the output stream, and *sd* is the
SerDe that performs the conversion to bytes.


If the user wants to use an item of type string, *T* would typically be
*std::string*, which is just a blob of bytes with no requirement that it
be UTF_8.
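
To make this concrete, here is a minimal sketch of how non-UTF-8 bytes can
pass straight through a string SerDe. The names raw_string_serde and
serialized_bytes are hypothetical and are not the actual datasketches-cpp
classes; the 4-byte length prefix is also just an assumption for
illustration.

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical SerDe for illustration only; not the datasketches-cpp class.
struct raw_string_serde {
  // Writes a 4-byte length (native byte order) followed by the raw bytes.
  // Nothing here inspects the encoding of the item.
  void serialize(std::ostream& os, const std::string& item) const {
    const std::uint32_t len = static_cast<std::uint32_t>(item.size());
    os.write(reinterpret_cast<const char*>(&len), sizeof(len));
    os.write(item.data(), static_cast<std::streamsize>(item.size()));
  }
};

// Hypothetical helper: returns the bytes the SerDe emits for one item.
inline std::string serialized_bytes(const std::string& item) {
  std::ostringstream os;
  raw_string_serde().serialize(os, item);
  return os.str();
}
```

Calling serialized_bytes("caf\xE9") emits the Latin-1 byte 0xE9 unchanged,
so the image would contain a sequence that a strict UTF-8 decoder in another
language is expected to reject.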


So far, we have trusted users of the library to ensure that if they update
one of these container sketches with a type *T*, the downstream user can
successfully decode it. But a failure here could be catastrophic: a
downstream user of a sketch image could be separated from the creation of
that image by years and be using a different language.

One of the big advantages of our DataSketches project is that our
serialization images should be language and platform independent, allowing
cross-language and cross-platform interchange of sketches.

Hyeonho Kim's recommendation makes sense: For serialized sketch images that
contain strings, those strings must be UTF_8.

So how do we implement that?  My thoughts are as follows:

   1. We should document now, on the website and in appropriate places in
   the library, the potential danger of not using UTF_8 strings (at least
   until we have a more robust solution).
   2. I think implementing validation checks on UTF_8 strings at the SerDe
   boundaries may be too late. A user could have processed a large stream of
   data only to discover a failure at serialization time, which could be much
   later. The other possibility would be to validate the strings at the
   input into the sketch, typically in the *update()* method.
   3. For C++, there are 3rd party libraries that specialize in UTF_8
   validation, including ICU <https://github.com/unicode-org/icu>, UTF8-CPP
   <https://github.com/nemtrif/utfcpp> and simdjson
   <https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/>.
   (These have standard licensing.) From what I've read, UTF-8 validation, if
   done correctly, can be done very fast, with only a small section of code.
   4. I am not sure what the solutions are for Rust or Go.
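
As a rough illustration of points 2 and 3, validation at the update()
boundary could look something like the following. This is a plain scalar
check following the well-formed byte ranges of RFC 3629, not the SIMD
approach from the libraries above, and checked_string_sketch is a
hypothetical stand-in that only stores items; it is not a real DS-CPP
class.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Returns true if s is well-formed UTF-8: rejects stray continuation bytes,
// overlong forms, surrogates (U+D800..U+DFFF) and values above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    if (b0 <= 0x7F) { ++i; continue; }          // ASCII
    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;         // allowed range of 2nd byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
      len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
      len = 3;
      if (b0 == 0xE0) lo = 0xA0;                // no overlong 3-byte forms
      if (b0 == 0xED) hi = 0x9F;                // no surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
      len = 4;
      if (b0 == 0xF0) lo = 0x90;                // no overlong 4-byte forms
      if (b0 == 0xF4) hi = 0x8F;                // nothing above U+10FFFF
    } else {
      return false;                             // 0x80..0xC1 or 0xF5..0xFF
    }
    if (n - i < len) return false;              // truncated sequence
    const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
    if (b1 < lo || b1 > hi) return false;
    for (std::size_t k = 2; k < len; ++k) {
      const unsigned char b = static_cast<unsigned char>(s[i + k]);
      if (b < 0x80 || b > 0xBF) return false;
    }
    i += len;
  }
  return true;
}

// Hypothetical stand-in for a container sketch that retains input items.
class checked_string_sketch {
public:
  // Rejects bad input immediately instead of at serialization time.
  void update(const std::string& item) {
    if (!is_valid_utf8(item)) {
      throw std::invalid_argument("item is not valid UTF-8");
    }
    items_.push_back(item);
  }
  std::size_t size() const { return items_.size(); }
private:
  std::vector<std::string> items_;
};
```

With this shape, update("caf\xC3\xA9") is accepted while update("caf\xE9")
throws right away, so the failure surfaces at the offending input rather
than much later at serialization. The per-update cost is the trade-off to
weigh.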

I welcome your feedback.


On Sat, Feb 14, 2026 at 1:47 AM tison <[email protected]> wrote:

> This PR [1] of datasketches-rust demonstrates how the Rust impl
> deserializes String values.
>
> [1] https://github.com/apache/datasketches-rust/pull/82
>
> If it's std::string::String, then it must be of UTF-8 encoding. And we
> check the encoding on deserialization.
>
> However, the Rust ecosystem also supports "strings" that do not use UTF-8,
> such as BStr.
>
> So, my opinions are:
>
> 1. It's good to assume serialized string data to be valid UTF-8.
> 2. Even if it isn't, for datasketches-rust, users should be able to choose
> a proper type to deserialize the bytes into a type that doesn't require
> UTF-8 encoding.
>
> Best,
> tison.
>
>
> On Sat, Feb 14, 2026 at 17:24, Hyeonho Kim <[email protected]> wrote:
>
>> Hi all,
>>
>> While working on UTF-8 validation for the AoS tuple sketch in C++ (ref:
>> https://github.com/apache/datasketches-cpp/pull/476), a broader design
>> question came up that may affect multiple sketches.
>>
>> Based on my current understanding:
>>
>> - In datasketches-java, string serialization already produces valid UTF-8
>> bytes via getBytes(StandardCharsets.UTF_8). So Java-generated artifacts
>> already assume valid UTF-8 string encoding.
>> - Rust and Python string types represent Unicode text and can be encoded
>> to UTF-8. Please correct me if I am mistaken. (I don't know Rust and Python
>> well)
>> - In Go, string is a byte sequence and may contain invalid UTF-8 unless
>> explicitly validated. So during serialization, it may produce invalid UTF-8
>> sequences.
>> - In C++, std::string is also a byte container and does not enforce UTF-8
>> validity. So during serialization, it may produce invalid UTF-8 sequences.
>>
>> If I am mistaken on any of these points, I would appreciate corrections.
>>
>> If we want to maintain cross-language portability for serialized
>> artifacts, one possible approach would be to ensure that any serialized
>> string data is valid UTF-8. This could potentially apply to any sketches
>> that serialize or deserialize string data.
>>
>> There seem to be several possible approaches:
>> - Validate UTF-8 at serialization boundaries
>> - Document that input strings must be valid UTF-8 and rely on caller
>> discipline
>>
>> At this point I am not proposing a specific solution. I would like to
>> hear opinions from the community on whether we want to require serialized
>> string data to be valid UTF-8 for cross-language portability.
>>
>> Thanks,
>>
>> Hyeonho
>>
>
