This issue, raised by Hyeonho Kim, concerns sketches that a user can update with strings and that retain a sample of those input strings internally. When such a sketch is serialized, there is an implicit assumption that another user, possibly working in a different language, can successfully deserialize the resulting sketch image. These sketches include KLL, REQ, Classic Quantiles, Sampling, FrequentItems, and Tuple. We informally call them "container" sketches because they contain actual samples from the input stream. HLL, Theta, CPC, BloomFilter, etc., are not container sketches.

In the DS-Java library, all container sketches that accept strings always encode them as UTF-8, so the sketch images they produce contain proper UTF-8 sequences.

In the DS-CPP library, the various data types are abstracted via templates. The serialization operation is declared similar to *sketch<T>::serialize(std::ostream& os, const SerDe& sd)*, where *T* is the item type, *os* is the output stream, and *sd* is the SerDe that performs the conversion to bytes. If the user wants to use string items, *T* would typically be *std::string*, which is just a blob of bytes with no requirement that it be UTF-8.

So far, we have trusted users of the library to know that if they update one of these container classes with a type *T*, the downstream user can successfully decode it. But a wrong assumption here could be catastrophic: a downstream user of a sketch image could be separated from the creation of that image by years and be using a different language.

One of the big advantages of our DataSketches project is that our serialization images should be language and platform independent, allowing cross-language and cross-platform interchange of sketches.

Hyeonho Kim's recommendation makes sense: for serialized sketch images that contain strings, those strings must be UTF-8. So how do we implement that? My thoughts are as follows:

1. We should document now, on the website and in appropriate places in the library, the potential danger of not using UTF-8 strings (at least until we have a more robust solution).
2. I think implementing validation checks on UTF-8 strings at the SerDe boundaries may be too late. A user could have processed a large stream of data only to discover a failure at serialization time, which could be much later. The other possibility would be to validate the strings at the input into the sketch, typically in the *update()* method.
3. For C++, there are 3rd party libraries that specialize in UTF-8 validation, including ICU <https://github.com/unicode-org/icu>, UTF8-CPP <https://github.com/nemtrif/utfcpp> and simdjson <https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/>. (These have standard open-source licenses.) From what I've read, UTF-8 validation, if done correctly, can be very fast and needs only a small section of code.
4. I am not sure what the solutions are for Rust or Go.

I welcome your feedback.

On Sat, Feb 14, 2026 at 1:47 AM tison wrote:

> This PR [1] of datasketches-rust demonstrates how the Rust impl
> deserializes String values.
>
> [1] https://github.com/apache/datasketches-rust/pull/82
>
> If it's std::string::String, then it must be UTF-8 encoded. And we
> check the encoding on deserialization.
>
> However, the Rust ecosystem also supports "strings" that do not use UTF-8,
> such as BStr.
>
> So, my opinions are:
>
> 1. It's good to assume serialized string data to be valid UTF-8.
> 2. Even if it isn't, for datasketches-rust, users should be able to choose
>    a proper type to deserialize the bytes into a type that doesn't require
>    UTF-8 encoding.
>
> Best,
> tison.
>
>
> Hyeonho Kim wrote on Sat, Feb 14, 2026 at 17:24:
>
>> Hi all,
>>
>> While working on UTF-8 validation for the AoS tuple sketch in C++ (ref:
>> https://github.com/apache/datasketches-cpp/pull/476), a broader design
>> question came up that may affect multiple sketches.
>>
>> Based on my current understanding:
>>
>> - In datasketches-java, string serialization already produces valid UTF-8
>>   bytes via getBytes(StandardCharsets.UTF_8). So Java-generated artifacts
>>   already assume valid UTF-8 string encoding.
>> - Rust and Python string types represent Unicode text and can be encoded
>>   to UTF-8. Please correct me if I am mistaken. (I don't know Rust and
>>   Python well.)
>> - In Go, string is a byte sequence and may contain invalid UTF-8 unless
>>   explicitly validated.
>>   So during serialization, it may produce invalid UTF-8 sequences.
>> - In C++, std::string is also a byte container and does not enforce UTF-8
>>   validity. So during serialization, it may produce invalid UTF-8 sequences.
>>
>> If I am mistaken on any of these points, I would appreciate corrections.
>>
>> If we want to maintain cross-language portability for serialized
>> artifacts, one possible approach would be to ensure that any serialized
>> string data is valid UTF-8. This could potentially apply to any sketches
>> that serialize or deserialize string data.
>>
>> There seem to be several possible approaches:
>> - Validate UTF-8 at serialization boundaries
>> - Document that input strings must be valid UTF-8 and rely on caller
>>   discipline
>>
>> At this point I am not proposing a specific solution. I would like to
>> hear opinions from the community on whether we want to require serialized
>> string data to be valid UTF-8 for cross-language portability.
>>
>> Thanks,
>>
>> Hyeonho
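
To make the "validate at update() time" option from the thread more concrete, here is a minimal C++ sketch of what such a check might look like. It is only an illustration under stated assumptions, not code from DS-CPP: the names is_valid_utf8 and checked_update are hypothetical, and the wrapper simply assumes the sketch type exposes an update(const std::string&) method. The validator follows the well-formed byte ranges of RFC 3629 and rejects overlong forms, surrogates, and code points above U+10FFFF; a production version would more likely delegate to one of the libraries mentioned above.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if s is well-formed UTF-8 (RFC 3629).
inline bool is_valid_utf8(const std::string& s) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b0 = p[i];
    if (b0 <= 0x7F) { ++i; continue; }   // single-byte ASCII
    std::size_t len = 0;                 // total bytes in this sequence
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the 2nd byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
      len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
      len = 3;
      if (b0 == 0xE0) lo = 0xA0;         // reject overlong 3-byte forms
      if (b0 == 0xED) hi = 0x9F;         // reject UTF-16 surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
      len = 4;
      if (b0 == 0xF0) lo = 0x90;         // reject overlong 4-byte forms
      if (b0 == 0xF4) hi = 0x8F;         // reject values above U+10FFFF
    } else {
      return false;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
    }
    if (n - i < len) return false;       // truncated sequence
    if (p[i + 1] < lo || p[i + 1] > hi) return false;
    for (std::size_t k = 2; k < len; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false;  // not a continuation byte
    }
    i += len;
  }
  return true;
}

// Hypothetical wrapper (not a DataSketches API): validates the item before
// forwarding it to any sketch-like type with update(const std::string&).
template <typename Sketch>
void checked_update(Sketch& sketch, const std::string& item) {
  if (!is_valid_utf8(item)) {
    throw std::invalid_argument("item is not valid UTF-8");
  }
  sketch.update(item);
}
```

With a check like this in front of update(), an invalid string is rejected when it is first seen rather than much later at serialization time. The cost is one linear pass over each input item.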
