Regarding C++, I would think that the easiest approach is to instruct the
user to use a UTF8-validating string substitute instead of std::string.
I am not sure whether we should provide such a thing or let users
come up with their own implementation.
Consider a utf8_string that validates its input in the
constructor but is otherwise identical to std::string.
The user can then instantiate, for example,
frequent_items_sketch<utf8_string> instead of
frequent_items_sketch<std::string> if validation is necessary.
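
To make the idea concrete, here is a minimal sketch of what such a type could look like. Everything in it is hypothetical: the names utf8_string and is_valid_utf8, the choice to throw std::invalid_argument, and the decision to wrap a std::string rather than derive from it are assumptions for illustration, not existing library API. The validator follows the well-formed byte ranges of the UTF-8 definition (no overlong forms, no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: true if s is well-formed UTF-8.
inline bool is_valid_utf8(const std::string& s) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b0 = p[i];
    if (b0 <= 0x7F) { ++i; continue; }          // ASCII
    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;         // allowed range of 2nd byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
      len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
      len = 3;
      if (b0 == 0xE0) lo = 0xA0;                // reject overlong 3-byte forms
      if (b0 == 0xED) hi = 0x9F;                // reject surrogates
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
      len = 4;
      if (b0 == 0xF0) lo = 0x90;                // reject overlong 4-byte forms
      if (b0 == 0xF4) hi = 0x8F;                // reject values above U+10FFFF
    } else {
      return false;                             // stray continuation or bad lead
    }
    if (n - i < len) return false;              // truncated sequence
    if (p[i + 1] < lo || p[i + 1] > hi) return false;
    for (std::size_t k = 2; k < len; ++k) {
      if (p[i + k] < 0x80 || p[i + k] > 0xBF) return false;
    }
    i += len;
  }
  return true;
}

// Hypothetical item type: validates once in the constructor and is
// immutable afterwards, so the UTF-8 invariant cannot be broken later.
class utf8_string {
public:
  utf8_string() = default;
  utf8_string(std::string s) : value_(std::move(s)) {
    if (!is_valid_utf8(value_)) throw std::invalid_argument("invalid UTF-8");
  }
  utf8_string(const char* s) : utf8_string(std::string(s)) {}
  const std::string& str() const { return value_; }
  bool operator==(const utf8_string& other) const { return value_ == other.value_; }
private:
  std::string value_;
};

// Hash support so the type can be used where std::hash<T> is the default.
namespace std {
template<> struct hash<utf8_string> {
  size_t operator()(const utf8_string& s) const noexcept {
    return hash<string>()(s.str());
  }
};
}
```

Wrapping instead of inheriting keeps mutating members such as append() out of reach, at the cost of not being a drop-in std::string. A sketch instantiated with this type would presumably also need a matching SerDe that reads and writes str(); that part is not shown here.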


On Sun, Feb 15, 2026 at 8:38 PM Hyeonho Kim <[email protected]> wrote:

> Thanks for the feedback. I agree that for container sketches that retain
> and serialize strings, we should validate that string payloads are valid
> UTF-8 sequences to preserve cross-language portability.
>
> On *where* to validate in DS-CPP: validating at update() (ingest time) is
> attractive because it is fail-fast, but it also adds additional cost on the
> hot path. If the community is comfortable with that overhead for
> string-based container sketches, I’m happy to pursue the update()-time
> validation approach.
>
> If performance sensitivity is a concern, an alternative would be to always
> validate at (de)serialization boundaries (to guarantee artifact
> correctness), and optionally provide a “fail-fast” mode that enables
> validation at update() as well.
>
> For DS-Go, we can follow the same policy. Go’s situation is a bit simpler
> in implementation because it provides UTF-8 validation in the standard
> library (unicode/utf8), so we wouldn’t need an external dependency for
> the validator.
>
> On Mon, Feb 16, 2026 at 6:29 AM Lee Rhodes <[email protected]> wrote:
>
>> This issue, raised by Hyeonho Kim, relates to sketches that allow a user
>> to update the sketch with a string and the sketch also retains within
>> the sketch a sample of the input strings seen. When serialized, there is an
>> implicit assumption that another user, possibly in a different language,
>> can successfully deserialize those sketch images. These sketches include KLL,
>> REQ, Classic Quantiles, Sampling, FrequentItems, and Tuple. We
>> informally call these "container" sketches, because they contain actual
>> samples from the input stream.  HLL, Theta, CPC, BloomFilter, etc., are not
>> container sketches.
>>
>> In the DS-Java library, all container sketches that allow strings always
>> use UTF_8. So the sketch images produced will contain proper UTF_8
>> sequences.
>>
>> In the DS-CPP library, all the various data types are abstracted via
>> templates. The serialization operation is declared similar to
>>
>>
>>     sketch<T>::serialize(std::ostream& os, const SerDe& sd)
>>
>> where *T* is the item type, *os* is the output stream, and *sd* is the
>> SerDe that performs the conversion to bytes.
>>
>>
>> If the user wants to use an item of type string, *T* would typically be
>> of type *std::string*, which is just a blob of bytes and no requirement
>> that it is UTF_8.
>>
>>
>> So far, we have trusted users of the library to know that if they update
>> one of these container classes with a type *T*, the downstream user
>> can successfully decode it. But this could be catastrophic: A downstream
>> user of a sketch image could be separated from the creation of the sketch
>> image by years and be using a different language.
>>
>> One of the big advantages of our DataSketches project is that our
>> serialization images should be language and platform independent, allowing
>> cross-language and cross platform interchange of sketches.
>>
>> Hyeonho Kim's recommendation makes sense: For serialized sketch images
>> that contain strings, those strings must be UTF_8.
>>
>> So how do we implement that?  My thoughts are as follows:
>>
>>    1. We should document now in the website and in appropriate places in
>>    the library the potential danger of not using UTF_8 strings. (At least
>>    until we have a more robust solution)
>>    2. I think implementing validation checks on UTF_8 strings at the
>>    SerDe boundaries may be too late.  A user could have processed a large
>>    stream of data only to discover a failure at serialization time, which
>>    could be much later in time.  The other possibility would be to validate
>>    the strings at the input into the sketch, typically in the *update() *
>>    method.
>>    3. For C++, there are 3rd party libraries that specialize in UTF_8
>>    validation, including ICU <https://github.com/unicode-org/icu>,
>>    UTF8-CPP <https://github.com/nemtrif/utfcpp> and simdjson
>>    <https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/>.
>>    (These have standard licensing). From what I've read, UTF-8 validation, if
>>    done correctly, can be done very fast, with only a small section of code.
>>    4. I am not sure what the solutions are for Rust or Go.
>>
>> I welcome your feedback.
>>
>>
>> On Sat, Feb 14, 2026 at 1:47 AM tison <[email protected]> wrote:
>>
>>> This PR [1] of datasketches-rust demonstrates how the Rust impl
>>> deserializes String values.
>>>
>>> [1] https://github.com/apache/datasketches-rust/pull/82
>>>
>>> If it's std::string::String, then it must be of UTF-8 encoding. And we
>>> check the encoding on deserialization.
>>>
>>> However, the Rust ecosystem also supports "strings" that do not use
>>> UTF-8, such as BStr.
>>>
>>> So, my opinions are:
>>>
>>> 1. It's good to assume serialized string data to be valid UTF-8.
>>> 2. Even if it isn't, for datasketches-rust, users should be able to
>>> choose a proper type to deserialize the bytes into a type that doesn't
>>> require UTF-8 encoding.
>>>
>>> Best,
>>> tison.
>>>
>>>
>>> On Sat, Feb 14, 2026 at 5:24 PM Hyeonho Kim <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> While working on UTF-8 validation for the AoS tuple sketch in C++ (ref:
>>>> https://github.com/apache/datasketches-cpp/pull/476),
>>>> a broader design question came up that may affect multiple sketches.
>>>>
>>>> Based on my current understanding:
>>>>
>>>> - In datasketches-java, string serialization already produces valid
>>>> UTF-8 bytes via getBytes(StandardCharsets.UTF_8). So Java-generated
>>>> artifacts already assume valid UTF-8 string encoding.
>>>> - Rust and Python string types represent Unicode text and can be
>>>> encoded to UTF-8. Please correct me if I am mistaken. (I don't know Rust
>>>> and Python well)
>>>> - In Go, string is a byte sequence and may contain invalid UTF-8 unless
>>>> explicitly validated. So during serialization, it may produce invalid UTF-8
>>>> sequences.
>>>> - In C++, std::string is also a byte container and does not enforce
>>>> UTF-8 validity. So during serialization, it may produce invalid UTF-8
>>>> sequences.
>>>>
>>>> If I am mistaken on any of these points, I would appreciate corrections.
>>>>
>>>> If we want to maintain cross-language portability for serialized
>>>> artifacts, one possible approach would be to ensure that any serialized
>>>> string data is valid UTF-8. This could potentially apply to any sketches
>>>> that serialize or deserialize string data.
>>>>
>>>> There seem to be several possible approaches:
>>>> - Validate UTF-8 at serialization boundaries
>>>> - Document that input strings must be valid UTF-8 and rely on caller
>>>> discipline
>>>>
>>>> At this point I am not proposing a specific solution. I would like to
>>>> hear opinions from the community on whether we want to require serialized
>>>> string data to be valid UTF-8 for cross-language portability.
>>>>
>>>> Thanks,
>>>>
>>>> Hyeonho
>>>>
>>>
