Re: [E] Re: [DISCUSS] UTF-8 validation for string SerDe across sketches

Lee Rhodes Sat, 07 Mar 2026 21:47:36 -0800

This has been a helpful discussion.  My thinking about this has also
changed, but for a different reason.


My proposal to have an encoding standard for strings came from a (noble?)
desire to help protect our users from footguns.

However, ensuring compatibility between any two sketches that have been
independently loaded is a much deeper can-of-worms than we have discussed
here:

   - Imagine a merge of two sketches inadvertently fed strings using
   different character encodings. It doesn't matter if the sketches originated
   from different programming languages or not.
   - Converting a string to a hash doesn't change this.  This means
   virtually all of our sketches could be vulnerable to this user mistake and
   not just our container sketches.
   - Natural numeric instability of doubles could also create similar
   silent failures if the user is not careful.
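
As a rough illustration of the first bullet, here is a minimal C++ sketch.
It uses a plain std::map as a toy stand-in for a container sketch; the names
counts, merge_counts, cafe_utf8 and cafe_latin1 are made up for this example
and are not part of any DataSketches API. The same word, encoded once as
UTF-8 and once as ISO-8859-1, ends up as two distinct items after a merge,
and nothing fails:

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy exact counter standing in for a sketch that retains string items.
using counts = std::map<std::string, int>;

// Merge b into a copy of a by adding the counts of byte-identical keys.
counts merge_counts(counts a, const counts& b) {
  for (const auto& kv : b) a[kv.first] += kv.second;
  return a;
}

// The word "cafe" with an acute accent on the e, encoded two ways:
// the last character is 0xC3 0xA9 in UTF-8 and the single byte 0xE9
// in ISO-8859-1 (Latin-1).
const std::string cafe_utf8 = "caf\xC3\xA9";
const std::string cafe_latin1 = "caf\xE9";

// One logical word, but the merged result silently holds two items.
void demo() {
  counts a;
  a[cafe_utf8] = 3;
  counts b;
  b[cafe_latin1] = 2;
  const counts merged = merge_counts(a, b);
  assert(merged.size() == 2);
}
```

Hashing the bytes first would not help, since the two byte sequences differ
before any hash is applied.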

I don't think that there is any way we can programmatically protect our
users from all of these possible mistakes.

Having said that, providing some useful tools that could help the user
validate UTF-8 strings might be useful. It won't protect against all of the
potential user mistakes of this type, just perhaps some common ones.
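
To give a sense of how small such a tool could be, below is a minimal,
dependency-free C++ sketch of a UTF-8 well-formedness check. The function
name is_valid_utf8 is an assumption made for this example, not an existing
library function, and the code is meant as an illustration rather than a
tuned implementation. It follows the well-formed byte ranges from the
Unicode standard: no overlong forms, no surrogates, nothing above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if s is well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b0 = p[i];
    if (b0 <= 0x7F) {  // ASCII, one byte
      ++i;
      continue;
    }
    std::size_t len = 0;                 // total bytes in this sequence
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the 2nd byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
      len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
      len = 3;
      if (b0 == 0xE0) lo = 0xA0;  // reject overlong 3-byte forms
      if (b0 == 0xED) hi = 0x9F;  // reject surrogates U+D800..U+DFFF
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
      len = 4;
      if (b0 == 0xF0) lo = 0x90;  // reject overlong 4-byte forms
      if (b0 == 0xF4) hi = 0x8F;  // reject code points above U+10FFFF
    } else {
      return false;  // 0x80..0xC1 and 0xF5..0xFF can never lead
    }
    if (n - i < len) return false;  // truncated sequence
    if (p[i + 1] < lo || p[i + 1] > hi) return false;
    for (std::size_t k = 2; k < len; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false;  // must be 10xxxxxx
    }
    i += len;
  }
  return true;
}
```

A check like this only says the bytes are well-formed UTF-8. It cannot tell
whether the text was originally in another encoding that happens to produce
valid UTF-8 bytes, which is why it catches only some of these mistakes.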

But if we decide not to do anything programmatic, we could at least provide
sufficient warnings in the documentation about these possible, and
easy-to-make, pitfalls.  We don't have to do this right away, but as the various
libraries move to new versions, this kind of documentation should be on the
list to add.





On Sat, Mar 7, 2026 at 2:57 AM Hyeonho Kim <[email protected]> wrote:

> Thanks.
>
> After thinking more about it and reviewing the C++ and Go code more
> closely, my view has changed.
>
> I now think that changing the serialization format just to preserve UTF-8
> validation behavior for C++ and Go would be too heavy. If we do not change
> the serialization format, then we cannot fully preserve behavioral
> consistency across serialization/deserialization anyway.
>
> At the same time, I do not think we should ignore language-independent
> sketch images for string-containing sketches.
> So my current view is that we should keep the sketch format unchanged and
> leave `update()` behavior unchanged.
>
> If possible, we could provide an explicit portability path through UTF-8
> validating SerDe choices.
> If that is not desirable, then at minimum I think we should document this
> point clearly. In particular, I think we should document clearly that
> cross-language portability for string-containing sketches depends on using
> valid UTF-8.
>
>
> On Sat, Mar 7, 2026 at 4:47 PM Alexander Saydakov via dev <
> [email protected]> wrote:
>
>> I would reiterate that in my view sketches should not care about
>> validation.
>> If the user desires validation, he can instantiate, say,
>> frequent_items_sketch<utf8_string> instead of
>> frequent_items_sketch<std::string>.
>> utf8_string should perform validation.
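
For illustration only, a validate-in-the-constructor wrapper along these
lines might look like the C++ sketch below. All names here (checked_string,
is_ascii, ascii_string) are hypothetical. To keep the example short and
self-contained, the validator is a template parameter and an ASCII-only
predicate stands in for it; a real utf8_string would plug in a full UTF-8
check instead.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical wrapper: validates once in the constructor, then exposes
// the bytes as a read-only std::string.
template <bool (*IsValid)(const std::string&)>
class checked_string {
 public:
  checked_string(std::string s) : value_(std::move(s)) {
    if (!IsValid(value_)) {
      throw std::invalid_argument("string failed validation");
    }
  }
  const std::string& str() const { return value_; }
  operator const std::string&() const { return value_; }
  friend bool operator==(const checked_string& a, const checked_string& b) {
    return a.value_ == b.value_;
  }

 private:
  std::string value_;
};

// Stand-in predicate (ASCII only) so that the example compiles on its own.
inline bool is_ascii(const std::string& s) {
  for (unsigned char c : s) {
    if (c > 0x7F) return false;
  }
  return true;
}

using ascii_string = checked_string<is_ascii>;
```

With this, constructing ascii_string("hello") succeeds and
ascii_string("caf\xE9") throws std::invalid_argument. Whether the sketch
templates would accept such a type as-is is not confirmed here; it would
presumably also need hashing, ordering and a matching SerDe.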
>>
>> On Fri, Mar 6, 2026 at 10:17 PM Hyeonho Kim <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I realized there is one more design point that may need discussion.
>>>
>>> For sketches that validate UTF-8 at update() time by default, with an
>>> explicit opt-out, that setting affects the behavior of future update()
>>> calls even after deserialization.
>>>
>>> So there seems to be a broader design choice here for string-specific
>>> sketches / update APIs:
>>>
>>>    1. Treat the UTF-8 validation setting as part of the serialized sketch
>>>    state, so it is preserved across serialization/deserialization.
>>>    2. Treat it as a runtime policy only, in which case it would need to be
>>>    specified again after deserialization (or when constructing a new
>>>    sketch).
>>>
>>> The first option would preserve behavioral consistency, so it seems like
>>> the more semantically consistent choice. However, it also seems like a much
>>> bigger decision in practice, since it would require a serialization format
>>> change / versioning.
>>>
>>> The second option avoids changing the serialized format, but a
>>> deserialized sketch may not behave exactly the same for future update()
>>> calls unless the caller explicitly restores the same policy.
>>>
>>> What do others think?
>>>
>>> On Wed, Mar 4, 2026 at 5:30 AM Lee Rhodes <[email protected]> wrote:
>>>
>>>> I agree. Here is a proposed wording that is a sort of a "policy" way to
>>>> think about this:
>>>>
>>>> For "container" type sketches that can potentially retain Strings:
>>>>
>>>>    - If a sketch has the word "string" as part of its name, then UTF-8
>>>>    validation at update() should be the default with an explicit
>>>>    opt-out.  Example: ArrayOfStringsTupleSketch.
>>>>    - If an update method to a sketch has an explicit "string"
>>>>    parameter, then UTF-8 validation should be the default with an explicit
>>>>    opt-out.  Example: FdtSketch::update(String[]).
>>>>    - Otherwise, if a sketch or update method accepts just a generic
>>>>    type T, then we will provide a UTF-8 validating "SerDe" object that can
>>>>    optionally be used for type T.
>>>>
>>>>
>>>>
>>>> On Tue, Mar 3, 2026 at 7:32 AM Hyeonho Kim <[email protected]> wrote:
>>>>
>>>>> Hi all!
>>>>>
>>>>> Unless there are objections, I propose the following:
>>>>>
>>>>>    1. Introduce an opt-in UTF-8 validating SerDe for std::string
>>>>>    (validation OFF by default).
>>>>>    2. For AoS string items, enable UTF-8 validation at update() by
>>>>>    default, with an explicit opt-out.
>>>>>
>>>>> If this direction looks reasonable, I will proceed accordingly in the
>>>>> AoS PR and follow up with a separate PR for the SerDe option.
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Hyeonho
>>>>>
>>>>> On Fri, Feb 20, 2026 at 11:59 PM Hyeonho Kim <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks all for the feedback.
>>>>>>
>>>>>>
>>>>>> We can preserve backward compatibility for existing C++ users while
>>>>>> also providing a clear path for cross-language portability.
>>>>>>
>>>>>> What do you think about the following approach?
>>>>>>
>>>>>> - SerDe with string: Add an option to validate whether the string
>>>>>> contains valid UTF-8 sequences. The default would be validation OFF to
>>>>>> preserve existing compatibility.
>>>>>>
>>>>>> - AoS tuple sketch: Validate UTF-8 in the update method (fail-fast).
>>>>>> Validation would be enabled by default, with an explicit opt-out for
>>>>>> users who want it.
>>>>>>
>>>>>>
>>>>>> For DS-Go, we can follow the same policy as C++.
>>>>>>
>>>>>>
>>>>>> Feedback is welcome.
>>>>>>
>>>>>> On Wed, Feb 18, 2026 at 3:24 AM Jon Malkin <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Gonna agree with Alexander here. I think we should provide a serde
>>>>>>> option for c++, but that we should not reject non-UTF-8 strings.
>>>>>>>
>>>>>>> That wouldn’t just be an API-breaking change. It would break
>>>>>>> compatibility of c++ with itself for anyone who doesn’t need language
>>>>>>> portability.
>>>>>>>
>>>>>>> A separate utf8_serde option gets my vote.
>>>>>>>
>>>>>>>   jon
>>>>>>>
>>>>>>> On Tue, Feb 17, 2026 at 10:12 AM Alexander Saydakov via dev <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Regarding C++, I would think that the easiest approach is to
>>>>>>>> instruct the user to use a UTF8-validating string substitute instead of
>>>>>>>> std::string.
>>>>>>>> I am not sure whether we should provide such a thing or let the
>>>>>>>> user come up with their own implementation.
>>>>>>>> Consider having a utf8_string that would validate the input in the
>>>>>>>> constructor but is otherwise identical to std::string.
>>>>>>>> So the user can instantiate, for example,
>>>>>>>> frequent_items_sketch<utf8_string> instead of
>>>>>>>> frequent_items_sketch<std::string> if validation is necessary.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sun, Feb 15, 2026 at 8:38 PM Hyeonho Kim <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Thanks for the feedback. I agree that for container sketches that
>>>>>>>>> retain and serialize strings, we should validate that string
>>>>>>>>> payloads are valid UTF-8 sequences to preserve cross-language
>>>>>>>>> portability.
>>>>>>>>>
>>>>>>>>> On *where* to validate in DS-CPP: validating at update() (ingest
>>>>>>>>> time) is attractive because it is fail-fast, but it also adds
>>>>>>>>> additional cost on the hot path. If the community is comfortable
>>>>>>>>> with that overhead for string-based container sketches, I’m happy
>>>>>>>>> to pursue the update()-time validation approach.
>>>>>>>>>
>>>>>>>>> If performance sensitivity is a concern, an alternative would be
>>>>>>>>> to always validate at (de)serialization boundaries (to guarantee
>>>>>>>>> artifact correctness), and optionally provide a “fail-fast” mode
>>>>>>>>> that enables validation at update() as well.
>>>>>>>>>
>>>>>>>>> For DS-Go, we can follow the same policy. Go’s situation is a bit
>>>>>>>>> simpler in implementation because it provides UTF-8 validation in the
>>>>>>>>> standard library (unicode/utf8), so we wouldn’t need an external
>>>>>>>>> dependency for the validator.
>>>>>>>>>
>>>>>>>>> On Mon, Feb 16, 2026 at 6:29 AM Lee Rhodes <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> This issue, raised by Hyeonho Kim, relates to sketches that allow
>>>>>>>>>> a user to update the sketch with a string and the sketch also retains
>>>>>>>>>> within the sketch a sample of the input strings seen. When
>>>>>>>>>> serialized, there is an implicit assumption that another user,
>>>>>>>>>> possibly in a different language, can successfully deserialize
>>>>>>>>>> those sketch images. These sketches include KLL, REQ, Classic
>>>>>>>>>> Quantiles, Sampling, FrequentItems, and Tuple. We informally call
>>>>>>>>>> these "container" sketches, because they contain actual samples
>>>>>>>>>> from the input stream.  HLL, Theta, CPC, BloomFilter, etc., are
>>>>>>>>>> not container sketches.
>>>>>>>>>>
>>>>>>>>>> In the DS-Java library, all container sketches that allow strings
>>>>>>>>>> always use UTF_8. So the sketch images produced will contain
>>>>>>>>>> proper UTF_8 sequences.
>>>>>>>>>>
>>>>>>>>>> In the DS-CPP library, all the various data types are abstracted
>>>>>>>>>> via templates. The serialization operation is declared similar to
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> *sketch<T>::serialize(std::ostream& os, const SerDe& sd)*
>>>>>>>>>>
>>>>>>>>>> where *T* is the item type, *os* is the output stream, and *sd* is
>>>>>>>>>> the SerDe that performs the conversion to bytes.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> If the user wants to use an item of type string, *T* would
>>>>>>>>>> typically be of type *std::string*, which is just a blob of
>>>>>>>>>> bytes with no requirement that it be UTF_8.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> So far, we have trusted users of the library to know that if they
>>>>>>>>>> update one of these container classes with a type *T*, the
>>>>>>>>>> downstream user can successfully decode it. But this could be
>>>>>>>>>> catastrophic:  A downstream user of a sketch image could be
>>>>>>>>>> separated from the creation of the sketch image by years and be
>>>>>>>>>> using a different language.
>>>>>>>>>>
>>>>>>>>>> One of the big advantages of our DataSketches project is that our
>>>>>>>>>> serialization images should be language and platform independent,
>>>>>>>>>> allowing cross-language and cross-platform interchange of sketches.
>>>>>>>>>>
>>>>>>>>>> Hyeonho Kim's recommendation makes sense: For serialized sketch
>>>>>>>>>> images that contain strings, those strings must be UTF_8.
>>>>>>>>>>
>>>>>>>>>> So how do we implement that?  My thoughts are as follows:
>>>>>>>>>>
>>>>>>>>>>    1. We should document now, on the website and in appropriate
>>>>>>>>>>    places in the library, the potential danger of not using UTF_8
>>>>>>>>>>    strings (at least until we have a more robust solution).
>>>>>>>>>>    2. I think implementing validation checks on UTF_8 strings at
>>>>>>>>>>    the SerDe boundaries may be too late.  A user could have
>>>>>>>>>>    processed a large stream of data only to discover a failure at
>>>>>>>>>>    serialization time, which could be much later.  The other
>>>>>>>>>>    possibility would be to validate the strings at the input into
>>>>>>>>>>    the sketch, typically in the *update()* method.
>>>>>>>>>>    3. For C++, there are 3rd party libraries that specialize in
>>>>>>>>>>    UTF_8 validation, including ICU
>>>>>>>>>>    <https://github.com/unicode-org/icu>, UTF8-CPP
>>>>>>>>>>    <https://github.com/nemtrif/utfcpp> and simdjson
>>>>>>>>>>    <https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/>.
>>>>>>>>>>    (These have standard licensing.) From what I've read, UTF-8
>>>>>>>>>>    validation, if done correctly, can be done very fast, with only
>>>>>>>>>>    a small section of code.
>>>>>>>>>>    4. I am not sure what the solutions are for Rust or Go.
>>>>>>>>>>
>>>>>>>>>> I welcome your feedback.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Feb 14, 2026 at 1:47 AM tison <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> This PR [1] of datasketches-rust demonstrates how the Rust impl
>>>>>>>>>>> deserializes String values.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://github.com/apache/datasketches-rust/pull/82
>>>>>>>>>>>
>>>>>>>>>>> If it's std::string::String, then it must be of UTF-8 encoding.
>>>>>>>>>>> And we check the encoding on deserialization.
>>>>>>>>>>>
>>>>>>>>>>> However, the Rust ecosystem also supports "strings" that do not
>>>>>>>>>>> use UTF-8, such as BStr.
>>>>>>>>>>>
>>>>>>>>>>> So, my opinions are:
>>>>>>>>>>>
>>>>>>>>>>> 1. It's good to assume serialized string data to be valid UTF-8.
>>>>>>>>>>> 2. Even if it isn't, for datasketches-rust, users should be able
>>>>>>>>>>> to choose a proper type to deserialize the bytes into a type that
>>>>>>>>>>> doesn't require UTF-8 encoding.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> tison.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Feb 14, 2026 at 5:24 PM Hyeonho Kim <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>
>>>>>>>>>>>> While working on UTF-8 validation for the AoS tuple sketch in
>>>>>>>>>>>> C++ (ref: https://github.com/apache/datasketches-cpp/pull/476),
>>>>>>>>>>>> a broader design question came up that may affect multiple
>>>>>>>>>>>> sketches.
>>>>>>>>>>>>
>>>>>>>>>>>> Based on my current understanding:
>>>>>>>>>>>>
>>>>>>>>>>>> - In datasketches-java, string serialization already produces
>>>>>>>>>>>> valid UTF-8 bytes via getBytes(StandardCharsets.UTF_8). So
>>>>>>>>>>>> Java-generated artifacts already assume valid UTF-8 string
>>>>>>>>>>>> encoding.
>>>>>>>>>>>> - Rust and Python string types represent Unicode text and can
>>>>>>>>>>>> be encoded to UTF-8. Please correct me if I am mistaken. (I don't
>>>>>>>>>>>> know Rust and Python well.)
>>>>>>>>>>>> - In Go, string is a byte sequence and may contain invalid
>>>>>>>>>>>> UTF-8 unless explicitly validated. So during serialization, it may
>>>>>>>>>>>> produce invalid UTF-8 sequences.
>>>>>>>>>>>> - In C++, std::string is also a byte container and does not
>>>>>>>>>>>> enforce UTF-8 validity. So during serialization, it may produce
>>>>>>>>>>>> invalid UTF-8 sequences.
>>>>>>>>>>>>
>>>>>>>>>>>> If I am mistaken on any of these points, I would appreciate
>>>>>>>>>>>> corrections.
>>>>>>>>>>>>
>>>>>>>>>>>> If we want to maintain cross-language portability for
>>>>>>>>>>>> serialized artifacts, one possible approach would be to ensure
>>>>>>>>>>>> that any serialized string data is valid UTF-8. This could
>>>>>>>>>>>> potentially apply to any sketches that serialize or deserialize
>>>>>>>>>>>> string data.
>>>>>>>>>>>>
>>>>>>>>>>>> There seem to be several possible approaches:
>>>>>>>>>>>> - Validate UTF-8 at serialization boundaries
>>>>>>>>>>>> - Document that input strings must be valid UTF-8 and rely on
>>>>>>>>>>>> caller discipline
>>>>>>>>>>>>
>>>>>>>>>>>> At this point I am not proposing a specific solution. I would
>>>>>>>>>>>> like to hear opinions from the community on whether we want to
>>>>>>>>>>>> require serialized string data to be valid UTF-8 for
>>>>>>>>>>>> cross-language portability.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>
>>>>>>>>>>>> Hyeonho
>>>>>>>>>>>>
>>>>>>>>>>>
