Re: [E] Re: [DISCUSS] UTF-8 validation for string SerDe across sketches

Hyeonho Kim Fri, 06 Mar 2026 22:17:36 -0800

Hi all,

I realized there is one more design point that may need discussion.


For sketches that validate UTF-8 at update() time by default, with an
explicit opt-out, that setting affects the behavior of future update()
calls even after deserialization.
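
To make the behavior concrete, here is a minimal sketch of update()-time
validation with an explicit opt-out. The names is_valid_utf8 and
toy_string_sketch are hypothetical and used only for illustration; they are
not DataSketches APIs, and a real implementation might use a faster
validator than this byte-by-byte loop.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: true if s is well-formed UTF-8 (RFC 3629).
// Rejects overlong forms, surrogates (U+D800..U+DFFF), and code points
// above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
  const auto* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b0 = p[i];
    if (b0 < 0x80) { ++i; continue; }  // ASCII fast path
    std::size_t len = 0;       // total bytes in this sequence
    std::uint32_t cp = 0;      // decoded code point
    std::uint32_t min_cp = 0;  // smallest code point allowed for this length
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else return false;  // stray continuation byte or invalid lead byte
    if (n - i < len) return false;  // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char b = p[i + k];
      if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min_cp) return false;                   // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
    if (cp > 0x10FFFF) return false;                 // out of Unicode range
    i += len;
  }
  return true;
}

// Hypothetical stand-in for a string-retaining sketch.
class toy_string_sketch {
 public:
  // Validation is ON by default; pass false to opt out.
  explicit toy_string_sketch(bool validate_utf8 = true)
      : validate_utf8_(validate_utf8) {}

  // Fail fast: reject invalid input before it is retained.
  void update(const std::string& item) {
    if (validate_utf8_ && !is_valid_utf8(item)) {
      throw std::invalid_argument("update(): item is not valid UTF-8");
    }
    items_.push_back(item);
  }

  std::size_t size() const { return items_.size(); }

 private:
  bool validate_utf8_;
  std::vector<std::string> items_;
};
```

With this shape, the flag lives in the object, so whatever value a
deserialized sketch ends up with decides how its later update() calls behave.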

So there seems to be a broader design choice here for string-specific
sketches / update APIs:

   1. Treat the UTF-8 validation setting as part of the serialized sketch
      state, so it is preserved across serialization/deserialization.
   2. Treat it as a runtime policy only, in which case it would need to be
      specified again after deserialization (or when constructing a new sketch).

The first option would preserve behavioral consistency, so it seems like
the more semantically consistent choice. However, it also seems like a much
bigger decision in practice, since it would require a serialization format
change / versioning.

The second option avoids changing the serialized format, but a deserialized
sketch may not behave exactly the same for future update() calls unless the
caller explicitly restores the same policy.
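
As a rough illustration of the trade-off, the toy code below contrasts the
two options. Everything here is hypothetical (toy_sketch, the one-byte flag,
the 1-byte length prefixes, and the function names); it does not reflect the
actual DataSketches preamble layout or API.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

using bytes = std::vector<std::uint8_t>;

// Hypothetical toy state: a validation policy plus retained items.
struct toy_sketch {
  bool validate_utf8 = true;
  std::vector<std::string> items;
};

constexpr std::uint8_t FLAG_VALIDATE_UTF8 = 0x01;  // hypothetical flag bit

// Shared body: item count, then length-prefixed items.
// One-byte counts/lengths keep the toy short (max 255).
inline void write_items(const toy_sketch& s, bytes& out) {
  out.push_back(static_cast<std::uint8_t>(s.items.size()));
  for (const auto& it : s.items) {
    out.push_back(static_cast<std::uint8_t>(it.size()));
    out.insert(out.end(), it.begin(), it.end());
  }
}

inline void read_items(const bytes& in, std::size_t pos, toy_sketch& s) {
  if (pos >= in.size()) throw std::out_of_range("truncated image");
  const std::size_t count = in[pos++];
  for (std::size_t i = 0; i < count; ++i) {
    if (pos >= in.size()) throw std::out_of_range("truncated image");
    const std::size_t len = in[pos++];
    if (in.size() - pos < len) throw std::out_of_range("truncated image");
    s.items.emplace_back(in.begin() + pos, in.begin() + pos + len);
    pos += len;
  }
}

// Option 1: the policy is part of the image (this changes the format).
inline bytes serialize_opt1(const toy_sketch& s) {
  bytes out;
  out.push_back(s.validate_utf8 ? FLAG_VALIDATE_UTF8 : std::uint8_t{0});
  write_items(s, out);
  return out;
}

inline toy_sketch deserialize_opt1(const bytes& in) {
  if (in.empty()) throw std::out_of_range("truncated image");
  toy_sketch s;
  s.validate_utf8 = (in[0] & FLAG_VALIDATE_UTF8) != 0;  // restored from image
  read_items(in, 1, s);
  return s;
}

// Option 2: the image is unchanged; the caller supplies the policy again.
inline bytes serialize_opt2(const toy_sketch& s) {
  bytes out;
  write_items(s, out);
  return out;
}

inline toy_sketch deserialize_opt2(const bytes& in, bool validate_utf8 = true) {
  toy_sketch s;
  s.validate_utf8 = validate_utf8;  // falls back to the default if omitted
  read_items(in, 0, s);
  return s;
}
```

In this toy, a sketch that opted out of validation keeps that choice under
option 1, while under option 2 it silently reverts to the default unless the
caller passes the policy again at deserialization.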

What do others think?

On Wed, Mar 4, 2026 at 5:30 AM Lee Rhodes <[email protected]> wrote:

> I agree. Here is a proposed wording that is a sort of a "policy" way to
> think about this:
>
> For "container" type sketches that can potentially retain Strings:
>
>    - If a sketch has the word "string" as part of its name, then UTF-8
>    validation at update() should be the default with an explicit
>    opt-out.  Example: ArrayOfStringsTupleSketch.
>    - If an update method to a sketch has an explicit "string" parameter,
>    then UTF-8 validation should be the default with an explicit opt-out.
>    Example: FdtSketch::update(String[]).
>    - Otherwise, if a sketch or update method accepts just a generic type
>    T, then we will provide a UTF-8 validating "SerDe" object that can be
>    optionally used for type T.
>
>
>
> On Tue, Mar 3, 2026 at 7:32 AM Hyeonho Kim <[email protected]> wrote:
>
>> Hi all!
>>
>> Unless there are objections, I propose the following:
>>
>>    1. Introduce an opt-in UTF-8 validating SerDe for std::string
>>       (validation OFF by default).
>>    2. For AoS string items, enable UTF-8 validation at update() by default,
>>       with an explicit opt-out.
>>
>> If this direction looks reasonable, I will proceed accordingly in the AoS
>> PR and follow up with a separate PR for the SerDe option.
>>
>>
>> Thanks,
>>
>> Hyeonho
>>
>> On Fri, Feb 20, 2026 at 11:59 PM Hyeonho Kim <[email protected]> wrote:
>>
>>> Thanks all for the feedback.
>>>
>>>
>>> We can preserve backward compatibility for existing C++ users while also
>>> providing a clear path for cross-language portability.
>>>
>>> What do you think about the following approach?
>>>
>>> - SerDe with string: Add an option to validate whether the string
>>> contains valid UTF-8 sequences. The default would be validation OFF to
>>> preserve existing compatibility.
>>>
>>> - AoS tuple sketch: Validate UTF-8 in the update method (fail-fast).
>>> Validation would be enabled by default, with an explicit opt-out for
>>> users who want it.
>>>
>>>
>>> For DS-Go, we can follow the same policy as C++.
>>>
>>>
>>> Feedback is welcome.
>>>
>>> On Wed, Feb 18, 2026 at 3:24 AM Jon Malkin <[email protected]> wrote:
>>>
>>>> Gonna agree with Alexander here. I think we should provide a serde
>>>> option for c++, but that we should not reject non-UTF-8 strings.
>>>>
>>>> That wouldn’t just be an API-breaking change. It would break
>>>> compatibility of c++ with itself for anyone who doesn’t need language
>>>> portability.
>>>>
>>>> A separate utf8_serde option gets my vote.
>>>>
>>>>   jon
>>>>
>>>> On Tue, Feb 17, 2026 at 10:12 AM Alexander Saydakov via dev <
>>>> [email protected]> wrote:
>>>>
>>>>> Regarding C++, I would think that the easiest approach is to instruct
>>>>> the user to use a UTF8-validating string substitute instead of
>>>>> std::string.
>>>>> I am not sure whether we should provide such a thing or let users
>>>>> come up with their own implementation.
>>>>> Consider having a utf8_string that would validate the input in the
>>>>> constructor but is otherwise identical to std::string.
>>>>> So the user can instantiate, for example,
>>>>> frequent_items_sketch<utf8_string> instead of
>>>>> frequent_items_sketch<std::string> if validation is necessary.
>>>>>
>>>>>
>>>>> On Sun, Feb 15, 2026 at 8:38 PM Hyeonho Kim <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks for the feedback. I agree that for container sketches that
>>>>>> retain and serialize strings, we should validate that string payloads are
>>>>>> valid UTF-8 sequences to preserve cross-language portability.
>>>>>>
>>>>>> On *where* to validate in DS-CPP: validating at update() (ingest
>>>>>> time) is attractive because it is fail-fast, but it also adds additional
>>>>>> cost on the hot path. If the community is comfortable with that overhead
>>>>>> for string-based container sketches, I’m happy to pursue the
>>>>>> update()-time validation approach.
>>>>>>
>>>>>> If performance sensitivity is a concern, an alternative would be to
>>>>>> always validate at (de)serialization boundaries (to guarantee artifact
>>>>>> correctness), and optionally provide a “fail-fast” mode that enables
>>>>>> validation at update() as well.
>>>>>>
>>>>>> For DS-Go, we can follow the same policy. Go’s situation is a bit
>>>>>> simpler in implementation because it provides UTF-8 validation in the
>>>>>> standard library (unicode/utf8), so we wouldn’t need an external
>>>>>> dependency for the validator.
>>>>>>
>>>>>> On Mon, Feb 16, 2026 at 6:29 AM Lee Rhodes <[email protected]> wrote:
>>>>>>
>>>>>>> This issue, raised by Hyeonho Kim, relates to sketches that allow a
>>>>>>> user to update the sketch with a string, where the sketch also retains
>>>>>>> a sample of the input strings seen. When serialized, there is an
>>>>>>> implicit assumption that another user, possibly in a different language,
>>>>>>> can successfully deserialize those sketch images. These sketches
>>>>>>> include KLL, REQ, Classic Quantiles, Sampling, FrequentItems, and Tuple.
>>>>>>> We informally call these "container" sketches, because they contain
>>>>>>> actual samples from the input stream. HLL, Theta, CPC, BloomFilter,
>>>>>>> etc., are not container sketches.
>>>>>>>
>>>>>>> In the DS-Java library, all container sketches that allow strings
>>>>>>> always use UTF_8. So the sketch images produced will contain proper
>>>>>>> UTF_8 sequences.
>>>>>>>
>>>>>>> In the DS-CPP library, all the various data types are abstracted via
>>>>>>> templates. The serialization operation is declared similar to
>>>>>>>
>>>>>>>
>>>>>>>     sketch<T>::serialize(std::ostream& os, const SerDe& sd)
>>>>>>>
>>>>>>> where T is the item type, os is the output stream, and sd is the
>>>>>>> SerDe that performs the conversion to bytes.
>>>>>>>
>>>>>>>
>>>>>>> If the user wants to use an item of type string, T would
>>>>>>> typically be of type std::string, which is just a blob of bytes
>>>>>>> with no requirement that it be valid UTF_8.
>>>>>>>
>>>>>>>
>>>>>>> So far, we have trusted users of the library to know that if they
>>>>>>> update one of these container classes with a type T, the
>>>>>>> downstream user can successfully decode it. But this could be
>>>>>>> catastrophic: a downstream user of a sketch image could be separated
>>>>>>> from the creation of the sketch image by years and be using a
>>>>>>> different language.
>>>>>>>
>>>>>>> One of the big advantages of our DataSketches project is that our
>>>>>>> serialization images should be language and platform independent,
>>>>>>> allowing cross-language and cross-platform interchange of sketches.
>>>>>>>
>>>>>>> Hyeonho Kim's recommendation makes sense: For serialized sketch
>>>>>>> images that contain strings, those strings must be UTF_8.
>>>>>>>
>>>>>>> So how do we implement that?  My thoughts are as follows:
>>>>>>>
>>>>>>>    1. We should document now, on the website and in appropriate
>>>>>>>    places in the library, the potential danger of not using UTF_8
>>>>>>>    strings (at least until we have a more robust solution).
>>>>>>>    2. I think implementing validation checks on UTF_8 strings at
>>>>>>>    the SerDe boundaries may be too late. A user could have processed a
>>>>>>>    large stream of data only to discover a failure at serialization
>>>>>>>    time, which could be much later. The other possibility would be to
>>>>>>>    validate the strings at the input into the sketch, typically in the
>>>>>>>    update() method.
>>>>>>>    3. For C++, there are 3rd party libraries that specialize in
>>>>>>>    UTF_8 validation, including ICU
>>>>>>>    <https://github.com/unicode-org/icu>, UTF8-CPP
>>>>>>>    <https://github.com/nemtrif/utfcpp>, and simdjson
>>>>>>>    <https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/>.
>>>>>>>    (These have standard licensing.) From what I've read, UTF-8
>>>>>>>    validation, if done correctly, can be done very fast, with only a
>>>>>>>    small section of code.
>>>>>>>    4. I am not sure what the solutions are for Rust or Go.
>>>>>>>
>>>>>>> I welcome your feedback.
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Feb 14, 2026 at 1:47 AM tison <[email protected]> wrote:
>>>>>>>
>>>>>>>> This PR [1] of datasketches-rust demonstrates how the Rust impl
>>>>>>>> deserializes String values.
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/datasketches-rust/pull/82
>>>>>>>>
>>>>>>>> If it's std::string::String, then it must be of UTF-8 encoding. And
>>>>>>>> we check the encoding on deserialization.
>>>>>>>>
>>>>>>>> However, the Rust ecosystem also supports "strings" that do not use
>>>>>>>> UTF-8, such as BStr.
>>>>>>>>
>>>>>>>> So, my opinions are:
>>>>>>>>
>>>>>>>> 1. It's good to assume serialized string data to be valid UTF-8.
>>>>>>>> 2. Even if it isn't, for datasketches-rust, users should be able to
>>>>>>>> choose a proper type to deserialize the bytes into a type that doesn't
>>>>>>>> require UTF-8 encoding.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> tison.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Feb 14, 2026 at 5:24 PM Hyeonho Kim <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> While working on UTF-8 validation for the AoS tuple sketch in C++
>>>>>>>>> (ref: https://github.com/apache/datasketches-cpp/pull/476),
>>>>>>>>> a broader design question came up that may affect multiple sketches.
>>>>>>>>>
>>>>>>>>> Based on my current understanding:
>>>>>>>>>
>>>>>>>>> - In datasketches-java, string serialization already produces
>>>>>>>>> valid UTF-8 bytes via getBytes(StandardCharsets.UTF_8). So
>>>>>>>>> Java-generated artifacts already assume valid UTF-8 string encoding.
>>>>>>>>> - Rust and Python string types represent Unicode text and can be
>>>>>>>>> encoded to UTF-8. Please correct me if I am mistaken. (I don't know
>>>>>>>>> Rust and Python well.)
>>>>>>>>> - In Go, string is a byte sequence and may contain invalid UTF-8
>>>>>>>>> unless explicitly validated. So during serialization, it may produce
>>>>>>>>> invalid UTF-8 sequences.
>>>>>>>>> - In C++, std::string is also a byte container and does not
>>>>>>>>> enforce UTF-8 validity. So during serialization, it may produce
>>>>>>>>> invalid UTF-8 sequences.
>>>>>>>>>
>>>>>>>>> If I am mistaken on any of these points, I would appreciate
>>>>>>>>> corrections.
>>>>>>>>>
>>>>>>>>> If we want to maintain cross-language portability for serialized
>>>>>>>>> artifacts, one possible approach would be to ensure that any
>>>>>>>>> serialized string data is valid UTF-8. This could potentially apply
>>>>>>>>> to any sketches that serialize or deserialize string data.
>>>>>>>>>
>>>>>>>>> There seem to be several possible approaches:
>>>>>>>>> - Validate UTF-8 at serialization boundaries
>>>>>>>>> - Document that input strings must be valid UTF-8 and rely on
>>>>>>>>> caller discipline
>>>>>>>>>
>>>>>>>>> At this point I am not proposing a specific solution. I would like
>>>>>>>>> to hear opinions from the community on whether we want to require
>>>>>>>>> serialized string data to be valid UTF-8 for cross-language
>>>>>>>>> portability.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Hyeonho
>>>>>>>>>
>>>>>>>>
