This PR [1] of datasketches-rust demonstrates how the Rust impl deserializes String values.
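
As a rough illustration of the idea (a minimal sketch, not the code from the PR; the function names `decode_string` and `decode_bytes` are hypothetical), strict decoding into a `String` can reject invalid UTF-8, while a byte-oriented type (a plain `Vec<u8>` here) accepts any payload:

```rust
use std::str::Utf8Error;

/// Hypothetical strict decoder: the payload must be valid UTF-8
/// before it can become a `String`.
fn decode_string(bytes: &[u8]) -> Result<String, Utf8Error> {
    // `std::str::from_utf8` validates the encoding without copying;
    // the copy into an owned `String` happens only on success.
    std::str::from_utf8(bytes).map(String::from)
}

/// Hypothetical permissive decoder: keeps the raw bytes and performs
/// no encoding check, so payloads that are not UTF-8 still round-trip.
fn decode_bytes(bytes: &[u8]) -> Vec<u8> {
    bytes.to_vec()
}
```

With these helpers, a payload such as `[0x66, 0x6f, 0xff]` fails `decode_string` but is preserved unchanged by `decode_bytes`.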

[1] https://github.com/apache/datasketches-rust/pull/82

A std::string::String must be valid UTF-8, so we check the encoding on
deserialization.

However, the Rust ecosystem also supports "strings" that do not use UTF-8,
such as BStr.

So, my opinions are:

1. It's good to assume serialized string data to be valid UTF-8.
2. Even if it isn't, for datasketches-rust, users should be able to
deserialize the bytes into a type that doesn't require UTF-8 encoding.

Best,
tison.

Hyeonho Kim <[email protected]> wrote on Sat, Feb 14, 2026 at 17:24:

> Hi all,
>
> While working on UTF-8 validation for the AoS tuple sketch in C++ (ref:
> https://github.com/apache/datasketches-cpp/pull/476), a broader design
> question came up that may affect multiple sketches.
>
> Based on my current understanding:
>
> - In datasketches-java, string serialization already produces valid UTF-8
> bytes via getBytes(StandardCharsets.UTF_8). So Java-generated artifacts
> already assume valid UTF-8 string encoding.
> - Rust and Python string types represent Unicode text and can be encoded
> to UTF-8. Please correct me if I am mistaken. (I don't know Rust and Python
> well)
> - In Go, string is a byte sequence and may contain invalid UTF-8 unless
> explicitly validated. So during serialization, it may produce invalid UTF-8
> sequences.
> - In C++, std::string is also a byte container and does not enforce UTF-8
> validity. So during serialization, it may produce invalid UTF-8 sequences.
>
> If I am mistaken on any of these points, I would appreciate corrections.
>
> If we want to maintain cross-language portability for serialized
> artifacts, one possible approach would be to ensure that any serialized
> string data is valid UTF-8. This could potentially apply to any sketches
> that serialize or deserialize string data.
>
> There seem to be several possible approaches:
> - Validate UTF-8 at serialization boundaries
> - Document that input strings must be valid UTF-8 and rely on caller
> discipline
>
> At this point I am not proposing a specific solution. I would like to hear
> opinions from the community on whether we want to require serialized string
> data to be valid UTF-8 for cross-language portability.
>
> Thanks,
>
> Hyeonho
>
