Hi all,

While working on UTF-8 validation for the AoS tuple sketch in C++ (ref:
https://github.com/apache/datasketches-cpp/pull/476), a broader design
question came up that may affect multiple sketches.

Based on my current understanding:

- In datasketches-java, string serialization already produces valid UTF-8
bytes via getBytes(StandardCharsets.UTF_8). So Java-generated artifacts
already assume valid UTF-8 string encoding.
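- Rust and Python string types represent Unicode text and can be encoded to
UTF-8. Rust's String is guaranteed to hold valid UTF-8; a Python str can
hold lone surrogates, which the default str.encode("utf-8") rejects. (I am
not very familiar with Rust or Python.)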
- In Go, string is a byte sequence and may contain invalid UTF-8 unless
explicitly validated. So during serialization, it may produce invalid UTF-8
sequences.
- In C++, std::string is also a byte container and does not enforce UTF-8
validity. So during serialization, it may produce invalid UTF-8 sequences.

If I am mistaken on any of these points, I would appreciate corrections.

If we want to maintain cross-language portability for serialized artifacts,
one possible approach would be to ensure that any serialized string data is
valid UTF-8. This could potentially apply to any sketches that serialize or
deserialize string data.

There seem to be at least two possible approaches:
- Validate UTF-8 at serialization boundaries
- Document that input strings must be valid UTF-8 and rely on caller
discipline
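
To make the first option concrete, here is a rough C++ sketch of what a
check at the serialization boundary might look like. The names
is_valid_utf8 and check_utf8_or_throw are hypothetical and are not taken
from datasketches-cpp or from the PR above; the rules applied are the
well-formedness rules of RFC 3629 (no overlong forms, no surrogates,
nothing above U+10FFFF).

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns true if s is well-formed UTF-8 (RFC 3629).
bool is_valid_utf8(const std::string& s) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b = p[i];
    if (b < 0x80) {  // single-byte (ASCII) sequence
      ++i;
      continue;
    }
    std::size_t len = 0;       // total bytes in this sequence
    std::uint32_t cp = 0;      // decoded code point
    std::uint32_t min_cp = 0;  // smallest code point allowed for this length
    if ((b & 0xE0) == 0xC0) {
      len = 2; cp = b & 0x1F; min_cp = 0x80;
    } else if ((b & 0xF0) == 0xE0) {
      len = 3; cp = b & 0x0F; min_cp = 0x800;
    } else if ((b & 0xF8) == 0xF0) {
      len = 4; cp = b & 0x07; min_cp = 0x10000;
    } else {
      return false;  // stray continuation byte or invalid lead byte
    }
    if (n - i < len) return false;  // sequence is truncated
    for (std::size_t k = 1; k < len; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (p[i + k] & 0x3F);
    }
    if (cp < min_cp) return false;                   // overlong encoding
    if (cp > 0x10FFFF) return false;                 // beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    i += len;
  }
  return true;
}

// Hypothetical boundary check: a serializer could call this before writing
// string bytes, and a deserializer after reading them.
void check_utf8_or_throw(const std::string& s) {
  if (!is_valid_utf8(s)) {
    throw std::invalid_argument("string is not valid UTF-8");
  }
}
```

The second option would skip such a check entirely and only state the
requirement in the documentation, which avoids the per-string scan but
lets invalid bytes reach the serialized artifact.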

At this point I am not proposing a specific solution. I would like to hear
opinions from the community on one question: do we want to require
serialized string data to be valid UTF-8 for cross-language portability?

Thanks,

Hyeonho
