Hi all,

While working on UTF-8 validation for the AoS tuple sketch in C++ (ref: https://github.com/apache/datasketches-cpp/pull/476), a broader design question came up that may affect multiple sketches.

Based on my current understanding:

- In datasketches-java, string serialization already produces valid UTF-8 bytes via getBytes(StandardCharsets.UTF_8), so Java-generated artifacts already assume valid UTF-8 string encoding.
- Rust and Python string types represent Unicode text and can be encoded to UTF-8. Please correct me if I am mistaken. (I don't know Rust or Python well.)
- In Go, string is a byte sequence and may contain invalid UTF-8 unless explicitly validated, so serialization may produce invalid UTF-8 sequences.
- In C++, std::string is also a byte container and does not enforce UTF-8 validity, so serialization may likewise produce invalid UTF-8 sequences.

If I am mistaken on any of these points, I would appreciate corrections.

If we want to maintain cross-language portability for serialized artifacts, one possible approach would be to ensure that any serialized string data is valid UTF-8. This could potentially apply to any sketch that serializes or deserializes string data.

There seem to be at least two possible approaches:

- Validate UTF-8 at serialization boundaries
- Document that input strings must be valid UTF-8 and rely on caller discipline

At this point I am not proposing a specific solution. I would like to hear opinions from the community on the following question:

Do we want to require serialized string data to be valid UTF-8 for cross-language portability?

Thanks,
Hyeonho
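
P.S. To make the "validate at serialization boundaries" option more concrete, here is a minimal sketch of what such a check might look like in C++. The names is_valid_utf8 and check_utf8_or_throw are hypothetical and are not part of datasketches-cpp or of the PR above. Throwing an exception is only one possible policy, assumed here for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not an existing datasketches-cpp API).
// Returns true if s is well-formed UTF-8 in the RFC 3629 sense:
// no overlong encodings, no surrogates (U+D800..U+DFFF),
// and no code points above U+10FFFF.
bool is_valid_utf8(const std::string& s) {
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const std::uint8_t b0 = static_cast<std::uint8_t>(s[i]);
    if (b0 < 0x80) { ++i; continue; }  // single-byte (ASCII) sequence

    std::size_t len = 0;      // total length of the sequence in bytes
    std::uint32_t cp = 0;     // decoded code point
    std::uint32_t min_cp = 0; // smallest code point allowed for this length
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else return false;  // stray continuation byte or invalid lead byte

    if (n - i < len) return false;  // sequence is truncated
    for (std::size_t k = 1; k < len; ++k) {
      const std::uint8_t b = static_cast<std::uint8_t>(s[i + k]);
      if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min_cp) return false;                   // overlong encoding
    if (cp > 0x10FFFF) return false;                 // beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    i += len;
  }
  return true;
}

// Hypothetical boundary check: a serializer could call this before writing
// string bytes, rejecting the input instead of emitting invalid UTF-8.
void check_utf8_or_throw(const std::string& s) {
  if (!is_valid_utf8(s)) {
    throw std::invalid_argument("string is not valid UTF-8");
  }
}
```

Under the second option (caller discipline), the same helper could still be offered for callers to run on their own inputs, without the serializer enforcing it.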
