jonbjo opened a new pull request, #97: URL: https://github.com/apache/datasketches-rust/pull/97
## Summary

Add binary serialization support for Theta sketches, compatible with the Apache DataSketches Java and C++ implementations.

## Changes

- Add `CompactThetaSketch` type with `serialize()` and `deserialize()` methods
- Add convenience methods on `ThetaSketch`: `compact()`, `serialize()`, `deserialize()`
- Support all compact sketch formats: empty, single-item, exact mode, and estimation mode
- Handle legacy `seed_hash=0` format for backward compatibility
- Add cross-language compatibility tests using Java-generated test data

## Motivation

This enables reading and writing Theta sketches in formats used by Iceberg Puffin files.

## Limitations

This PR focuses on the standard compact format. Identified features not included:

- **Compressed format** - Java's `toByteArrayCompressed()` uses bit-packing for smaller size; not supported
- **Non-compact format** - Only compact sketches can be deserialized
- **Single-item serialization optimization** - We deserialize single-item format but always serialize using the standard exact-mode format (functionally correct, just slightly larger for single-item sketches)

These could be added in follow-up PRs if needed.

## Testing

- Unit tests for serialization round-trips
- Cross-language compatibility tests that deserialize sketches generated by `datasketches-java`
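
## Illustrative sketch

To make the format handling described above concrete, here is a minimal, self-contained sketch of how the four compact cases (empty, single-item, exact, estimation) and the legacy `seed_hash=0` case might be distinguished. It is not the code in this PR. `ToyCompactSketch`, `DecodeError`, `read_array`, and every byte offset and flag bit below are assumptions chosen for illustration, only loosely modelled on the DataSketches compact layout; the real preamble also sets flags this toy ignores, so check the Java or C++ sources before relying on any of it.

```rust
// Hypothetical, simplified layout (all offsets are assumptions):
// byte 0: preamble longs, 1: serial version, 2: family, 5: flags,
// bytes 6..8: seed hash, 8..12: entry count, 16..24: theta (estimation only).
const SERIAL_VERSION: u8 = 3;
const FAMILY_COMPACT: u8 = 3;
const FLAG_EMPTY: u8 = 1 << 2;
const FLAG_SINGLE_ITEM: u8 = 1 << 5;
const MAX_THETA: u64 = i64::MAX as u64; // theta == MAX_THETA means exact mode

#[derive(Debug, Clone, PartialEq)]
pub struct ToyCompactSketch {
    pub theta: u64,
    pub seed_hash: u16,
    pub entries: Vec<u64>,
}

#[derive(Debug, PartialEq)]
pub enum DecodeError {
    TooShort,
    BadHeader,
    SeedMismatch,
}

// Copy N bytes starting at `at`, or fail if the input is truncated.
fn read_array<const N: usize>(bytes: &[u8], at: usize) -> Result<[u8; N], DecodeError> {
    let slice = bytes.get(at..at + N).ok_or(DecodeError::TooShort)?;
    let mut buf = [0u8; N];
    buf.copy_from_slice(slice);
    Ok(buf)
}

impl ToyCompactSketch {
    pub fn serialize(&self) -> Vec<u8> {
        let estimating = self.theta < MAX_THETA;
        let empty = self.entries.is_empty() && !estimating;
        let pre_longs: u8 = if empty { 1 } else if estimating { 3 } else { 2 };
        let flags = if empty { FLAG_EMPTY } else { 0 };
        let mut out = vec![pre_longs, SERIAL_VERSION, FAMILY_COMPACT, 0, 0, flags];
        out.extend_from_slice(&self.seed_hash.to_le_bytes());
        if empty {
            return out;
        }
        // A single-item sketch is written in the exact-mode form, never the short form.
        out.extend_from_slice(&(self.entries.len() as u32).to_le_bytes());
        out.extend_from_slice(&[0u8; 4]); // unused padding in this toy layout
        if estimating {
            out.extend_from_slice(&self.theta.to_le_bytes());
        }
        for hash in &self.entries {
            out.extend_from_slice(&hash.to_le_bytes());
        }
        out
    }

    pub fn deserialize(bytes: &[u8], expected_seed_hash: u16) -> Result<Self, DecodeError> {
        if bytes.len() < 8 {
            return Err(DecodeError::TooShort);
        }
        if bytes[1] != SERIAL_VERSION || bytes[2] != FAMILY_COMPACT {
            return Err(DecodeError::BadHeader);
        }
        let flags = bytes[5];
        let stored = u16::from_le_bytes([bytes[6], bytes[7]]);
        // Legacy data may carry seed_hash = 0; accept it and adopt the expected value.
        if stored != 0 && stored != expected_seed_hash {
            return Err(DecodeError::SeedMismatch);
        }
        let seed_hash = expected_seed_hash;
        if flags & FLAG_EMPTY != 0 {
            return Ok(Self { theta: MAX_THETA, seed_hash, entries: Vec::new() });
        }
        if flags & FLAG_SINGLE_ITEM != 0 {
            // Short form: the one hash follows the 8-byte preamble directly.
            let hash = u64::from_le_bytes(read_array(bytes, 8)?);
            return Ok(Self { theta: MAX_THETA, seed_hash, entries: vec![hash] });
        }
        let (theta, data_start) = match bytes[0] {
            2 => (MAX_THETA, 16),
            3 => (u64::from_le_bytes(read_array(bytes, 16)?), 24),
            _ => return Err(DecodeError::BadHeader),
        };
        let count = u32::from_le_bytes(read_array(bytes, 8)?) as usize;
        let mut entries = Vec::new(); // no pre-allocation from an untrusted count
        for i in 0..count {
            entries.push(u64::from_le_bytes(read_array(bytes, data_start + 8 * i)?));
        }
        Ok(Self { theta, seed_hash, entries })
    }
}
```

In this toy, a one-entry sketch serializes to 24 bytes (16-byte preamble plus one hash) while the short single-item form is 16 bytes, which mirrors the "slightly larger" trade-off noted under Limitations.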
