rynewang opened a new pull request, #3303: URL: https://github.com/apache/iceberg-python/pull/3303
## Summary

Adds `CythonBinaryEncoder`, mirroring the existing `CythonBinaryDecoder` (`decoder_fast.pyx`). The pure-Python `BinaryEncoder` emits each varint byte as a fresh `bytes([x])` allocation plus a stream-write call; the Cython implementation writes into a growable `char*` buffer with inlined zigzag encoding and `memcpy`, then materialises the result once via `getvalue()`.

## Integration

`AvroOutputFile.write_block` now constructs its in-memory block encoder via a new `new_memory_encoder()` factory (same pattern as `new_decoder()`). The factory returns `CythonBinaryEncoder` when the extension is built, otherwise a thin `MemoryBinaryEncoder` wrapper around the existing `BinaryEncoder` + `BytesIO`. The header/framing encoder (`self.encoder`) is unchanged: it writes directly to the output stream and is low-volume.

## Benchmark

Encoding 50k `ManifestEntry` records (14 columns with full column stats: `column_sizes`, `value_counts`, `null_value_counts`, `lower_bounds`, `upper_bounds`) through the real `construct_writer` tree:

| encoder | wall | throughput | output bytes |
|---|---|---|---|
| pure Python | 1.64 s | 30.5 k/s | 18,492,808 |
| Cython | 0.36 s | 138.0 k/s | 18,492,808 |

This is a ~4.5× speedup at the encoder-leaf level; the remaining time is the Python `Writer` tree dispatch, which is unchanged.

## Testing

- `tests/avro/test_encoder.py` is parametrised over both implementations so every primitive assertion runs against each.
- New `test_int_round_trip` covers zigzag edge cases including `int64` min/max via encode→`new_decoder`→assert.
- New `test_encoders_byte_identical` asserts both implementations produce identical bytes for a mixed payload.
- Existing `tests/avro/` (171 tests) and `tests/utils/test_manifest.py` (manifest write/read round-trip) pass.

## Notes

- `write_utf8` / `write_bytes` accept untyped args (matching the pure-Python duck-typed behaviour) since callers pass `str`-enum values like `FileFormat.PARQUET`.
- `write_float` / `write_double` use `STRUCT_FLOAT.pack` (explicit little-endian) rather than raw `memcpy`, same as the decoder; they're not on the hot path.
- Zigzag is done on `uint64_t` to avoid signed-shift UB.
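To make the zigzag and varint points concrete, here is a minimal pure-Python sketch of the two write strategies the summary contrasts: one stream write per byte versus appending into a single growable buffer. It is not the PR's code. The helper names (`zigzag64`, `write_int_per_byte`, `write_int_buffered`, `read_int`) are hypothetical, and a `bytearray` stands in for the Cython `char*` buffer. The 64-bit mask imitates the `uint64_t` arithmetic mentioned in the last note, since Python integers are unbounded.

```python
import io

INT64_MIN = -(1 << 63)
INT64_MAX = (1 << 63) - 1
MASK64 = (1 << 64) - 1


def zigzag64(n: int) -> int:
    """Map a signed 64-bit int to unsigned: 0, -1, 1, -2 become 0, 1, 2, 3."""
    # Masking to 64 bits imitates uint64_t wrap-around on unbounded Python ints.
    return ((n << 1) ^ (n >> 63)) & MASK64


def write_int_per_byte(stream, n: int) -> None:
    """Per-byte strategy: one bytes object and one write call per varint byte."""
    z = zigzag64(n)
    while z & ~0x7F:  # more than 7 bits remain
        stream.write(bytes([(z & 0x7F) | 0x80]))  # low 7 bits + continuation bit
        z >>= 7
    stream.write(bytes([z]))


def write_int_buffered(buf: bytearray, n: int) -> None:
    """Buffered strategy: append into one growable buffer, materialise later."""
    z = zigzag64(n)
    while z & ~0x7F:
        buf.append((z & 0x7F) | 0x80)
        z >>= 7
    buf.append(z)


def read_int(data: bytes, pos: int = 0):
    """Decode one zigzag varint; return (value, position after the varint)."""
    z = 0
    shift = 0
    while True:
        b = data[pos]
        pos += 1
        z |= (b & 0x7F) << shift
        if not b & 0x80:  # continuation bit clear: last byte
            break
        shift += 7
    return (z >> 1) ^ -(z & 1), pos


def encode_both(n: int):
    """Encode n with both strategies and return the two byte strings."""
    stream = io.BytesIO()
    write_int_per_byte(stream, n)
    buf = bytearray()
    write_int_buffered(buf, n)
    return stream.getvalue(), bytes(buf)
```

Both strategies should yield the same bytes for any `int64` value; for example, `-1` encodes to the single byte `0x01` and `64` to `0x80 0x01`. The difference is only in how many intermediate objects and write calls are made along the way.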

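The factory described under Integration follows a common optional-extension pattern: try to import the compiled implementation and fall back to a pure-Python wrapper if it is missing. The sketch below illustrates that pattern only. The module `_hypothetical_fast_encoder` and the classes `FastMemoryEncoder`, `StreamEncoder`, and `MemoryEncoder` are invented stand-ins, not the names or layout used in the PR, and subclassing is an assumption about how the wrapper might be built.

```python
import io

try:
    # Hypothetical compiled module, standing in for the Cython extension.
    from _hypothetical_fast_encoder import FastMemoryEncoder
except ImportError:
    FastMemoryEncoder = None  # extension not built; use the fallback below


class StreamEncoder:
    """Illustrative stand-in for an encoder that writes to an output stream."""

    def __init__(self, output_stream):
        self._output_stream = output_stream

    def write(self, data: bytes) -> None:
        self._output_stream.write(data)


class MemoryEncoder(StreamEncoder):
    """Thin wrapper: a stream encoder that owns its own BytesIO buffer."""

    def __init__(self):
        self._buffer = io.BytesIO()
        super().__init__(self._buffer)

    def getvalue(self) -> bytes:
        # Return everything written so far as one bytes object.
        return self._buffer.getvalue()


def new_memory_encoder():
    """Return the compiled encoder if it imported, else the pure-Python one."""
    if FastMemoryEncoder is not None:
        return FastMemoryEncoder()
    return MemoryEncoder()
```

Callers only depend on `write(...)` and `getvalue()`, so the two implementations stay interchangeable as long as both expose that surface.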