rynewang opened a new pull request, #3303:
URL: https://github.com/apache/iceberg-python/pull/3303

   ## Summary
   
   Adds `CythonBinaryEncoder`, mirroring the existing `CythonBinaryDecoder` 
(`decoder_fast.pyx`). The pure-Python `BinaryEncoder` emits each varint byte 
as a fresh `bytes([x])` allocation plus a stream-write call. The Cython 
implementation instead writes into a growable `char*` buffer with inlined 
zigzag encoding and `memcpy`, then materialises the result once via 
`getvalue()`.
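
To make the buffer-based approach concrete, here is a minimal pure-Python sketch of the same idea. `BufferEncoder` is a hypothetical name for illustration, not the class added by this PR, and a `bytearray` stands in for the growable `char*` buffer:

```python
class BufferEncoder:
    """Illustrative stand-in only; not the Cython class from this PR."""

    def __init__(self) -> None:
        # Plays the role of the growable char* buffer.
        self._buf = bytearray()

    def write_int(self, n: int) -> None:
        # Zigzag: interleave negative and positive values so that small
        # magnitudes produce small unsigned numbers. Python ints are
        # unbounded, so mask the result to 64 bits.
        zz = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
        # Varint: 7 payload bits per byte, high bit set while more follow.
        while zz > 0x7F:
            self._buf.append((zz & 0x7F) | 0x80)
            zz >>= 7
        self._buf.append(zz)

    def write_bytes(self, b: bytes) -> None:
        # Length prefix, then a bulk append (the analogue of memcpy).
        self.write_int(len(b))
        self._buf += b

    def getvalue(self) -> bytes:
        # Materialise the accumulated bytes once, at the end.
        return bytes(self._buf)


enc = BufferEncoder()
for value in (0, -1, 1, 64):
    enc.write_int(value)
print(enc.getvalue())  # b'\x00\x01\x02\x80\x01'
```

No per-byte `bytes` objects are created and no stream call happens until `getvalue()`, which is where the pure-Python encoder spends its time.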
   
   ## Integration
   
   `AvroOutputFile.write_block` now constructs its in-memory block encoder via 
a new `new_memory_encoder()` factory (same pattern as `new_decoder()`): returns 
`CythonBinaryEncoder` when the extension is built, otherwise a thin 
`MemoryBinaryEncoder` wrapper around the existing `BinaryEncoder` + `BytesIO`. 
The header/framing encoder (`self.encoder`) is unchanged — it writes directly 
to the output stream and is low-volume.
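
The factory-with-fallback pattern described above could look roughly like the following sketch. The module name `_hypothetical_encoder_fast` and the simplified `BinaryEncoder` are assumptions made for illustration, and whether the wrapper subclasses or composes `BinaryEncoder` is also a guess; the real import path and class bodies live in the PR diff:

```python
from io import BytesIO

try:
    # Hypothetical module name standing in for the compiled extension.
    from _hypothetical_encoder_fast import CythonBinaryEncoder
except ImportError:
    CythonBinaryEncoder = None  # extension not built


class BinaryEncoder:
    """Simplified stand-in for a stream-backed encoder."""

    def __init__(self, output_stream) -> None:
        self._output_stream = output_stream

    def write(self, b: bytes) -> None:
        self._output_stream.write(b)


class MemoryBinaryEncoder(BinaryEncoder):
    """Thin wrapper: a BinaryEncoder that owns its own BytesIO."""

    def __init__(self) -> None:
        self._buffer = BytesIO()
        super().__init__(self._buffer)

    def getvalue(self) -> bytes:
        return self._buffer.getvalue()


def new_memory_encoder():
    # Prefer the compiled encoder, fall back to pure Python otherwise.
    if CythonBinaryEncoder is not None:
        return CythonBinaryEncoder()
    return MemoryBinaryEncoder()
```

Both branches expose the same `getvalue()` method, so a caller such as `write_block` does not need to know which implementation it received.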
   
   ## Benchmark
   
   Encoding 50k `ManifestEntry` records (14 columns with full column stats — 
`column_sizes`, `value_counts`, `null_value_counts`, `lower_bounds`, 
`upper_bounds`), through the real `construct_writer` tree:
   
   | encoder | wall time | throughput (records/s) | output bytes |
   |---|---|---|---|
   | pure Python | 1.64 s | 30.5 k/s | 18,492,808 |
   | Cython | 0.36 s | 138.0 k/s | 18,492,808 |
   
   ~4.5× faster overall (1.64 s vs 0.36 s). The gain comes entirely from the 
encoder-leaf calls; the remaining time is the Python `Writer` tree dispatch, 
which is unchanged.
   
   ## Testing
   
   - `tests/avro/test_encoder.py` is parametrised over both implementations so 
every primitive assertion runs against each.
   - New `test_int_round_trip` covers zigzag edge cases including `int64` 
min/max via encode→`new_decoder`→assert.
   - New `test_encoders_byte_identical` asserts both implementations produce 
identical bytes for a mixed payload.
   - Existing `tests/avro/` (171 tests) and `tests/utils/test_manifest.py` 
(manifest write/read round-trip) pass.
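
For readers unfamiliar with the edge cases involved, the round-trip check can be sketched in isolation as below. `encode_int` and `decode_int` are self-contained stand-ins written for this example; the actual test goes through the real encoders and `new_decoder`:

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def encode_int(n: int) -> bytes:
    # Zigzag, masked to 64 bits, followed by varint serialisation.
    zz = ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF
    out = bytearray()
    while zz > 0x7F:
        out.append((zz & 0x7F) | 0x80)
        zz >>= 7
    out.append(zz)
    return bytes(out)


def decode_int(data: bytes) -> int:
    # Reassemble the 7-bit groups, then undo the zigzag mapping.
    zz = 0
    shift = 0
    for byte in data:
        zz |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (zz >> 1) ^ -(zz & 1)


# Values on either side of the 1-byte/2-byte boundary, plus the int64 extremes.
EDGE_CASES = [0, 1, -1, 63, -64, 64, -65, INT64_MAX, INT64_MIN]


def test_int_round_trip() -> None:
    for value in EDGE_CASES:
        assert decode_int(encode_int(value)) == value


test_int_round_trip()
```

The `int64` extremes matter because they zigzag to the two largest unsigned 64-bit values and therefore exercise the full 10-byte varint path.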
   
   ## Notes
   
   - `write_utf8` / `write_bytes` accept untyped args (matching the pure-Python 
duck-typed behaviour) since callers pass `str`-enum values like 
`FileFormat.PARQUET`.
   - `write_float` / `write_double` use `STRUCT_FLOAT.pack` (explicit 
little-endian) rather than raw `memcpy`, same as the decoder — they're not on 
the hot path.
   - Zigzag is done on `uint64_t` to avoid signed-shift UB.
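
The three notes above can be illustrated with a short Python sketch. Everything here is a stand-in: `FileFormat` is a minimal re-creation of a `str`-enum, `STRUCT_FLOAT` is assumed to be a `"<f"` struct, and `zigzag_u64` emulates `uint64_t` arithmetic with masking, since Python itself has no fixed-width integers:

```python
import struct
from enum import Enum

MASK64 = 0xFFFFFFFFFFFFFFFF
STRUCT_FLOAT = struct.Struct("<f")   # assumed: explicit little-endian float
STRUCT_DOUBLE = struct.Struct("<d")  # assumed: explicit little-endian double


class FileFormat(str, Enum):
    """Minimal stand-in for a str-enum passed to write_utf8."""
    PARQUET = "PARQUET"


def utf8_payload(s) -> bytes:
    # Untyped argument: anything with str's encode() works, including
    # str-enum members, which a strictly typed `str s` might reject.
    return s.encode("utf-8")


def zigzag_u64(n: int) -> int:
    # Reinterpret the signed value as uint64_t, then shift and negate
    # in unsigned arithmetic, where overflow wraps instead of being UB.
    u = n & MASK64
    return ((u << 1) & MASK64) ^ (-(u >> 63) & MASK64)


print(utf8_payload(FileFormat.PARQUET))  # b'PARQUET'
print(STRUCT_FLOAT.pack(1.0))            # b'\x00\x00\x80?'
print(zigzag_u64(-1))                    # 1
```

In C, left-shifting a negative signed integer is undefined behaviour, which is why the shift is performed only after the value has been converted to an unsigned type.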


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

