RussellSpitzer opened a new pull request, #15634:
URL: https://github.com/apache/iceberg/pull/15634

   Extends the V4 manifest writer so it can write manifests in either Parquet 
or Avro, chosen by the file extension. The SDK now also defaults to Parquet 
manifests when the format version is 4. This could be parameterized later, 
but that would require parameterizing the test suites, so I decided on a single 
format (Parquet) for now.
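   As a rough illustration of the extension-based choice described above, here 
is a minimal, self-contained sketch. `ManifestFormat`, `fromPath`, and 
`defaultFor` are hypothetical names made up for this example and are not the 
code in this PR; treating every version at or above 4 as Parquet is also an 
assumption.

```java
import java.util.Locale;

// Hypothetical stand-in for illustration only; not the Iceberg API.
enum ManifestFormat {
  AVRO(".avro"),
  PARQUET(".parquet");

  private final String extension;

  ManifestFormat(String extension) {
    this.extension = extension;
  }

  // Choose the manifest format from the file name's extension (case-insensitive).
  static ManifestFormat fromPath(String path) {
    String lower = path.toLowerCase(Locale.ROOT);
    for (ManifestFormat format : values()) {
      if (lower.endsWith(format.extension)) {
        return format;
      }
    }
    throw new IllegalArgumentException("Unknown manifest file extension: " + path);
  }

  // Default format for newly written manifests.
  // Assumption: Parquet from format version 4 onward, Avro for earlier versions.
  static ManifestFormat defaultFor(int formatVersion) {
    return formatVersion >= 4 ? PARQUET : AVRO;
  }
}
```

   With this shape, a writer would call `defaultFor(formatVersion)` to name a 
new manifest file, and a reader would call `fromPath(path)` to decide how to 
open an existing one.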
   
   There are a few other required changes here outside of testing:
   
   1. Handling of splitOffsets in Parquet needs to change, since BaseFile 
returns an immutable view, which Parquet was attempting to reuse by clearing it.
   
   2. Unpartitioned tables need special care since Parquet cannot store empty 
structs in the schema. Reading Parquet manifests therefore means skipping 
the partition field and then adjusting read offsets if the partition is not 
defined. The read code is currently shared between all versions, so this 
change affects older Avro readers as well.
   
   3. Some of the test code for TestReplacePartitions assumed that you could 
validate against a slightly different version of the table. This is a problem 
if the table you make is partitioned and the validation table is unpartitioned. 
It used to work, accidentally I think, because we would commit unpartitioned 
operations to a partitioned table.
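   
   To illustrate item 1 above: clearing a list that is really an immutable 
view fails at runtime, so the offsets have to be copied into a fresh list 
instead of being reused. The sketch below is a generic Java illustration under 
that assumption; `SplitOffsetsSketch` and its methods are hypothetical and are 
not the actual Parquet or BaseFile code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not the code changed in this PR.
final class SplitOffsetsSketch {
  private SplitOffsetsSketch() {}

  // Stand-in for a metadata object that exposes its offsets as an immutable view.
  static List<Long> immutableView(Long... offsets) {
    return Collections.unmodifiableList(Arrays.asList(offsets));
  }

  // Problematic pattern: treat the incoming list as a reusable buffer.
  // Throws UnsupportedOperationException when the list is an immutable view.
  static List<Long> reuseByClearing(List<Long> reused, long firstOffset) {
    reused.clear();
    reused.add(firstOffset);
    return reused;
  }

  // Safer pattern: never mutate the caller's list; start from a fresh one.
  static List<Long> freshList(long firstOffset) {
    List<Long> offsets = new ArrayList<>();
    offsets.add(firstOffset);
    return offsets;
  }
}
```

   Allocating a new list costs a small allocation per file, but it keeps the 
reader independent of whether the caller hands back a mutable or immutable 
collection.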
   
   --- Some Benchmarks
   *Note: this is all done with full reads. While we expect writes to be slower, 
reads should be faster once we actually do column-specific projection. Since in 
this code the Avro and Parquet read paths both do full scans, we don't 
expect them to be materially different.*
   
   <img width="1561" height="1086" alt="image" 
src="https://github.com/user-attachments/assets/3ef17093-3745-4417-8683-bcd81176f82a"
 />
   
   I also deleted the old manifest benchmarks, which were specific to V1 and 
V1/V2 respectively, and replaced them with a new benchmark that can be used on 
any version.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

