*Thanks to everyone who participated in the File Format Sync!* I think we moved forward quite a bit! You can find the recording here: https://www.youtube.com/watch?v=BvGejwOfn6w
Topics discussed:

- *Exposing Variants*: We agreed to expose only *ReadBuilder.outputSchema* and *WriteBuilder.inputSchema* in the new API. These schemas will guide Variant shredding. We will not allow setting the *VariantShreddingFunction* on the new API. Exposing the engine-specific schema solves another problem too: when a DataFrame contains values stored as *tinyint* or *smallint*, the writer can seamlessly convert and persist them as *integer* columns in the Iceberg table - without requiring any conversion logic from the engine itself (sketched below).
- *Generic vs. Non-Generic Builders*: We weighed the pros and cons of using generic types in ReadBuilder and WriteBuilder. While generics offer type safety, they don't provide direct benefits, since *FormatModelRegistry* must handle all *FormatModel*s and rely on callers to return correct types. Moreover, introducing generics would complicate the implementation due to the need for proxy classes. Based on this, we decided *not to use generic parameters* in the builders provided by the *FormatModels*.
- *Deprecating Old APIs*: Since we're not introducing generics, we concluded that the old APIs don't need to be rewritten and can be reused.
- *Method Names*: To maintain consistency, we'll keep the existing method names and only make changes where necessary.
- *Re-registering Format Models*: We need a mechanism to detect and allow re-registration of the same Format Model with the same keys, which may occur when a Flink Job Manager restarts or Spark reuses a JVM.
- *ContentFileWriteBuilder Interfaces vs. Classes*: We chose to keep these as interfaces for greater flexibility, especially in scenarios involving inheritance.
- *Packaging*: We'll move the relevant classes into a dedicated package called *format*.

I'll update the documentation and PRs accordingly.
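For illustration, here is roughly how the exposed input schema could be used from an engine. This is a sketch only - the method and variable names are illustrative, not the final API:

    // Sketch, not the final API: write Spark rows through the registry.
    WriteBuilder builder =
        FormatModelRegistry.writeBuilder(FileFormat.PARQUET, InternalRow.class, outputFile);
    FileAppender<InternalRow> appender =
        builder
            .schema(table.schema())        // Iceberg schema
            .inputSchema(sparkWriteSchema) // engine-specific schema (e.g. a Spark StructType);
                                           // guides Variant shredding and lets the writer persist
                                           // tinyint/smallint columns as integer columns
            .build();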
If you couldn’t attend the meeting but have concerns about any of the decisions above, please leave a comment so we can revisit them.

Thanks,
Peter

Péter Váry <[email protected]> ezt írta (időpont: 2025. szept. 25., Cs, 11:46):

> Hi Team,
>
> As discussed on the last Community Sync (with the help of Kevin Liu - Thanks!) we created a recurring *File Format API Sync on every second Monday at 9:00-10:00 AM PST / 6:00-7:00 PM CEST starting on 29th of September*. You can join using the following link: https://meet.google.com/fow-bauj-mmg or using the dev calendar.
>
> Also updated the proposal document: https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds
> Added a list of open questions to the end.
>
> I would be happy to see anyone who is interested, so we can speed up the process and move forward with the proposal.
> If you can't make it, or you can't wait :), feel free to ask your questions on the PRs (https://github.com/apache/iceberg/pull/12774, https://github.com/apache/iceberg/pull/12298), on the dev list or on Slack.
>
> Thanks,
> Peter
>
> Péter Váry <[email protected]> ezt írta (időpont: 2025. szept. 24., Sze, 14:35):
>
>> > I think there’s another option that would be cleaner than option 0. Right now, the same FormatModel is being used for both position deletes and data because they both share the object model type, like InternalRow for Spark. But we’ve decided that we don’t intend to support writing position deletes with row data, so there is no longer a need for PositionDelete<InternalRow> and we can instead use PositionDelete<Void> (or introduce an interface to generalize this).
>>
>> That’s a great suggestion. I’ve implemented it with a slight variation: I used a PositionDelete<?> instead of PositionDelete<Void> to avoid requiring users to update their code.
>> You can see how the proposed changes look in the PRs.
>>
>> > Remove unnecessary renames, like setAll -> set
>>
>> Several method name changes were made based on earlier reviewer suggestions. At the time, each change seemed justified. That said, I agree it's worth reviewing them. To help with this, I created a spreadsheet that gives an overview of the changes. I've color-coded them for clarity:
>>
>> - *Green*: unchanged
>> - *Light green*: added for all readers (previously missing functionality)
>> - *Yellow*: changed based on discussions
>> - *Red*: removed
>> - *Light yellow*: return type binding might have changed (only for build methods)
>>
>> You can find the document here: https://docs.google.com/spreadsheets/d/1cBwyrO9-0x-OgquNhfBpkjfF__VRZgKBqujCnlAkXeY/edit?usp=sharing
>>
>> > Use and extend the existing format builders instead of creating new ones to carry the return type.
>>
>> I think this deserves a deeper discussion. In the current PR, each builder has two generic parameters:
>>
>> - *WriteBuilder*
>>   - Input type: the type of object accepted (used to parameterize the Appender)
>>   - Input schema: the expected input schema used for Variant shredding
>> - *ReadBuilder*
>>   - Output type: the type of object returned (used to parameterize the CloseableIterable)
>>   - Output schema: the expected output schema used for Variant shredding
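>> In signatures, that is roughly the following shape (a simplified sketch, not the exact code in the PR):
>>
>>   interface WriteBuilder<D, S> {   // D = input row type, S = engine schema type
>>     WriteBuilder<D, S> schema(Schema icebergSchema);
>>     WriteBuilder<D, S> inputSchema(S engineSchema);   // used for Variant shredding
>>     FileAppender<D> build();
>>   }
>>
>>   interface ReadBuilder<D, S> {    // D = output row type, S = engine schema type
>>     ReadBuilder<D, S> outputSchema(S engineSchema);   // used for Variant shredding
>>     CloseableIterable<D> build();
>>   }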
>> While we could consider using Object for both types or rely on the caller to set them correctly, both types are actually defined by the FormatModel. If users provide incorrect types, it would lead to runtime exceptions rather than compile-time errors—which I believe is poor API design.
>>
>> If we stick with generic parameters, we can’t reuse the existing builders. In that case, we have two options (I’m fine with either, but the second makes deprecating the old API easier):
>>
>> 1. Create a new class that implements the new API and internally uses the old class.
>> 2. Create a new class with the new API, copy the old implementation into it with necessary changes, and refactor the old class to delegate to the new one. The current PoC follows this second approach.
>>
>> I’d love to hear the community’s thoughts on this.
>>
>> Thanks,
>> Peter
>>
>> Ryan Blue <[email protected]> ezt írta (időpont: 2025. szept. 23., K, 20:28):
>>
>>> I think there’s another option that would be cleaner than option 0. Right now, the same FormatModel is being used for both position deletes and data because they both share the object model type, like InternalRow for Spark. But we’ve decided that we don’t intend to support writing position deletes with row data, so there is no longer a need for PositionDelete<InternalRow> and we can instead use PositionDelete<Void> (or introduce an interface to generalize this).
>>>
>>> That means we are either producing FileAppender<InternalRow> or FileAppender<PositionDelete<?>> and we already have a way to do that using the registry. I think it would be better/cleaner to register the same format twice, once for the row type and once for PositionDelete. That would be fairly easy to implement since the registry-based writer factory already resolves <https://github.com/apache/iceberg/pull/12298/files#diff-654036b55ad0af72c786a5c598a47c06400538a461d2db819535cb544a4a5d61R156-R157> the FormatModel every time. That also makes it easier to pass the content type for at least position deletes in the FormatModel that is registered.
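>>> For illustration, registration could then look something like this (hypothetical helper and method names, only to show the idea):
>>>
>>>   // Register the same file format twice: once for the engine row type and
>>>   // once for position deletes; the registry resolves the right model per request.
>>>   FormatModelRegistry.register(parquetModelFor(InternalRow.class));
>>>   FormatModelRegistry.register(parquetModelFor(PositionDelete.class));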
>>> Unfortunately, I’m traveling tomorrow and can’t make it to the sync. But I did another pass at reviewing this today. In addition to updating the position deletes, I think there are some other changes that would help move this forward:
>>>
>>> 1. Remove unnecessary renames, like setAll -> set
>>> 2. Use and extend the existing format builders instead of creating new ones to carry the return type. I don’t think we need to pass the row type through the builder since the current way has been working just fine for years.
>>>
>>> In general, I think this should avoid changing quite so much. This is already a lot to review and additional changes make it harder.
>>>
>>> Ryan
>>>
>>> On Mon, Sep 15, 2025 at 3:58 PM Steven Wu <[email protected]> wrote:
>>>
>>>> Peter, thanks for summarizing the 4 options. Both 0 and 1 seem good to me, as they are explicit and easier to deprecate and remove the position deletes in the future. Maybe option 0 is a tiny bit better as it is similar to the existing FileWriterFactory API.
>>>>
>>>> I will leave PR related comments in the PR directly.
>>>>
>>>> On Mon, Sep 15, 2025 at 8:38 AM Péter Váry <[email protected]> wrote:
>>>>
>>>>> Thanks for the feedback @Russell and @Renjie!
>>>>>
>>>>> Updated the PR accordingly.
>>>>> Also removed the possibility to set the row schema for the position delete writer. We will not need that after the PDWR deprecation.
>>>>>
>>>>> You can see one possible implementation in https://github.com/apache/iceberg/pull/12298 - we can discuss that separately. I just made sure that the new API is able to serve all of the current needs.
>>>>>
>>>>> @Ryan: What are your thoughts?
>>>>>
>>>>> Are we in a stage when we can vote on the current API?
>>>>>
>>>>> Thanks,
>>>>> Peter
>>>>>
>>>>> Renjie Liu <[email protected]> ezt írta (időpont: 2025. szept. 15., H, 12:08):
>>>>>
>>>>>> I would also vote for option 0. This api has clean separation and makes refactoring easier; e.g. when we completely deprecate v2 tables, we could mark the *positionDeleteWriteBuilder* method as deprecated, and it would be easier to remove its usage.
>>>>>>
>>>>>> On Fri, Sep 12, 2025 at 11:24 PM Russell Spitzer <[email protected]> wrote:
>>>>>>
>>>>>>> Now that I fully understand the situation I think option 0 as you've written is probably the best thing to do as long as PositionDelete is a class. I think with hindsight it probably shouldn't have been a class and always been an interface, so that our internal code could produce rows which implement PositionDelete rather than PositionDeletes that wrap rows.
>>>>>>>
>>>>>>> On Fri, Sep 12, 2025 at 8:02 AM Péter Váry <[email protected]> wrote:
>>>>>>>
>>>>>>>> Let me summarize the state a bit:
>>>>>>>>
>>>>>>>> The FileFormat interface needs to expose two distinct methods:
>>>>>>>>
>>>>>>>> - WriteBuilder<InternalRow>
>>>>>>>> - WriteBuilder<PositionDelete<InternalRow>>
>>>>>>>>   - After the PDWR deprecation this will be WriteBuilder<PositionDelete>
>>>>>>>>   - After V2 deprecation this will not be needed anymore
>>>>>>>>
>>>>>>>> Based on the file format methods, the Registry must support four builder types:
>>>>>>>>
>>>>>>>> - WriteBuilder<InternalRow>
>>>>>>>> - DataWriteBuilder<InternalRow>
>>>>>>>> - EqualityDeleteWriteBuilder<InternalRow>
>>>>>>>> - PositionDeleteWriteBuilder<InternalRow>
>>>>>>>>
>>>>>>>> *API Design Considerations*
>>>>>>>> There is an argument that the two WriteBuilder methods provided by FileFormat are essentially the same, differing only in the writerFunction. While this is technically correct for current implementations, I believe the API should clearly distinguish between the two writer types to highlight the differences.
>>>>>>>>
>>>>>>>> *Discussed Approaches*
>>>>>>>>
>>>>>>>> *0. Two Explicit Methods on FormatModel* (removed based on previous comments, but I personally still prefer this)
>>>>>>>>
>>>>>>>>   WriteBuilder<InternalRow> writeBuilder(OutputFile outputFile);
>>>>>>>>   WriteBuilder<PositionDelete<InternalRow>> positionDeleteWriteBuilder(OutputFile outputFile);
>>>>>>>>
>>>>>>>> Pros: Clear separation of responsibilities
>>>>>>>>
>>>>>>>> *1. One Builder + One Converter*
>>>>>>>>
>>>>>>>>   WriteBuilder<InternalRow> writeBuilder(OutputFile outputFile);
>>>>>>>>   Function<PositionDelete<D>, D> positionDeleteConverter(Schema schema);
>>>>>>>>
>>>>>>>> Pros: Keeps the interface compact
>>>>>>>> Cons: Requires additional documentation and understanding of why the conversion logic is needed
>>>>>>>>
>>>>>>>> *2. Single Method with Javadoc Clarification* (most similar to the current approach)
>>>>>>>>
>>>>>>>>   WriteBuilder writeBuilder(OutputFile outputFile);
>>>>>>>>
>>>>>>>> Pros: Minimalistic
>>>>>>>> Cons: Least explicit; relies entirely on documentation
>>>>>>>>
>>>>>>>> *2/b. Single Builder with Type Parameter* (based on Russell's suggestion)
>>>>>>>>
>>>>>>>>   WriteBuilder writeBuilder(OutputFile outputFile);
>>>>>>>>   // Usage: builder.build(Class<D> inputType)
>>>>>>>>
>>>>>>>> Pros: Flexible
>>>>>>>> Cons: Relies on documentation to clarify the available input types
>>>>>>>>
>>>>>>>> *Bonus*
>>>>>>>> Options 0 and 1 make it easier to phase out PositionDelete filtering once V2 tables are deprecated.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Peter
>>>>>>>>
>>>>>>>> Péter Váry <[email protected]> ezt írta (időpont: 2025. szept. 11., Cs, 18:36):
>>>>>>>>
>>>>>>>>> > Wouldn't PositionDelete<InternalRow> also be an InternalRow in this example? I think that's what I'm confused about.
>>>>>>>>>
>>>>>>>>> With the *second approach*, the WriteBuilder doesn’t need to handle PositionDelete objects directly. The conversion layer takes care of that, so the WriteBuilder only needs to work with InternalRow.
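>>>>>>>>> Conceptually, something like this (a sketch, not actual code from the PR):
>>>>>>>>>
>>>>>>>>>   // The engine-side conversion layer turns each PositionDelete into a plain
>>>>>>>>>   // row before handing it to the row appender:
>>>>>>>>>   FileAppender<InternalRow> rows = writeBuilder.build();
>>>>>>>>>   Function<PositionDelete<InternalRow>, InternalRow> convert =
>>>>>>>>>       model.positionDeleteConverter(deleteSchema);
>>>>>>>>>   // writing a delete `d` then amounts to: rows.add(convert.apply(d))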
>>>>>>>>> With the *first approach*, we shift that responsibility to the WriteBuilder, which then has to support both InternalRow and PositionDelete<InternalRow>.
>>>>>>>>>
>>>>>>>>> In both cases, the FormatModelRegistry API will still expose the more concrete types (PositionDelete / InternalRow). However, under the *second approach*, the lower-level API only needs to handle InternalRow, simplifying its interface.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>>> Russell Spitzer <[email protected]> ezt írta (időpont: 2025. szept. 11., Cs, 17:12):
>>>>>>>>>
>>>>>>>>>> Wouldn't PositionDelete<InternalRow> also be an InternalRow in this example? I think that's what I'm confused about.
>>>>>>>>>>
>>>>>>>>>> On Thu, Sep 11, 2025 at 5:35 AM Péter Váry <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks, Russell, for taking a look at this!
>>>>>>>>>>>
>>>>>>>>>>> We need to expose four methods on the user-facing API (FormatModelRegistry):
>>>>>>>>>>>
>>>>>>>>>>> 1. *writeBuilder* – for writing arbitrary files without Iceberg metadata. In the Iceberg codebase, this is exposed via FlinkAppenderFactory and the GenericAppenderFactory for creating FileAppender<RowData> and FileAppender<Record> only.
>>>>>>>>>>> 2. *dataWriteBuilder* – for creating and collecting metadata for Iceberg DataFiles.
>>>>>>>>>>> 3. *equalityDeleteWriteBuilder* – for creating and collecting metadata for Iceberg EqualityDeleteFiles.
>>>>>>>>>>> 4. *positionDeleteWriteBuilder* – for creating and collecting metadata for Iceberg PositionDeleteFiles.
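>>>>>>>>>>> In (simplified) signatures, that is roughly:
>>>>>>>>>>>
>>>>>>>>>>>   // Sketch only; parameter lists simplified.
>>>>>>>>>>>   <D> WriteBuilder writeBuilder(FileFormat format, Class<D> type, OutputFile file);
>>>>>>>>>>>   <D> DataWriteBuilder dataWriteBuilder(FileFormat format, Class<D> type, OutputFile file);
>>>>>>>>>>>   <D> EqualityDeleteWriteBuilder equalityDeleteWriteBuilder(FileFormat format, Class<D> type, OutputFile file);
>>>>>>>>>>>   <D> PositionDeleteWriteBuilder positionDeleteWriteBuilder(FileFormat format, Class<D> type, OutputFile file);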
>>>>>>>>>>> We’d like to implement all four using a single WriteBuilder created by the FormatModels.
>>>>>>>>>>>
>>>>>>>>>>> Your suggestion is a good one—it helps formalize the requirements for the build method and also surfaces an important design question:
>>>>>>>>>>>
>>>>>>>>>>> *Who should be responsible for handling the differences between normal rows (InternalRow) and position deletes (PositionDelete<InternalRow>)?*
>>>>>>>>>>>
>>>>>>>>>>> - Should we have a more complex WriteBuilder class that can create both DataFileAppender and PositionDeleteAppender?
>>>>>>>>>>> - Or should we push this responsibility to the engine-specific code, where we already have some logic (e.g., pathTransformFunc) needed by each engine to create the PositionDeleteAppender?
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Peter
>>>>>>>>>>>
>>>>>>>>>>> Russell Spitzer <[email protected]> ezt írta (időpont: 2025. szept. 11., Cs, 0:11):
>>>>>>>>>>>
>>>>>>>>>>>> I'm a little confused here, I think Ryan mentioned this in the comment here: https://github.com/apache/iceberg/pull/12774/files#r2254967177
>>>>>>>>>>>>
>>>>>>>>>>>> From my understanding there are two options?
>>>>>>>>>>>>
>>>>>>>>>>>> 1) We either are producing FormatModels that take a generic row type D and produce writers that all take D and write files.
>>>>>>>>>>>>
>>>>>>>>>>>> 2) We are creating Iceberg-model-specific writers that take DataFile, PositionDeleteFile, EqualityDeleteFile etc ... and write files.
>>>>>>>>>>>>
>>>>>>>>>>>> The PositionDelete converter issue seems to stem from attempting to do both model 1 (being very generic) and 2, wanting special code to deal with PositionDeleteFile<R> objects.
>>>>>>>>>>>>
>>>>>>>>>>>> It looks like the code in #12774 is mostly doing model 1, but we are trying to add in a specific converter for 2?
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe I'm totally lost here but I was assuming we would do something a little scala-y like
>>>>>>>>>>>>
>>>>>>>>>>>>   public <T> FileAppender<T> build(Class<T> type) {
>>>>>>>>>>>>     if (type == DataFile.class) return (FileAppender<T>) new DataFileAppender();
>>>>>>>>>>>>     if (type == DeleteFile.class) return (FileAppender<T>) new DeleteFileAppender();
>>>>>>>>>>>>     // ...
>>>>>>>>>>>>   }
>>>>>>>>>>>>
>>>>>>>>>>>> So that we only register a single signature, and if a writer specific implementation needs to do something special it can? I'm trying to catch back up to speed on this PR so it may help to do a quick summary of the current state and intent. (At least for me)
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Sep 9, 2025 at 3:42 AM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Renjie,
>>>>>>>>>>>>> Thanks for taking a look!
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let me clarify a few points:
>>>>>>>>>>>>> - The converter API is only required for writing position delete files for V2 tables.
>>>>>>>>>>>>> - Currently, there are no plans to support vectorized writing via the Java API.
>>>>>>>>>>>>> - Even if we decide to support vectorized writes, I don't think we would like to implement it for positional deletes, which are deprecated in the new spec.
>>>>>>>>>>>>> - Also, once the positional deletes which contain the deleted rows are deprecated (as planned), the conversion of the position deletes with only file name and position would be trivial, even for the vectorized writes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So from my perspective, the converter method exists purely for backward compatibility, and we intend to remove it as soon as possible. Sacrificing good practices for the sake of a deprecated feature doesn’t seem worthwhile to me.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>
>>>>>>>>>>>>> Renjie Liu <[email protected]> ezt írta (időpont: 2025. szept. 8., H, 12:34):
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi, Peter:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I would vote for the first approach. In spite of the compromises described, the api is still cleaner. Also I think there are some problems with the converter api. For example, for vectorized implementations such as Comet, which accepts columnar batches rather than rows, the converter method would make things more complicated.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Sat, Aug 30, 2025 at 2:49 PM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I’ve initiated a discussion thread regarding the deprecation of Position Deletes containing row data.
>>>>>>>>>>>>>>> You can follow it here: https://lists.apache.org/thread/8jw6pb2vq3ghmdqf1yvy8n5n6gg1fq5s
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We can proceed with the discussion about the native reader/writer deprecation once we have decided on the final API, as the chosen design may influence our approach.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Since then, one more question has come up - hopefully the last:
>>>>>>>>>>>>>>> *How should we handle Position Delete Writers?*
>>>>>>>>>>>>>>> The File Format API should return builders for either rows or PositionDelete objects. Currently the method `WriteBuilder.createWriterFunc(Function<MessageType, ParquetValueWriter<?>>)` defines the accepted input parameters for the writer. Users are responsible for ensuring that the writer function and the return type of the `WriteBuilder.build()` are compatible. In the new API, we no longer expose writer functions. We still expose FileContent, since writer configurations vary by content type, but we don’t expose the types.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> There are two proposals for handling types for the WriteBuilders:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> 1. *Implicit Type Definition via FileContent* - the builder parameter for FileContent would implicitly define the input type for the writer returned by build(), or
>>>>>>>>>>>>>>> 2. *Engine level conversion* - Engines would convert PositionDelete objects to their native types.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In code:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - In the 1st proposal, the FormatModel.writeBuilder(OutputFile outputFile) can return anything:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   WriteBuilder builder =
>>>>>>>>>>>>>>>       FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile);
>>>>>>>>>>>>>>>   FileAppender<InternalRow> appender =
>>>>>>>>>>>>>>>       builder
>>>>>>>>>>>>>>>           .schema(table.schema())
>>>>>>>>>>>>>>>           .content(FileContent.DATA)
>>>>>>>>>>>>>>>           ....
>>>>>>>>>>>>>>>           .build();
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   // Exposed, but FormatModelRegistry.positionDeleteWriteBuilder should be used instead
>>>>>>>>>>>>>>>   WriteBuilder builder =
>>>>>>>>>>>>>>>       FormatModelRegistry.writeBuilder(PARQUET, InternalRow.class, outputFile);
>>>>>>>>>>>>>>>   FileAppender<PositionDelete<InternalRow>> appender =
>>>>>>>>>>>>>>>       builder
>>>>>>>>>>>>>>>           .schema(table.schema())
>>>>>>>>>>>>>>>           .content(FileContent.POSITION_DELETES)
>>>>>>>>>>>>>>>           ....
>>>>>>>>>>>>>>>           .build();
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - In the 2nd proposal, the FormatModel needs another method:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   Function<PositionDelete<D>, D> positionDeleteConverter(Schema schema);
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   example implementation:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   return delete -> {
>>>>>>>>>>>>>>>     deleteRecord.update(0, UTF8String.fromString(delete.path().toString()));
>>>>>>>>>>>>>>>     deleteRecord.update(1, delete.pos());
>>>>>>>>>>>>>>>     deleteRecord.update(2, delete.row());
>>>>>>>>>>>>>>>     return deleteRecord;
>>>>>>>>>>>>>>>   };
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>   // Content is only used for writer property configuration
>>>>>>>>>>>>>>>   WriteBuilder<InternalRow> builder = sparkFormatModel.writeBuilder(outputFile);
>>>>>>>>>>>>>>>   FileAppender<InternalRow> appender =
>>>>>>>>>>>>>>>       builder
>>>>>>>>>>>>>>>           .schema(table.schema())
>>>>>>>>>>>>>>>           .content(FileContent.DATA)
>>>>>>>>>>>>>>>           ....
>>>>>>>>>>>>>>>           .build();
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Drawbacks:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> - Proposal 1:
>>>>>>>>>>>>>>>   - Type checking for the FileAppenders occurs only at runtime, so user errors surface late.
>>>>>>>>>>>>>>>   - The File Format specification must clearly specify which builder type corresponds to which file content parameter—generics would offer better clarity.
>>>>>>>>>>>>>>>   - Inconsistent patterns between WriteBuilder and ReadBuilder, as the latter can define output types via generics.
>>>>>>>>>>>>>>> - Proposal 2:
>>>>>>>>>>>>>>>   - Requires FormatModels to implement a converter method to transform PositionDelete<InternalRow> into InternalRow.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Since we deprecated writing position delete files in the V3 spec, this extra method in the 2nd proposal will be deprecated too. As a result, in the long run, we will have a nice, clean API.
>>>>>>>>>>>>>>> OTOH, if we accept the compromise described in the 1st proposal, the results of our decision will remain, even when the functions are removed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Looking forward to your thoughts.
>>>>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Aug 14, 2025, 14:12 Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi Team,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> During yesterday’s community sync, we discussed the current state of the File Format API proposal and identified two key questions that require input from the broader community:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *1. Dropping support for Position Delete files with Row Data*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The current Iceberg V2 spec [1] defines two types of position delete files:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - Files that store only the file name and row position.
>>>>>>>>>>>>>>>> - Files that also store the deleted row data.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Although this feature is defined in the spec and some tests exist in the Iceberg codebase, we’re not aware of any actual implementation using the second type (with row data). Supporting V2 table writing via the new File Format API would be simpler if we dropped support for this feature. If you know of any use case or reason to retain support for position deletes with row data, please let us know.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> *2. Deprecating Native File Format Readers/Writers in the API*
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The current API contains format-specific readers/writers for Parquet, Avro, and ORC. With the introduction of the InternalData and File Format APIs, Iceberg users can now write files using:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> - InternalData API for metadata files (manifest, manifest list, partition stats).
>>>>>>>>>>>>>>>> - File Format API for data and delete files.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I propose we deprecate the original format-specific writers and guide users to use the new APIs based on the target file type. If you’re aware of any use cases that still require the original format-specific writers, please share them.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> [1] - Position Delete File Spec: https://iceberg.apache.org/spec/?h=delete#position-delete-files
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Péter Váry <[email protected]> ezt írta (időpont: 2025. júl. 22., K, 16:09):
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I also put together a solution where the engine-specific format transformation is separated from the writer, and the engines need to take care of it separately.
>>>>>>>>>>>>>>>>> This is somewhat complicated on the implementation side (see: [RowDataTransformer](https://github.com/apache/iceberg/pull/12298/files#diff-562fa4cc369c908a157f59a9235fd3f389096451e7901686fba37c87b53dee08) and [InternalRowTransformer](https://github.com/apache/iceberg/pull/12298/files#diff-546f9dc30e3207d1d2bc0a2722976b55f5a04dcf85a22855e4f400500c317140)), but it simplifies the API.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> @rdblue: Please check the proposed solution. I think this is what you have suggested.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Péter Váry <[email protected]> ezt írta (időpont: 2025. jún. 30., H, 18:42):
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> During the PR review [1], we began exploring what we could use as an intermediate layer to reduce the need for engines and file formats to implement the full matrix of file format - object model conversions.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> To support this discussion, I’ve created and run a set of performance benchmarks and compiled a document outlining the potential benefits and trade-offs [2].
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Feedback is welcome; feel free to comment on the document, the PR, or directly in this thread.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> [1] - PR discussion - https://github.com/apache/iceberg/pull/12774#discussion_r2093626096
>>>>>>>>>>>>>>>>>> [2] - File Format and engine object model transformation performance - https://docs.google.com/document/d/1GdA8IowKMtS3QVdm8s-0X-ZRYetcHv2bhQ9mrSd3fd4
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Péter Váry <[email protected]> ezt írta (időpont: 2025. máj. 7., Sze, 13:15):
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Hi everyone,
>>>>>>>>>>>>>>>>>>> The proposed API part is reviewed and ready to go. See: https://github.com/apache/iceberg/pull/12774
>>>>>>>>>>>>>>>>>>> Thanks to everyone who reviewed it already!
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Many of you wanted to review, but I know that the time constraints are there for everyone. I still very much would like to hear your voices, so I will not merge the PR this week. Please review it if you can.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>> Peter
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Péter Váry <[email protected]> ezt írta (időpont: 2025. ápr. 16., Sze, 7:02):
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi Renjie,
>>>>>>>>>>>>>>>>>>>> The first one for the proposed new API is here: https://github.com/apache/iceberg/pull/12774
>>>>>>>>>>>>>>>>>>>> Thanks, Peter
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Wed, Apr 16, 2025, 05:40 Renjie Liu <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Hi, Peter:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Thanks for the effort. I totally agree with splitting them into smaller PRs to move forward.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I'm quite interested in this topic, so please ping me on those split PRs and I'll help review.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, Apr 14, 2025 at 11:22 PM Jean-Baptiste Onofré <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Hi Peter
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Awesome! Thank you so much!
>>>>>>>>>>>>>>>>>>>>>> I will do a new pass.
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>> JB
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, Apr 11, 2025 at 3:48 PM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>> > Hi JB,
>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>> > Separated out the proposed interfaces to a new PR: https://github.com/apache/iceberg/pull/12774.
>>>>>>>>>>>>>>>>>>>>>> > Reviewers can check that out if they are only interested in how the new API would look.
>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>>>>>>>>>>>> > Peter
>>>>>>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>>>>>>> > Jean-Baptiste Onofré <[email protected]> ezt írta (időpont: 2025. ápr. 10., Cs, 18:25):
>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>> >> Hi Peter
>>>>>>>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>>>>>>>> >> Thanks for the ping about the PR.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>> >> Maybe, to facilitate the review and move forward >>>>>>>>>>>>>>>>>>>>>> faster, we should >>>>>>>>>>>>>>>>>>>>>> >> split the PR in smaller PRs: >>>>>>>>>>>>>>>>>>>>>> >> - one with the interfaces (ReadBuilder, >>>>>>>>>>>>>>>>>>>>>> AppenderBuilder, ObjectModel, >>>>>>>>>>>>>>>>>>>>>> >> AppenderBuilder, DataWriterBuilder, ...) >>>>>>>>>>>>>>>>>>>>>> >> - one for each file providers (Parquet, Avro, ORC) >>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>> >> Thoughts ? I can help on the split if needed. >>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>> >> Regards >>>>>>>>>>>>>>>>>>>>>> >> JB >>>>>>>>>>>>>>>>>>>>>> >> >>>>>>>>>>>>>>>>>>>>>> >> On Thu, Apr 10, 2025 at 5:16 AM Péter Váry < >>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>>>>> >> > Since the 1.9.0 release candidate has been >>>>>>>>>>>>>>>>>>>>>> created, I would like to resurrect this PR: >>>>>>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12298 to >>>>>>>>>>>>>>>>>>>>>> ensure that we have as long a testing period as possible >>>>>>>>>>>>>>>>>>>>>> for it. >>>>>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>>>>> >> > To recap, here is what the PR does after the >>>>>>>>>>>>>>>>>>>>>> review rounds: >>>>>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>>>>> >> > Created 3 interface classes which are >>>>>>>>>>>>>>>>>>>>>> implemented by the file formats: >>>>>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>>>>> >> > ReadBuilder - Builder for reading data from data >>>>>>>>>>>>>>>>>>>>>> files >>>>>>>>>>>>>>>>>>>>>> >> > AppenderBuilder - Builder for writing data to >>>>>>>>>>>>>>>>>>>>>> data files >>>>>>>>>>>>>>>>>>>>>> >> > ObjectModel - Providing ReadBuilders, and >>>>>>>>>>>>>>>>>>>>>> AppenderBuilders for the specific data file format and >>>>>>>>>>>>>>>>>>>>>> object model pair >>>>>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>>>>> >> > Updated the Parquet, Avro, ORC implementation >>>>>>>>>>>>>>>>>>>>>> for this interfaces, and deprecated the old >>>>>>>>>>>>>>>>>>>>>> reader/writer APIs >>>>>>>>>>>>>>>>>>>>>> >> > Created interface classes which will be used by >>>>>>>>>>>>>>>>>>>>>> the actual readers/writers of the data files: >>>>>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>>>>> >> > AppenderBuilder - Builder for writing a file >>>>>>>>>>>>>>>>>>>>>> >> > DataWriterBuilder - Builder for generating a >>>>>>>>>>>>>>>>>>>>>> data file >>>>>>>>>>>>>>>>>>>>>> >> > PositionDeleteWriterBuilder - Builder for >>>>>>>>>>>>>>>>>>>>>> generating a position delete file >>>>>>>>>>>>>>>>>>>>>> >> > EqualityDeleteWriterBuilder - Builder for >>>>>>>>>>>>>>>>>>>>>> generating an equality delete file >>>>>>>>>>>>>>>>>>>>>> >> > No ReadBuilder here - the file format reader >>>>>>>>>>>>>>>>>>>>>> builder is reused >>>>>>>>>>>>>>>>>>>>>> >> > >>>>>>>>>>>>>>>>>>>>>> >> > Created a WriterBuilder class which implements >>>>>>>>>>>>>>>>>>>>>> the interfaces above >>>>>>>>>>>>>>>>>>>>>> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) >>>>>>>>>>>>>>>>>>>>>> based on a provided file format specific AppenderBuilder >>>>>>>>>>>>>>>>>>>>>> >> > Created an ObjectModelRegistry which stores the >>>>>>>>>>>>>>>>>>>>>> available ObjectModels, and engines and users could >>>>>>>>>>>>>>>>>>>>>> request the readers >>>>>>>>>>>>>>>>>>>>>> (ReadBuilder) and writers >>>>>>>>>>>>>>>>>>>>>> (AppenderBuilder/DataWriterBuilder/PositionDeleteWriterBuilder/EqualityDeleteWriterBuilder) >>>>>>>>>>>>>>>>>>>>>> from. 
>>>>>>>>>>>>>>>>>>>>>> >> > Created the appropriate ObjectModels:
>>>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>>>> >> > - GenericObjectModels - for reading and writing Iceberg Records
>>>>>>>>>>>>>>>>>>>>>> >> > - SparkObjectModels - for reading (vectorized and non-vectorized) and writing Spark InternalRow/ColumnarBatch objects
>>>>>>>>>>>>>>>>>>>>>> >> > - FlinkObjectModels - for reading and writing Flink RowData objects
>>>>>>>>>>>>>>>>>>>>>> >> > - An Arrow object model is also registered for vectorized reads of Parquet files into Arrow ColumnarBatch objects
>>>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>>>> >> > Updated the production code where the reading and writing happens to use the ObjectModelRegistry and the new reader/writer interfaces to access data files.
>>>>>>>>>>>>>>>>>>>>>> >> > Kept the testing code intact to ensure that the new API/code is not breaking anything.
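>>>>>>>>>>>>>>>>>>>>>> >> > As a quick mental model, the ObjectModel piece looks roughly like this (a simplified sketch; see the PR for the real definitions):
>>>>>>>>>>>>>>>>>>>>>> >> >
>>>>>>>>>>>>>>>>>>>>>> >> >   interface ObjectModel {
>>>>>>>>>>>>>>>>>>>>>> >> >     FileFormat format();                      // Parquet, Avro or ORC
>>>>>>>>>>>>>>>>>>>>>> >> >     String name();                            // object model name, e.g. "spark"
>>>>>>>>>>>>>>>>>>>>>> >> >     ReadBuilder readBuilder(InputFile file);  // reads into the model's row type
>>>>>>>>>>>>>>>>>>>>>> >> >     AppenderBuilder appenderBuilder(OutputFile file);
>>>>>>>>>>>>>>>>>>>>>> >> >   }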
>>>>>>>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>>>>>>> >> >>> About the delete filter: >>>>>>>>>>>>>>>>>>>>>> >> >>> Based on the suggestions on the PR, I have >>>>>>>>>>>>>>>>>>>>>> moved the delete filter out from the main API. Created a >>>>>>>>>>>>>>>>>>>>>> `SupportsDeleteFilter` interface for it which would >>>>>>>>>>>>>>>>>>>>>> allow pushing down to >>>>>>>>>>>>>>>>>>>>>> the filter to Parquet vectorized readers in Spark, as >>>>>>>>>>>>>>>>>>>>>> this is the only >>>>>>>>>>>>>>>>>>>>>> place where we currently implemented this feature. >>>>>>>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>>>>>>> >> >>> >>>>>>>>>>>>>>>>>>>>>> >> >>> Renjie Liu <[email protected]> ezt írta >>>>>>>>>>>>>>>>>>>>>> (időpont: 2025. márc. 21., P, 14:11): >>>>>>>>>>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>>>>>>>>>> >> >>>> Hi, Peter: >>>>>>>>>>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>>>>>>>>>> >> >>>> Thanks for the effort on this. >>>>>>>>>>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>>>>>>>>>> >> >>>> 1. File format filters >>>>>>>>>>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>>>>>>>>>> >> >>>> Do the filters include both filter >>>>>>>>>>>>>>>>>>>>>> expressions from both user query and delete filter? >>>>>>>>>>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>>>>>>>>>> >> >>>> For filters from user query, I agree with you >>>>>>>>>>>>>>>>>>>>>> that we should keep the current behavior. >>>>>>>>>>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>>>>>>>>>> >> >>>> For delete filters associated with data >>>>>>>>>>>>>>>>>>>>>> files, at first I thought file format readers should not >>>>>>>>>>>>>>>>>>>>>> care about this. >>>>>>>>>>>>>>>>>>>>>> But now I realized that maybe we need to also push it to >>>>>>>>>>>>>>>>>>>>>> file reader, this >>>>>>>>>>>>>>>>>>>>>> is useful when `IS_DELETED` metadata column is not >>>>>>>>>>>>>>>>>>>>>> necessary and we could >>>>>>>>>>>>>>>>>>>>>> use these filters (position deletes, etc) to further >>>>>>>>>>>>>>>>>>>>>> prune data. >>>>>>>>>>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>>>>>>>>>> >> >>>> But anyway, I agree that we could postpone it >>>>>>>>>>>>>>>>>>>>>> in follow up pr. >>>>>>>>>>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>>>>>>>>>> >> >>>> 2. Batch size configuration >>>>>>>>>>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>>>>>>>>>> >> >>>> I'm leaning toward option 2. >>>>>>>>>>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>>>>>>>>>> >> >>>> 3. Spark configuration >>>>>>>>>>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>>>>>>>>>> >> >>>> I'm leaning towards using different >>>>>>>>>>>>>>>>>>>>>> configuration objects. >>>>>>>>>>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>>>>>>>>>> >> >>>> >>>>>>>>>>>>>>>>>>>>>> >> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry < >>>>>>>>>>>>>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>>>>>>>>>>>>> >> >>>>> >>>>>>>>>>>>>>>>>>>>>> >> >>>>> Hi Team, >>>>>>>>>>>>>>>>>>>>>> >> >>>>> Thanks everyone for the reviews on >>>>>>>>>>>>>>>>>>>>>> https://github.com/apache/iceberg/pull/12298! >>>>>>>>>>>>>>>>>>>>>> >> >>>>> I have addressed most of comments, but a few >>>>>>>>>>>>>>>>>>>>>> questions still remain which might merit a bit wider >>>>>>>>>>>>>>>>>>>>>> audience: >>>>>>>>>>>>>>>>>>>>>> >> >>>>> >>>>>>>>>>>>>>>>>>>>>> >> >>>>> We should decide on the expected filtering >>>>>>>>>>>>>>>>>>>>>> behavior when the filters are pushed down to the >>>>>>>>>>>>>>>>>>>>>> readers. Currently the >>>>>>>>>>>>>>>>>>>>>> filters are applied as best effort for the file format >>>>>>>>>>>>>>>>>>>>>> readers. Some >>>>>>>>>>>>>>>>>>>>>> readers (Avro) just skip them altogether. 
>>>>>>>>>>>>>>>>>>>>>> >> >>>
>>>>>>>>>>>>>>>>>>>>>> >> >>> Renjie Liu <[email protected]> ezt írta (időpont: 2025. márc. 21., P, 14:11):
>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>> Hi, Peter:
>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>> Thanks for the effort on this.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>> 1. File format filters
>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>> Do the filters include filter expressions from both the user query and the delete filter?
>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>> For filters from the user query, I agree with you that we should keep the current behavior.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>> For delete filters associated with data files, at first I thought file format readers should not care about this. But now I realize that maybe we need to also push them to the file reader; this is useful when the `IS_DELETED` metadata column is not necessary and we could use these filters (position deletes, etc.) to further prune data.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>> But anyway, I agree that we could postpone it to a follow-up PR.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>> 2. Batch size configuration
>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>> I'm leaning toward option 2.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>> 3. Spark configuration
>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>> I'm leaning towards using different configuration objects.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>> On Thu, Mar 20, 2025 at 10:23 PM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>> Hi Team,
>>>>>>>>>>>>>>>>>>>>>> >> >>>>> Thanks everyone for the reviews on https://github.com/apache/iceberg/pull/12298!
>>>>>>>>>>>>>>>>>>>>>> >> >>>>> I have addressed most of the comments, but a few questions still remain which might merit a bit wider audience:
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>> - We should decide on the expected filtering behavior when the filters are pushed down to the readers. Currently the filters are applied as best effort by the file format readers. Some readers (Avro) just skip them altogether. There was a suggestion on the PR that we might enforce stricter requirements, so the readers either reject part of the filters, or apply them fully.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>> - Batch sizes are currently parameters of the reader builders, which can be set for non-vectorized readers too, which could be confusing.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>> - Currently the Spark batch reader uses different configuration objects for ParquetBatchReadConf and OrcBatchReadConf, as requested by the reviewers of the Comet PR. There was a suggestion on the current PR to use a common configuration instead.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>> I would be interested in hearing your thoughts about these topics.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>> My current take:
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>> - File format filters: I am leaning towards keeping the current lenient behavior. Especially since Bloom filters are not able to do a full filtering, and are often used as a way to filter out unwanted records. Another option would be to implement a secondary filtering inside the file formats themselves, which I think would cause extra complexity and possible code duplication. Whatever the decision here, I would suggest moving this out to a next PR, as the current changeset is big enough as it is.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>> - Batch size configuration: Currently this is the only property which is different in the batch readers and the non-vectorized readers. I see 3 possible solutions:
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>   - Create different builders for vectorized and non-vectorized reads - I don't think the current solution is confusing enough to be worth the extra class.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>   - We could put this into the reader configuration property set - This could work, but it would "hide" a configuration option which is valid for both Parquet and ORC readers.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>   - We could keep things as they are now - I would choose this one, but I don't have a strong opinion here.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>> - Spark configuration: TBH, I'm open to both solutions and happy to move in the direction the community decides on.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>> >> >>>>> Peter
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>> Jean-Baptiste Onofré <[email protected]> ezt írta (időpont: 2025. márc. 14., P, 16:31):
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> Hi Peter
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> Thanks for the update. I will do a new pass on the PR.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> Regards
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> JB
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> On Thu, Mar 13, 2025 at 1:16 PM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> > Hi Team,
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> > I have rebased the File Format API proposal (https://github.com/apache/iceberg/pull/12298) to include the new changes needed for the Variant types. I would love to hear your feedback, especially Dan and Ryan, as you were the most active during our discussions. If I can help in any way to make the review easier, please let me know.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> > Thanks,
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> > Peter
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> > Péter Váry <[email protected]> ezt írta (időpont: 2025. febr. 28., P, 17:50):
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >> Hi everyone,
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >> Thanks for all of the actionable, relevant feedback on the PR (https://github.com/apache/iceberg/pull/12298).
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >> Updated the code to address most of them. Please check if you agree with the general approach.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >> If there is a consensus about the general approach, I could separate out the PR into smaller pieces so we can have an easier time reviewing and merging those step-by-step.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >> Thanks,
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >> Peter
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >> Jean-Baptiste Onofré <[email protected]> ezt írta (időpont: 2025. febr. 20., Cs, 14:14):
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> Hi Peter
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> Sorry for the late reply on this.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> I did a pass on the proposal; it's very interesting and well written.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> I like the DataFile API, and it is definitely worth discussing all together.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> Maybe we can schedule a specific meeting to discuss the DataFile API?
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> Thoughts?
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> Regards
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> JB
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>>
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> On Tue, Feb 11, 2025 at 5:46 PM Péter Váry <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Hi Team,
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > As mentioned earlier on our Community Sync, I am exploring the possibility of defining a FileFormat API for accessing different file formats. I have put together a proposal based on my findings.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > -------------------
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Iceberg currently supports 3 different file formats: Avro, Parquet, ORC. With the introduction of the Iceberg V3 specification, many new features are being added to Iceberg. Some of these features, like new column types and default values, require changes at the file format level. The changes are added by individual developers with different focus on the different file formats. As a result, not all of the features are available for every supported file format.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Also, there are emerging file formats like Vortex [1] or Lance [2] which, either by specialization or by applying newer research results, could provide better alternatives for certain use-cases like random access to data, or storing ML models.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > -------------------
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Please check the detailed proposal [3] and the google document [4], and comment there or reply on the dev list if you have any suggestions.
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Thanks,
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > Peter
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> >
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [1] - https://github.com/spiraldb/vortex
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [2] - https://lancedb.github.io/lance/
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [3] - https://github.com/apache/iceberg/issues/12225
>>>>>>>>>>>>>>>>>>>>>> >> >>>>>> >>> > [4] - https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds
