Hi all,

After updating the document, I didn't get much additional feedback. As the
next step, I submitted PRs for the format change and the reference
implementations:
 - parquet-format: https://github.com/apache/parquet-format/pull/585
 - parquet-java: https://github.com/apache/parquet-java/pull/3608
 - arrow-rs: https://github.com/apache/arrow-rs/pull/10109

Looking forward to feedback on these changes as well!

Thanks,
Burak

On Fri, Jun 5, 2026 at 9:01 AM Daniel Weeks <[email protected]> wrote:

> Hey everyone,
>
> I had an action item to follow up and provide more context based on the
> short discussion during the sync (some is recap of what Burak already said
> above).
>
> I don't seem to have access to the video, so I can't provide a timestamped
> link, but can share the high-level takeaways:
>
> There was a fair bit of discussion back and forth in the doc around some of
> the fields (especially metadata and content_type).  In the end, what I feel
> resonated most with everyone is that if we're creating new primitive types,
> we should define them as narrowly as possible (don't include a bunch of
> extra fields with hypothetical use cases).  We also looked across other
> implementations, and while there was some variation, Burak's updated
> proposal seems consistent with most of the representations.
>
> If users want to include additional information, it makes more sense to
> carry that information in neighboring fields as it quickly shifts to more
> specific use cases.
>
> Thanks Burak for the quick turnaround!
>
> -Dan
>
> On Fri, Jun 5, 2026 at 8:37 AM Micah Kornfield <[email protected]>
> wrote:
>
> > >
> > > If there are no strong arguments against the current proposal, may I
> > follow
> > > up with a pull request to apache/parquet-format
> > > <https://github.com/apache/parquet-format>? What would be the next
> > steps?
> > > Or would I need to start a vote first?
> >
> > Hi Burak,
> > New feature steps are listed in the format contributors guide [1].  If
> > there are no objections we can move to step 2 (completeness): A PR against
> > parquet-format and updates to the reference implementations (hopefully
> > these are pretty trivial for this case).
> >
> > I think we can probably start the PRs next week to give people a chance to
> > digest the current proposal and speak up if there are hard objections.
> >
> > Cheers,
> > Micah
> >
> >
> > [1]
> >
> >
> https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#additionschanges-to-the-format
> >
> >
> > On Fri, Jun 5, 2026 at 8:25 AM Burak Yavuz <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > Thank you all for the great discussion on the document! I made another
> > pass
> > > on the doc. During the Parquet sync, there was alignment around keeping
> > the
> > > field as simple and minimalistic as possible. I updated the doc in that
> > way
> > > (removed content_type from the field) to ensure that the fields
> available
> > > are all functional fields for correctly reading a file.
> > >
> > > Please let me know if you have more feedback!
> > >
> > > If there are no strong arguments against the current proposal, may I
> > follow
> > > up with a pull request to apache/parquet-format
> > > <https://github.com/apache/parquet-format>? What would be the next
> > steps?
> > > Or would I need to start a vote first?
> > >
> > > Thanks,
> > > Burak
> > >
> > > On Wed, May 27, 2026 at 10:31 AM Burak Yavuz <[email protected]> wrote:
> > >
> > > > Hello all,
> > > >
> > > > I'm sharing the design document for File Type here
> > > > <
> > >
> >
> https://docs.google.com/document/d/1AiwrstqkwkBoOZqgOkm9JGwSMcNeHyLR7EEj1CVqpZQ/edit?usp=sharing
> > > >.
> > > > Please let me know what you think!
> > > > Wanted to thank Micah Kornfield, Divjot Arora, and Daniel Weeks for
> > their
> > > > feedback working on this document.
> > > >
> > > > Steve, regarding your questions, my thoughts are inline:
> > > > >    1. small inline blob somewhere within the parquet file (|data| =
> > > > bytes)
> > > > We have a lot of design options here. Does it need to be part of
> > "File"?
> > > > That's debatable. Engines/table formats can decide to coalesce a File
> > > > reference with an inline value when available for example. Carrying
> an
> > > > inline binary blob may make analytics workloads more inefficient,
> > > > specifically if you have to carry them around as baggage through
> sorts
> > > and
> > > > shuffles.
> > > >
> > > > > 2. Medium blob: data stored range limited within a larger file
> > (|data|
> > > =
> > > >    kilo to megabytes)
> > > > > Again, it can be up to a table format to decide to create sidecar files,
> > > where
> > > > the sidecar may be built on top of these file references.
> > > >
> > > > > 3. completely separate file (GB +), or somehow the data lifecycle
> > isn't
> > > >    managed with parquet file.
> > > >
> > > > This file reference solves this problem as well.
> > > >
> > > > > lifecycle management you don't want to discover that your photo
> > > >    collection has been deleted by accident, and a data rewrite such
> as
> > > >    applying DVs shouldn't mandate rebuilding of external binary
> files.
> > > > > security, esp when providing credential access to tables.
> Credential
> > > >    providers would also need to provide file access, so have to know
> > > which
> > > >    binary files are associated with parquet files, somehow.
> > > >
> > > > These all sound like problems that should be handled at different
> > layers
> > > > of:
> > > >   - table format
> > > >   - engine
> > > >   - catalog
> > > > to me.
> > > >
> > > >
> > > > Looking forward to your feedback! Also @Antoine, I put in a blurb
> > around
> > > > the extension framework in there. Would love your thoughts on that.
> > > >
> > > > Best,
> > > > Burak
> > > >
> > > >
> > > > On Wed, May 27, 2026 at 3:09 AM Steve Loughran <[email protected]>
> > > > wrote:
> > > >
> > > > >> I do think FILE would be good, even though it gets complicated fast.
> > > >>
> > > >> It'd be good to support all of
> > > >>
> > > >>    1. small inline blob somewhere within the parquet file (|data| =
> > > bytes)
> > > >>    2. Medium blob: data stored range limited within a larger file
> > > (|data|
> > > >> =
> > > >>    kilo to megabytes)
> > > >>    3. completely separate file (GB +), or somehow the data lifecycle
> > > isn't
> > > >>    managed with parquet file.
> > > >>
> > > >> Issues I can see
> > > >>
> > > > >>    - lifecycle management: you don't want to discover that your photo
> > > >>    collection has been deleted by accident, and a data rewrite such
> as
> > > >>    applying DVs shouldn't mandate rebuilding of external binary
> files.
> > > >>    - security, esp when providing credential access to tables.
> > > Credential
> > > >>    providers would also need to provide file access, so have to know
> > > which
> > > >>    binary files are associated with parquet files, somehow.
> > > >>
> > > >> What have other formats done here?
> > > >>
> > > >> On Thu, 21 May 2026 at 22:13, Ryan Blue <[email protected]> wrote:
> > > >>
> > > >> > For some reason, the original email never came through for me.
> This
> > > >> thread
> > > >> > starts with Rahil's email. In case other people are having the
> same
> > > >> > problem, here's the thread Burak is talking about:
> > > >> > https://lists.apache.org/thread/od9hxfssjgnmsh23o18q78hszowq7pcy
> > > >> >
> > > >> > Ryan
> > > >> >
> > > >> > On Thu, May 21, 2026 at 1:30 PM Burak Yavuz <[email protected]>
> > wrote:
> > > >> >
> > > >> > > I'll share something early next week. The original proposal is
> in
> > > the
> > > >> > first
> > > >> > > email in this thread.
> > > >> > >
> > > >> > > Best,
> > > >> > > Burak
> > > >> > >
> > > >> > > On Thu, May 21, 2026, 1:15 PM Russell Spitzer <
> > > >> [email protected]
> > > >> > >
> > > >> > > wrote:
> > > >> > >
> > > >> > > > Do we have a proposal for this yet? I'm excited to go over it
> > and
> > > I
> > > >> > > thought
> > > >> > > > one was mentioned in the last sync but I haven't seen it.
> > > >> > > >
> > > >> > > > On Wed, Apr 8, 2026 at 1:33 PM Burak Yavuz <[email protected]>
> > > >> wrote:
> > > >> > > >
> > > >> > > > > Hi all,
> > > >> > > > >
> > > >> > > > > Very sorry for the late reply, and thanks for the questions!
> > The
> > > >> > > messages
> > > >> > > > > were not landing in my inbox properly.
> > > >> > > > >
> > > >> > > > > @Antoine
> > > >> > > > > > I feel like this is the kind of use case where a
> > hypothetical
> > > >> > > extension
> > > >> > > > > type mechanism would be a better fit than hardcoding
> dedicated
> > > >> > logical
> > > >> > > > > types in the Thrift definition.
> > > >> > > > >
> > > > >> > > > > What would that look like? We wanted to introduce this
> logical
> > > >> type to
> > > >> > > > > Parquet specifically, so that table formats such as Delta
> and
> > > >> Iceberg
> > > >> > > can
> > > >> > > > > have a simpler protocol change, and that we could provide
> this
> > > as
> > > >> a
> > > >> > > > > consistent format across multiple data processing engines.
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > @Rahil
> > > >> > > > > > I wanted to better understand one point. Based on the
> > current
> > > >> spec
> > > >> > > you
> > > >> > > > > shared I see you have a parameter for the following:
> > > >> > > > > > > size INT64 -- the size of the file in bytes
> > > >> > > > > >  Are you proposing that the "File" type always writes the
> > > binary
> > > >> > > > content
> > > >> > > > > of
> > > >> > > > > something such as an image or video directly within the
> > Parquet
> > > >> file
> > > >> > > > (i.e.,
> > > >> > > > > "inlining")? Or would it make sense for the spec to have
> some
> > > >> field
> > > >> > > > > distinguishing whether to store the content's bytes in the
> > file
> > > >> > itself
> > > >> > > vs
> > > >> > > > > simply track a pointer to the actual file in storage (i.e.,
> > > >> keeping
> > > >> > it
> > > >> > > > "out
> > > >> > > > > of line").
> > > >> > > > >
> > > >> > > > > This is a great question. When it comes to FileType, the
> data
> > > will
> > > >> > > > > primarily be external to the parquet file, so the FileType
> > would
> > > >> just
> > > >> > > > store
> > > >> > > > > the pointer to the data.
> > > >> > > > > Now, can that data be inlined anyway? That is an
> optimization
> > > that
> > > >> > can
> > > >> > > > > certainly be done. However, that requires some benchmarks to
> > see
> > > >> how
> > > >> > > much
> > > >> > > > > the benefit would be.
> > > >> > > > > If compute engines were to carry this struct without any
> > column
> > > >> > pruning
> > > >> > > > > across all operations, having inline binary content would
> make
> > > >> > > operations
> > > >> > > > > like sorting and shuffling a lot more expensive.
> > > >> > > > > We couldn't instinctively justify whether this would be
> worth
> > it
> > > >> just
> > > >> > > > yet.
> > > >> > > > > However, the current proposed spec doesn't prevent you from
> > also
> > > >> > > storing
> > > >> > > > > the content inline side by side with the pointer
> information.
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > On Sun, Mar 8, 2026 at 5:54 PM Rahil C <[email protected]
> >
> > > >> wrote:
> > > >> > > > >
> > > >> > > > > > Hi Burak,
> > > >> > > > > >
> > > >> > > > > > Thanks for starting this discussion. I was also interested
> > in
> > > >> > raising
> > > >> > > > > this
> > > >> > > > > > topic within the Parquet community (unless it has already
> > been
> > > >> > > > discussed
> > > >> > > > > in
> > > >> > > > > > the past).
> > > >> > > > > > For users working with unstructured data today such as
> large
> > > >> text,
> > > >> > > > > images,
> > > >> > > > > > or video, a data type such as a "file" or "blob" would be
> > > >> useful.
> > > >> > > > > >
> > > >> > > > > > I wanted to better understand one point. Based on the
> > current
> > > >> spec
> > > >> > > you
> > > >> > > > > > shared I see you have a parameter for the following:
> > > >> > > > > > > size INT64 -- the size of the file in bytes
> > > >> > > > > >
> > > >> > > > > >  Are you proposing that the "File" type always writes the
> > > binary
> > > >> > > > content
> > > >> > > > > of
> > > >> > > > > > something such as an image or video directly within the
> > > Parquet
> > > >> > file
> > > >> > > > > (i.e.,
> > > >> > > > > > "inlining")? Or would it make sense for the spec to have
> > some
> > > >> field
> > > >> > > > > > distinguishing whether to store the content's bytes in the
> > > file
> > > >> > > itself
> > > >> > > > vs
> > > >> > > > > > simply track a pointer to the actual file in storage
> (i.e.,
> > > >> keeping
> > > >> > > it
> > > >> > > > > "out
> > > >> > > > > > of line"). I would assume there are use cases where you
> > would
> > > >> want
> > > >> > to
> > > >> > > > > store
> > > >> > > > > > the binary content of something, like a small image within
> > the
> > > >> > > Parquet
> > > >> > > > > file
> > > >> > > > > > instead of storing a pointer to a large video file in
> object
> > > >> > storage.
> > > >> > > > > >
> > > >> > > > > > Regards,
> > > >> > > > > > Rahil Chertara
> > > >> > > > > >
> > > >> > > > > > On Sat, Mar 7, 2026 at 1:19 AM Antoine Pitrou <
> > > >> [email protected]>
> > > >> > > > > wrote:
> > > >> > > > > >
> > > >> > > > > > >
> > > >> > > > > > > Hello,
> > > >> > > > > > >
> > > >> > > > > > > I feel like this is the kind of use case where a
> > > hypothetical
> > > >> > > > extension
> > > >> > > > > > > type mechanism would be a better fit than hardcoding
> > > dedicated
> > > >> > > > logical
> > > >> > > > > > > types in the Thrift definition.
> > > >> > > > > > >
> > > >> > > > > > > Regards
> > > >> > > > > > >
> > > >> > > > > > > Antoine.
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > > Le 07/03/2026 à 01:57, Burak Yavuz a écrit :
> > > >> > > > > > > > Hello Parquet community,
> > > >> > > > > > > >
> > > >> > > > > > > > Unstructured data ingestion is getting extremely
> popular
> > > >> with
> > > >> > the
> > > >> > > > > > > advances
> > > >> > > > > > > > in Generative AI. Today, our only means of dealing
> with
> > > >> > > > unstructured
> > > >> > > > > > data
> > > >> > > > > > > > is to store it as a byte array inside Parquet, or
> point
> > to
> > > >> > files
> > > >> > > > that
> > > >> > > > > > > exist
> > > >> > > > > > > > in some object store with a string. These solutions
> fail
> > > to
> > > >> > > address
> > > >> > > > > > these
> > > >> > > > > > > > use cases, because of scalability, usability, and
> > > governance
> > > >> > > > issues.
> > > >> > > > > > > >
> > > >> > > > > > > > We would like to introduce a new logical type
> annotation
> > > in
> > > >> > > Parquet
> > > >> > > > > > > called
> > > >> > > > > > > > “File” for storing a struct that contains a path
> > reference
> > > >> to a
> > > >> > > > file
> > > >> > > > > > with
> > > >> > > > > > > > additional metadata.
> > > >> > > > > > > >
> > > >> > > > > > > > We propose that the struct contains the following
> > fields:
> > > >> > > > > > > >
> > > >> > > > > > > > path STRING NOT NULL -- the opaque path to a file
> > > >> > > > > > > >
> > > >> > > > > > > > size INT64 -- the size of the file in bytes
> > > >> > > > > > > >
> > > > >> > > > > > > > content_type STRING -- the mime/content type of the file
> > > >> > > > > > > >
> > > > >> > > > > > > > etag STRING -- the eTag identifier of the file. Can be used to detect changes to a file
> > > >> > > > > > > >
> > > >> > > > > > > > The path will be stored as an opaque string; whatever
> > the
> > > >> user
> > > >> > > > > > provides.
> > > >> > > > > > > We
> > > >> > > > > > > > don’t do any special encoding on it. The size will be
> > the
> > > >> size
> > > >> > of
> > > >> > > > the
> > > >> > > > > > > file
> > > > >> > > > > > > > in bytes as a long. We also store the content_type of the file and its etag.
> > > >> > > > > > > >
> > > > >> > > > > > > > We believe that this set of options is bare-bones
> and
> > > can
> > > >> be
> > > >> > > > easily
> > > >> > > > > > > > extended by new optional fields in the future if
> desired
> > > >> that
> > > >> > > > > wouldn’t
> > > >> > > > > > > > impact the correctness of the file being read. We
> would
> > > >> like to
> > > >> > > > > > > introduce a
> > > >> > > > > > > > versioning field to the specification in case we need
> > new
> > > >> > fields
> > > >> > > in
> > > >> > > > > the
> > > >> > > > > > > > specification that may impact correctness, when
> > accessing
> > > a
> > > >> > file.
> > > >> > > > > > > >
> > > >> > > > > > > > We would represent this in parquet.thrift
> > > >> > > > > > > > <
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > >
> >
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift
> > > >> > > > > > > >
> > > >> > > > > > > > as:
> > > >> > > > > > > >
> > > >> > > > > > > > /**
> > > >> > > > > > > >
> > > >> > > > > > > >   * File logical type annotation
> > > >> > > > > > > >
> > > >> > > > > > > >   */
> > > >> > > > > > > >
> > > >> > > > > > > > struct FileType {
> > > >> > > > > > > >
> > > >> > > > > > > >    // Versioning specification of the File struct
> > > contents.
> > > >> Can
> > > >> > > be
> > > >> > > > > used
> > > >> > > > > > > if a
> > > >> > > > > > > > new field is introduced to the
> > > >> > > > > > > >
> > > >> > > > > > > >    // struct representing the file, which may impact
> > > >> > correctness
> > > >> > > > when
> > > >> > > > > > > > accessing the file.
> > > >> > > > > > > >
> > > >> > > > > > > >    1: optional i8 specification_version
> > > >> > > > > > > >
> > > >> > > > > > > > }
> > > >> > > > > > > >
> > > >> > > > > > > > We believe that by natively supporting File references
> > in
> > > >> > > Parquet,
> > > >> > > > it
> > > >> > > > > > > will
> > > >> > > > > > > > become much simpler to build AI workloads on top of
> data
> > > >> stored
> > > >> > > in
> > > >> > > > > > > Parquet
> > > >> > > > > > > > across table formats and data processing engines.
> > Looking
> > > >> > forward
> > > >> > > > to
> > > >> > > > > > your
> > > >> > > > > > > > feedback!
> > > >> > > > > > > >
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > > >
> > > >> > > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>
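
For readers skimming the thread, here is a minimal sketch of the File struct as discussed above. It assumes the fields from the original proposal (path, size, etag) with content_type dropped as described in the later update; the updated design document may define the fields differently. The names `FileRef` and `has_changed` are hypothetical and do not come from any Parquet implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileRef:
    """Hypothetical in-memory view of the proposed File struct."""
    path: str                   # opaque path, stored exactly as the user provided it
    size: Optional[int] = None  # size of the file in bytes (INT64 in the proposal)
    etag: Optional[str] = None  # eTag identifier, usable to detect changes


def has_changed(ref: FileRef, current_etag: Optional[str]) -> Optional[bool]:
    """Compare the stored etag with the one currently reported by storage.

    Returns None when either etag is missing, since no comparison is possible.
    """
    if ref.etag is None or current_etag is None:
        return None
    return ref.etag != current_etag


ref = FileRef(path="s3://bucket/images/cat.png", size=2048, etag="abc123")
print(has_changed(ref, "abc123"))
print(has_changed(ref, "def456"))
```

Anything beyond these fields, such as a content type or inline bytes, would live in neighboring columns, in line with the takeaway from the sync.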
