>
> If there are no strong arguments against the current proposal, may I follow
> up with a pull request to apache/parquet-format
> <https://github.com/apache/parquet-format>? What would be the next steps?
> Or would I need to start a vote first?

Hi Burak,
New feature steps are listed in the format contributors guide [1].  If
there are no objections we can move to step 2 (completeness): A PR against
parquet-format and updates to the reference implementations (hopefully
these are pretty trivial for this case).

I think we can probably start the PRs next week to give people a chance to
digest the current proposal and speak up if there are hard objections.

Cheers,
Micah


[1]
https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#additionschanges-to-the-format


On Fri, Jun 5, 2026 at 8:25 AM Burak Yavuz <[email protected]> wrote:

> Hi all,
>
> Thank you all for the great discussion on the document! I made another pass
> on the doc. During the Parquet sync, there was alignment around keeping the
> field as simple and minimalistic as possible. I updated the doc in that way
> (removed content_type from the field) to ensure that the fields available
> are all functional fields for correctly reading a file.
>
> Please let me know if you have more feedback!
>
> If there are no strong arguments against the current proposal, may I follow
> up with a pull request to apache/parquet-format
> <https://github.com/apache/parquet-format>? What would be the next steps?
> Or would I need to start a vote first?
>
> Thanks,
> Burak
>
> On Wed, May 27, 2026 at 10:31 AM Burak Yavuz <[email protected]> wrote:
>
> > Hello all,
> >
> > I'm sharing the design document for File Type here
> > <https://docs.google.com/document/d/1AiwrstqkwkBoOZqgOkm9JGwSMcNeHyLR7EEj1CVqpZQ/edit?usp=sharing>.
> > Please let me know what you think!
> > Wanted to thank Micah Kornfield, Divjot Arora, and Daniel Weeks for their
> > feedback working on this document.
> >
> > Steve, regarding your questions, my thoughts are inline:
> > >    1. small inline blob somewhere within the parquet file (|data| =
> > >    bytes)
> > We have a lot of design options here. Does it need to be part of "File"?
> > That's debatable. Engines/table formats can decide to coalesce a File
> > reference with an inline value when available, for example. Carrying an
> > inline binary blob may make analytics workloads less efficient,
> > specifically if you have to carry them around as baggage through sorts
> > and shuffles.
> >
> > >    2. Medium blob: data stored range limited within a larger file
> > >    (|data| = kilo to megabytes)
> > Again, it can be up to a table format to decide whether to create sidecar
> > files, where the sidecar may be built on top of these file references.
> >
> > >    3. completely separate file (GB +), or somehow the data lifecycle
> > >    isn't managed with parquet file.
> >
> > This file reference solves this problem as well.
> >
> > >    - lifecycle management: you don't want to discover that your photo
> > >    collection has been deleted by accident, and a data rewrite such as
> > >    applying DVs shouldn't mandate rebuilding of external binary files.
> > >    - security, esp when providing credential access to tables. Credential
> > >    providers would also need to provide file access, so have to know
> > >    which binary files are associated with parquet files, somehow.
> >
> > These all sound to me like problems that should be handled at different
> > layers:
> >   - table format
> >   - engine
> >   - catalog
> >
> >
> > Looking forward to your feedback! Also @Antoine, I put in a blurb around
> > the extension framework in there. Would love your thoughts on that.
> >
> > Best,
> > Burak
> >
> >
> > On Wed, May 27, 2026 at 3:09 AM Steve Loughran <[email protected]>
> > wrote:
> >
> >> I do think FILE would be good, even though it gets complicated fast.
> >>
> >> It'd be good to support all of
> >>
> >>    1. small inline blob somewhere within the parquet file (|data| =
> >>    bytes)
> >>    2. Medium blob: data stored range limited within a larger file
> >>    (|data| = kilo to megabytes)
> >>    3. completely separate file (GB +), or somehow the data lifecycle
> >>    isn't managed with parquet file.
> >>
> >> Issues I can see
> >>
> >>    - lifecycle management: you don't want to discover that your photo
> >>    collection has been deleted by accident, and a data rewrite such as
> >>    applying DVs shouldn't mandate rebuilding of external binary files.
> >>    - security, esp when providing credential access to tables.
> >>    Credential providers would also need to provide file access, so have
> >>    to know which binary files are associated with parquet files, somehow.
> >>
> >> What have other formats done here?
> >>
> >> On Thu, 21 May 2026 at 22:13, Ryan Blue <[email protected]> wrote:
> >>
> >> > For some reason, the original email never came through for me. This
> >> > thread starts with Rahil's email. In case other people are having the
> >> > same problem, here's the thread Burak is talking about:
> >> > https://lists.apache.org/thread/od9hxfssjgnmsh23o18q78hszowq7pcy
> >> >
> >> > Ryan
> >> >
> >> > On Thu, May 21, 2026 at 1:30 PM Burak Yavuz <[email protected]> wrote:
> >> >
> >> > > I'll share something early next week. The original proposal is in
> >> > > the first email in this thread.
> >> > >
> >> > > Best,
> >> > > Burak
> >> > >
> >> > > On Thu, May 21, 2026, 1:15 PM Russell Spitzer <[email protected]>
> >> > > wrote:
> >> > >
> >> > > > Do we have a proposal for this yet? I'm excited to go over it and I
> >> > > > thought one was mentioned in the last sync but I haven't seen it.
> >> > > >
> >> > > > On Wed, Apr 8, 2026 at 1:33 PM Burak Yavuz <[email protected]>
> >> > > > wrote:
> >> > > >
> >> > > > > Hi all,
> >> > > > >
> >> > > > > Very sorry for the late reply, and thanks for the questions! The
> >> > > > > messages were not landing in my inbox properly.
> >> > > > >
> >> > > > > @Antoine
> >> > > > > > I feel like this is the kind of use case where a hypothetical
> >> > > > > > extension type mechanism would be a better fit than hardcoding
> >> > > > > > dedicated logical types in the Thrift definition.
> >> > > > >
> >> > > > > What would that look like? We wanted to introduce this logical
> >> > > > > type to Parquet specifically, so that table formats such as Delta
> >> > > > > and Iceberg can have a simpler protocol change, and so that we
> >> > > > > could provide this as a consistent format across multiple data
> >> > > > > processing engines.
> >> > > > >
> >> > > > >
> >> > > > > @Rahil
> >> > > > > > I wanted to better understand one point. Based on the current
> >> > > > > > spec you shared I see you have a parameter for the following:
> >> > > > > > > size INT64 -- the size of the file in bytes
> >> > > > > > Are you proposing that the "File" type always writes the binary
> >> > > > > > content of something such as an image or video directly within
> >> > > > > > the Parquet file (i.e., "inlining")? Or would it make sense for
> >> > > > > > the spec to have some field distinguishing whether to store the
> >> > > > > > content's bytes in the file itself vs simply track a pointer to
> >> > > > > > the actual file in storage (i.e., keeping it "out of line").
> >> > > > >
> >> > > > > This is a great question. When it comes to FileType, the data
> >> > > > > will primarily be external to the parquet file, so the FileType
> >> > > > > would just store the pointer to the data.
> >> > > > > Now, can that data be inlined anyway? That is an optimization
> >> > > > > that can certainly be done. However, that requires some
> >> > > > > benchmarks to see how much the benefit would be.
> >> > > > > If compute engines were to carry this struct without any column
> >> > > > > pruning across all operations, having inline binary content would
> >> > > > > make operations like sorting and shuffling a lot more expensive.
> >> > > > > We couldn't instinctively justify whether this would be worth it
> >> > > > > just yet.
> >> > > > > However, the current proposed spec doesn't prevent you from also
> >> > > > > storing the content inline side by side with the pointer
> >> > > > > information.
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Sun, Mar 8, 2026 at 5:54 PM Rahil C <[email protected]>
> >> > > > > wrote:
> >> > > > >
> >> > > > > > Hi Burak,
> >> > > > > >
> >> > > > > > Thanks for starting this discussion. I was also interested in
> >> > > > > > raising this topic within the Parquet community (unless it has
> >> > > > > > already been discussed in the past).
> >> > > > > > For users working with unstructured data today such as large
> >> > > > > > text, images, or video, a data type such as a "file" or "blob"
> >> > > > > > would be useful.
> >> > > > > >
> >> > > > > > I wanted to better understand one point. Based on the current
> >> > > > > > spec you shared I see you have a parameter for the following:
> >> > > > > > > size INT64 -- the size of the file in bytes
> >> > > > > >
> >> > > > > > Are you proposing that the "File" type always writes the binary
> >> > > > > > content of something such as an image or video directly within
> >> > > > > > the Parquet file (i.e., "inlining")? Or would it make sense for
> >> > > > > > the spec to have some field distinguishing whether to store the
> >> > > > > > content's bytes in the file itself vs simply track a pointer to
> >> > > > > > the actual file in storage (i.e., keeping it "out of line"). I
> >> > > > > > would assume there are use cases where you would want to store
> >> > > > > > the binary content of something, like a small image within the
> >> > > > > > Parquet file instead of storing a pointer to a large video file
> >> > > > > > in object storage.
> >> > > > > >
> >> > > > > > Regards,
> >> > > > > > Rahil Chertara
> >> > > > > >
> >> > > > > > On Sat, Mar 7, 2026 at 1:19 AM Antoine Pitrou <[email protected]>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > >
> >> > > > > > > Hello,
> >> > > > > > >
> >> > > > > > > I feel like this is the kind of use case where a hypothetical
> >> > > > > > > extension type mechanism would be a better fit than
> >> > > > > > > hardcoding dedicated logical types in the Thrift definition.
> >> > > > > > >
> >> > > > > > > Regards
> >> > > > > > >
> >> > > > > > > Antoine.
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On 07/03/2026 at 01:57, Burak Yavuz wrote:
> >> > > > > > > > Hello Parquet community,
> >> > > > > > > >
> >> > > > > > > > Unstructured data ingestion is getting extremely popular with
> >> > > > > > > > the advances in Generative AI. Today, our only means of
> >> > > > > > > > dealing with unstructured data is to store it as a byte array
> >> > > > > > > > inside Parquet, or point to files that exist in some object
> >> > > > > > > > store with a string. These solutions fail to address these
> >> > > > > > > > use cases, because of scalability, usability, and governance
> >> > > > > > > > issues.
> >> > > > > > > >
> >> > > > > > > > We would like to introduce a new logical type annotation in
> >> > > > > > > > Parquet called “File” for storing a struct that contains a
> >> > > > > > > > path reference to a file with additional metadata.
> >> > > > > > > >
> >> > > > > > > > We propose that the struct contains the following fields:
> >> > > > > > > >
> >> > > > > > > > path STRING NOT NULL -- the opaque path to a file
> >> > > > > > > >
> >> > > > > > > > size INT64 -- the size of the file in bytes
> >> > > > > > > >
> >> > > > > > > > content_type STRING -- the mime/content type of the file
> >> > > > > > > >
> >> > > > > > > > etag STRING -- the eTag identifier of the file. Can be used
> >> > > > > > > > to detect changes to a file
> >> > > > > > > >
> >> > > > > > > > The path will be stored as an opaque string; whatever the
> >> > > > > > > > user provides. We don’t do any special encoding on it. The
> >> > > > > > > > size will be the size of the file in bytes as a long. We also
> >> > > > > > > > store the content_type of the file, and its etag.
> >> > > > > > > >
> >> > > > > > > > We believe that this set of options is bare-bones and can be
> >> > > > > > > > easily extended in the future, if desired, by new optional
> >> > > > > > > > fields that wouldn’t impact the correctness of the file
> >> > > > > > > > being read. We would like to introduce a versioning field to
> >> > > > > > > > the specification in case we need new fields in the
> >> > > > > > > > specification that may impact correctness when accessing a
> >> > > > > > > > file.
> >> > > > > > > >
> >> > > > > > > > We would represent this in parquet.thrift
> >> > > > > > > > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift>
> >> > > > > > > > as:
> >> > > > > > > >
> >> > > > > > > > /**
> >> > > > > > > >   * File logical type annotation
> >> > > > > > > >   */
> >> > > > > > > > struct FileType {
> >> > > > > > > >    // Versioning specification of the File struct contents.
> >> > > > > > > >    // Can be used if a new field is introduced to the struct
> >> > > > > > > >    // representing the file, which may impact correctness
> >> > > > > > > >    // when accessing the file.
> >> > > > > > > >    1: optional i8 specification_version
> >> > > > > > > > }
> >> > > > > > > >
> >> > > > > > > > We believe that by natively supporting File references in
> >> > > > > > > > Parquet, it will become much simpler to build AI workloads on
> >> > > > > > > > top of data stored in Parquet across table formats and data
> >> > > > > > > > processing engines. Looking forward to your feedback!
> >> > > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>
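
For readers following the thread, the struct from the original proposal
(path, size, content_type, etag) might be mirrored in memory roughly as
below. This is only an illustrative sketch, not part of the proposal or of
any Parquet library: the names FileRef and has_changed, the example path,
and the non-negative size check are all assumptions. Note that content_type
was later removed from the design doc per the Jun 5 message, so it appears
here only because the first email listed it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FileRef:
    """Hypothetical in-memory mirror of the proposed File struct."""

    path: str                           # opaque path, stored exactly as provided
    size: Optional[int] = None          # file size in bytes (INT64 in the proposal)
    content_type: Optional[str] = None  # MIME type; dropped in the later revision
    etag: Optional[str] = None          # identifier used to detect file changes

    def __post_init__(self):
        # The proposal marks path as NOT NULL.
        if self.path is None:
            raise ValueError("path is NOT NULL in the proposal")
        # Assumption: a size, when present, is a non-negative signed 64-bit value.
        if self.size is not None and not (0 <= self.size < 2**63):
            raise ValueError("size must fit in a non-negative INT64")


def has_changed(ref: FileRef, current_etag: Optional[str]) -> bool:
    """Return True only when both etags are known and they differ."""
    if ref.etag is None or current_etag is None:
        return False  # nothing to compare; treated as unchanged in this sketch
    return ref.etag != current_etag


# Hypothetical usage: the path and etag values are made up.
ref = FileRef(path="s3://bucket/img.png", size=1024, etag="abc")
stale = has_changed(ref, "def")
```

How the current etag is fetched from storage is left out on purpose, since
the thread treats the path as opaque and leaves access to the engine or
table format.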
