Re: [DISCUSS] Introducing a new “File” logical type to Parquet

Daniel Weeks Fri, 05 Jun 2026 09:01:54 -0700

Hey everyone,

I had an action item to follow up and provide more context based on the
short discussion during the sync (some is recap of what Burak already said
above).


I don't seem to have access to the video, so I can't provide a timestamped
link, but can share the high-level takeaways:

There was a fair bit of discussion back and forth in the doc around some of
the fields (especially metadata and content_type).  In the end, what I feel
resonated most with everyone is that if we're creating new primitive types,
we should define them as narrowly as possible (don't include a bunch of
extra fields with hypothetical use cases).  We also looked across other
implementations, and while there was some variation, Burak's updated
proposal seems consistent with most of the representations.

If users want to include additional information, it makes more sense to
carry that information in neighboring fields as it quickly shifts to more
specific use cases.
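
As a rough illustration of the neighboring-fields idea, here is a minimal
Python sketch. It is not part of the proposal: the class names FileRef and
ImageRow, the caption column, and the example path are hypothetical, and the
File fields shown (path, size, etag) assume the updated proposal keeps the
original fields minus content_type.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class FileRef:
    """Hypothetical in-memory mirror of a narrowly defined File struct."""
    path: str                    # opaque path, stored exactly as provided
    size: Optional[int] = None   # file size in bytes, if known
    etag: Optional[str] = None   # eTag, usable to detect changes to the file


@dataclass(frozen=True)
class ImageRow:
    """Hypothetical row: use-case-specific data sits next to the File."""
    image: FileRef
    content_type: Optional[str] = None  # neighboring column, not in File
    caption: Optional[str] = None       # another neighboring column


row = ImageRow(
    image=FileRef(path="s3://example-bucket/cat.png", size=2048, etag="abc123"),
    content_type="image/png",
    caption="a cat",
)

# The File struct stays narrow; extra information lives in sibling fields.
record = asdict(row)
print(sorted(record["image"]))
print(record["content_type"])
```

The point of the layout is that a reader only needs the File fields to locate
and validate the referenced file, while anything use-case specific can be
added or dropped as an ordinary column without touching the type itself.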

Thanks Burak for the quick turnaround!

-Dan

On Fri, Jun 5, 2026 at 8:37 AM Micah Kornfield <[email protected]>
wrote:

> > If there are no strong arguments against the current proposal, may I follow
> > up with a pull request to apache/parquet-format
> > <https://github.com/apache/parquet-format>? What would be the next steps?
> > Or would I need to start a vote first?
>
> Hi Burak,
> New feature steps are listed in the format contributors guide [1].  If
> there are no objections we can move to step 2 (completeness): A PR against
> parquet-format and updates to the reference implementations (hopefully
> these are pretty trivial for this case).
>
> I think we can probably start the PRs next week to give people a chance to
> digest the current proposal and speak up if there are hard objections.
>
> Cheers,
> Micah
>
>
> [1]
>
> https://github.com/apache/parquet-format/blob/master/CONTRIBUTING.md#additionschanges-to-the-format
>
>
> On Fri, Jun 5, 2026 at 8:25 AM Burak Yavuz <[email protected]> wrote:
>
> > Hi all,
> >
> > Thank you all for the great discussion on the document! I made another pass
> > on the doc. During the Parquet sync, there was alignment around keeping the
> > field as simple and minimalistic as possible. I updated the doc accordingly
> > (removed content_type from the field) to ensure that the fields available
> > are all functional fields for correctly reading a file.
> >
> > Please let me know if you have more feedback!
> >
> > If there are no strong arguments against the current proposal, may I follow
> > up with a pull request to apache/parquet-format
> > <https://github.com/apache/parquet-format>? What would be the next steps?
> > Or would I need to start a vote first?
> >
> > Thanks,
> > Burak
> >
> > On Wed, May 27, 2026 at 10:31 AM Burak Yavuz <[email protected]> wrote:
> >
> > > Hello all,
> > >
> > > I'm sharing the design document for File Type here
> > > <https://docs.google.com/document/d/1AiwrstqkwkBoOZqgOkm9JGwSMcNeHyLR7EEj1CVqpZQ/edit?usp=sharing>.
> > > Please let me know what you think!
> > > Wanted to thank Micah Kornfield, Divjot Arora, and Daniel Weeks for their
> > > feedback working on this document.
> > >
> > > Steve, regarding your questions, my thoughts are inline:
> > > > 1. small inline blob somewhere within the parquet file (|data| = bytes)
> > > We have a lot of design options here. Does it need to be part of "File"?
> > > That's debatable. Engines/table formats can decide to coalesce a File
> > > reference with an inline value when available, for example. Carrying an
> > > inline binary blob may make analytics workloads less efficient,
> > > specifically if you have to carry them around as baggage through sorts and
> > > shuffles.
> > >
> > > > 2. Medium blob: data stored range limited within a larger file (|data| =
> > > > kilo to megabytes)
> > > Again, it can be up to a table format to decide on creating sidecar files,
> > > where the sidecar may be built on top of these file references.
> > >
> > > > 3. completely separate file (GB +), or somehow the data lifecycle isn't
> > > > managed with parquet file.
> > >
> > > This file reference solves this problem as well.
> > >
> > > > lifecycle management you don't want to discover that your photo
> > > > collection has been deleted by accident, and a data rewrite such as
> > > > applying DVs shouldn't mandate rebuilding of external binary files.
> > > > security, esp when providing credential access to tables. Credential
> > > > providers would also need to provide file access, so have to know which
> > > > binary files are associated with parquet files, somehow.
> > >
> > > These all sound like problems that should be handled at different layers
> > > of:
> > >   - table format
> > >   - engine
> > >   - catalog
> > > to me.
> > >
> > >
> > > Looking forward to your feedback! Also @Antoine, I put in a blurb around
> > > the extension framework in there. Would love your thoughts on that.
> > >
> > > Best,
> > > Burak
> > >
> > >
> > > On Wed, May 27, 2026 at 3:09 AM Steve Loughran <[email protected]>
> > > wrote:
> > >
> > >> I do think FILE would be good, even though it gets complicated fast.
> > >>
> > >> It'd be good to support all of
> > >>
> > >>    1. small inline blob somewhere within the parquet file (|data| = bytes)
> > >>    2. Medium blob: data stored range limited within a larger file (|data| =
> > >>    kilo to megabytes)
> > >>    3. completely separate file (GB +), or somehow the data lifecycle isn't
> > >>    managed with parquet file.
> > >>
> > >> Issues I can see
> > >>
> > >>    - lifecycle management: you don't want to discover that your photo
> > >>    collection has been deleted by accident, and a data rewrite such as
> > >>    applying DVs shouldn't mandate rebuilding of external binary files.
> > >>    - security, esp when providing credential access to tables. Credential
> > >>    providers would also need to provide file access, so have to know which
> > >>    binary files are associated with parquet files, somehow.
> > >>
> > >> What have other formats done here?
> > >>
> > >> On Thu, 21 May 2026 at 22:13, Ryan Blue <[email protected]> wrote:
> > >>
> > >> > For some reason, the original email never came through for me. This thread
> > >> > starts with Rahil's email. In case other people are having the same
> > >> > problem, here's the thread Burak is talking about:
> > >> > https://lists.apache.org/thread/od9hxfssjgnmsh23o18q78hszowq7pcy
> > >> >
> > >> > Ryan
> > >> >
> > >> > On Thu, May 21, 2026 at 1:30 PM Burak Yavuz <[email protected]> wrote:
> > >> >
> > >> > > I'll share something early next week. The original proposal is in the
> > >> > > first email in this thread.
> > >> > >
> > >> > > Best,
> > >> > > Burak
> > >> > >
> > >> > > On Thu, May 21, 2026, 1:15 PM Russell Spitzer <[email protected]>
> > >> > > wrote:
> > >> > >
> > >> > > > Do we have a proposal for this yet? I'm excited to go over it and I
> > >> > > > thought one was mentioned in the last sync but I haven't seen it.
> > >> > > >
> > >> > > > On Wed, Apr 8, 2026 at 1:33 PM Burak Yavuz <[email protected]> wrote:
> > >> > > >
> > >> > > > > Hi all,
> > >> > > > >
> > >> > > > > Very sorry for the late reply, and thanks for the questions! The
> > >> > > > > messages were not landing in my inbox properly.
> > >> > > > >
> > >> > > > > @Antoine
> > >> > > > > > I feel like this is the kind of use case where a hypothetical
> > >> > > > > > extension type mechanism would be a better fit than hardcoding
> > >> > > > > > dedicated logical types in the Thrift definition.
> > >> > > > >
> > >> > > > > What would that look like? We wanted to introduce this logical type
> > >> > > > > to Parquet specifically, so that table formats such as Delta and
> > >> > > > > Iceberg can have a simpler protocol change, and so that we could
> > >> > > > > provide this as a consistent format across multiple data processing
> > >> > > > > engines.
> > >> > > > >
> > >> > > > >
> > >> > > > > @Rahil
> > >> > > > > > I wanted to better understand one point. Based on the current spec
> > >> > > > > > you shared I see you have a parameter for the following:
> > >> > > > > > > size INT64 -- the size of the file in bytes
> > >> > > > > > Are you proposing that the "File" type always writes the binary
> > >> > > > > > content of something such as an image or video directly within the
> > >> > > > > > Parquet file (i.e., "inlining")? Or would it make sense for the spec
> > >> > > > > > to have some field distinguishing whether to store the content's
> > >> > > > > > bytes in the file itself vs simply track a pointer to the actual
> > >> > > > > > file in storage (i.e., keeping it "out of line")?
> > >> > > > >
> > >> > > > > This is a great question. When it comes to FileType, the data will
> > >> > > > > primarily be external to the parquet file, so the FileType would just
> > >> > > > > store the pointer to the data.
> > >> > > > > Now, can that data be inlined anyway? That is an optimization that can
> > >> > > > > certainly be done. However, that requires some benchmarks to see how
> > >> > > > > much the benefit would be.
> > >> > > > > If compute engines were to carry this struct without any column pruning
> > >> > > > > across all operations, having inline binary content would make
> > >> > > > > operations like sorting and shuffling a lot more expensive.
> > >> > > > > We couldn't instinctively justify whether this would be worth it just
> > >> > > > > yet.
> > >> > > > > However, the current proposed spec doesn't prevent you from also
> > >> > > > > storing the content inline side by side with the pointer information.
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > On Sun, Mar 8, 2026 at 5:54 PM Rahil C <[email protected]> wrote:
> > >> > > > >
> > >> > > > > > Hi Burak,
> > >> > > > > >
> > >> > > > > > Thanks for starting this discussion. I was also interested in raising
> > >> > > > > > this topic within the Parquet community (unless it has already been
> > >> > > > > > discussed in the past).
> > >> > > > > > For users working with unstructured data today such as large text,
> > >> > > > > > images, or video, a data type such as a "file" or "blob" would be
> > >> > > > > > useful.
> > >> > > > > >
> > >> > > > > > I wanted to better understand one point. Based on the current spec
> > >> > > > > > you shared I see you have a parameter for the following:
> > >> > > > > > > size INT64 -- the size of the file in bytes
> > >> > > > > >
> > >> > > > > > Are you proposing that the "File" type always writes the binary
> > >> > > > > > content of something such as an image or video directly within the
> > >> > > > > > Parquet file (i.e., "inlining")? Or would it make sense for the spec
> > >> > > > > > to have some field distinguishing whether to store the content's
> > >> > > > > > bytes in the file itself vs simply track a pointer to the actual
> > >> > > > > > file in storage (i.e., keeping it "out of line")? I would assume
> > >> > > > > > there are use cases where you would want to store the binary content
> > >> > > > > > of something, like a small image within the Parquet file instead of
> > >> > > > > > storing a pointer to a large video file in object storage.
> > >> > > > > >
> > >> > > > > > Regards,
> > >> > > > > > Rahil Chertara
> > >> > > > > >
> > >> > > > > > On Sat, Mar 7, 2026 at 1:19 AM Antoine Pitrou <[email protected]> wrote:
> > >> > > > > >
> > >> > > > > > >
> > >> > > > > > > Hello,
> > >> > > > > > >
> > >> > > > > > > I feel like this is the kind of use case where a hypothetical
> > >> > > > > > > extension type mechanism would be a better fit than hardcoding
> > >> > > > > > > dedicated logical types in the Thrift definition.
> > >> > > > > > >
> > >> > > > > > > Regards
> > >> > > > > > >
> > >> > > > > > > Antoine.
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > > On 07/03/2026 at 01:57, Burak Yavuz wrote:
> > >> > > > > > > > Hello Parquet community,
> > >> > > > > > > >
> > >> > > > > > > > Unstructured data ingestion is getting extremely popular with the
> > >> > > > > > > > advances in Generative AI. Today, our only means of dealing with
> > >> > > > > > > > unstructured data is to store it as a byte array inside Parquet, or
> > >> > > > > > > > point to files that exist in some object store with a string. These
> > >> > > > > > > > solutions fail to address these use cases, because of scalability,
> > >> > > > > > > > usability, and governance issues.
> > >> > > > > > > >
> > >> > > > > > > > We would like to introduce a new logical type annotation in Parquet
> > >> > > > > > > > called “File” for storing a struct that contains a path reference to
> > >> > > > > > > > a file with additional metadata.
> > >> > > > > > > >
> > >> > > > > > > > We propose that the struct contains the following fields:
> > >> > > > > > > >
> > >> > > > > > > > path STRING NOT NULL -- the opaque path to a file
> > >> > > > > > > >
> > >> > > > > > > > size INT64 -- the size of the file in bytes
> > >> > > > > > > >
> > >> > > > > > > > content_type STRING -- the mime/content type of the file
> > >> > > > > > > >
> > >> > > > > > > > etag STRING -- the eTag identifier of the file. Can be used to
> > >> > > > > > > > -- detect changes to a file
> > >> > > > > > > > The path will be stored as an opaque string; whatever the user
> > >> > > > > > > > provides. We don’t do any special encoding on it. The size will be
> > >> > > > > > > > the size of the file in bytes, as a long. We also store the
> > >> > > > > > > > content_type of the file, and its etag.
> > >> > > > > > > >
> > >> > > > > > > > We believe that this set of options is bare-bones and can be easily
> > >> > > > > > > > extended in the future, if desired, with new optional fields that
> > >> > > > > > > > wouldn’t impact the correctness of the file being read. We would
> > >> > > > > > > > like to introduce a versioning field to the specification in case we
> > >> > > > > > > > need new fields in the specification that may impact correctness
> > >> > > > > > > > when accessing a file.
> > >> > > > > > > >
> > >> > > > > > > > We would represent this in parquet.thrift
> > >> > > > > > > > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift>
> > >> > > > > > > > as:
> > >> > > > > > > >
> > >> > > > > > > > /**
> > >> > > > > > > >  * File logical type annotation
> > >> > > > > > > >  */
> > >> > > > > > > > struct FileType {
> > >> > > > > > > >   // Versioning specification of the File struct contents. Can be
> > >> > > > > > > >   // used if a new field is introduced to the struct representing
> > >> > > > > > > >   // the file, which may impact correctness when accessing the file.
> > >> > > > > > > >   1: optional i8 specification_version
> > >> > > > > > > > }
> > >> > > > > > > >
> > >> > > > > > > > We believe that by natively supporting File references in Parquet,
> > >> > > > > > > > it will become much simpler to build AI workloads on top of data
> > >> > > > > > > > stored in Parquet across table formats and data processing engines.
> > >> > > > > > > > Looking forward to your feedback!
> > >> > > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > > >
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>
