I'll share something early next week. The original proposal is in the first
email in this thread.

Best,
Burak

On Thu, May 21, 2026, 1:15 PM Russell Spitzer <[email protected]> wrote:

> Do we have a proposal for this yet? I'm excited to go over it and I thought
> one was mentioned in the last sync but I haven't seen it.
>
> On Wed, Apr 8, 2026 at 1:33 PM Burak Yavuz <[email protected]> wrote:
>
> > Hi all,
> >
> > Very sorry for the late reply, and thanks for the questions! The messages
> > were not landing in my inbox properly.
> >
> > @Antoine
> > > I feel like this is the kind of use case where a hypothetical extension
> > > type mechanism would be a better fit than hardcoding dedicated logical
> > > types in the Thrift definition.
> >
> > What would that look like? We wanted to introduce this logical type to
> > Parquet specifically, so that table formats such as Delta and Iceberg can
> > adopt it with a simpler protocol change, and so that we can provide a
> > consistent format across multiple data processing engines.
> >
> >
> > @Rahil
> > > I wanted to better understand one point. Based on the current spec you
> > > shared I see you have a parameter for the following:
> > > > size INT64 -- the size of the file in bytes
> > > Are you proposing that the "File" type always writes the binary content
> > > of something such as an image or video directly within the Parquet file
> > > (i.e., "inlining")? Or would it make sense for the spec to have some field
> > > distinguishing whether to store the content's bytes in the file itself vs
> > > simply track a pointer to the actual file in storage (i.e., keeping it
> > > "out of line").
> >
> > This is a great question. When it comes to FileType, the data will
> > primarily be external to the parquet file, so the FileType would just
> > store the pointer to the data.
> > Now, can that data be inlined anyway? That is an optimization that can
> > certainly be done. However, that requires some benchmarks to see how much
> > the benefit would be.
> > If compute engines were to carry this struct without any column pruning
> > across all operations, having inline binary content would make operations
> > like sorting and shuffling a lot more expensive.
> > We couldn't instinctively justify whether this would be worth it just
> > yet. However, the current proposed spec doesn't prevent you from also
> > storing the content inline side by side with the pointer information.
> >
> >
> >
> > On Sun, Mar 8, 2026 at 5:54 PM Rahil C <[email protected]> wrote:
> >
> > > Hi Burak,
> > >
> > > Thanks for starting this discussion. I was also interested in raising
> > > this topic within the Parquet community (unless it has already been
> > > discussed in the past).
> > > For users working with unstructured data today such as large text,
> > > images, or video, a data type such as a "file" or "blob" would be useful.
> > >
> > > I wanted to better understand one point. Based on the current spec you
> > > shared I see you have a parameter for the following:
> > > > size INT64 -- the size of the file in bytes
> > >
> > > Are you proposing that the "File" type always writes the binary content
> > > of something such as an image or video directly within the Parquet file
> > > (i.e., "inlining")? Or would it make sense for the spec to have some field
> > > distinguishing whether to store the content's bytes in the file itself vs
> > > simply track a pointer to the actual file in storage (i.e., keeping it
> > > "out of line"). I would assume there are use cases where you would want
> > > to store the binary content of something, like a small image within the
> > > Parquet file instead of storing a pointer to a large video file in object
> > > storage.
> > >
> > > Regards,
> > > Rahil Chertara
> > >
> > > On Sat, Mar 7, 2026 at 1:19 AM Antoine Pitrou <[email protected]> wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > I feel like this is the kind of use case where a hypothetical
> > > > extension type mechanism would be a better fit than hardcoding
> > > > dedicated logical types in the Thrift definition.
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > > > On 07/03/2026 at 01:57, Burak Yavuz wrote:
> > > > > Hello Parquet community,
> > > > >
> > > > > Unstructured data ingestion is getting extremely popular with the
> > > > > advances in Generative AI. Today, our only means of dealing with
> > > > > unstructured data is to store it as a byte array inside Parquet, or
> > > > > point to files that exist in some object store with a string. These
> > > > > solutions fail to address these use cases, because of scalability,
> > > > > usability, and governance issues.
> > > > >
> > > > > We would like to introduce a new logical type annotation in Parquet
> > > > > called “File” for storing a struct that contains a path reference to
> > > > > a file with additional metadata.
> > > > >
> > > > > We propose that the struct contains the following fields:
> > > > >
> > > > > path STRING NOT NULL -- the opaque path to a file
> > > > > size INT64           -- the size of the file in bytes
> > > > > content_type STRING  -- the mime/content type of the file
> > > > > etag STRING          -- the eTag identifier of the file. Can be used
> > > > >                      -- to detect changes to a file
> > > > >
> > > > > The path will be stored as an opaque string; whatever the user
> > > > > provides. We don’t do any special encoding on it. The size will be
> > > > > the size of the file in bytes, stored as a long. We also store the
> > > > > content_type of the file, and its etag.
> > > > >
> > > > > We believe that this set of options is bare-bones and can easily be
> > > > > extended in the future with new optional fields that wouldn’t impact
> > > > > the correctness of the file being read. We would like to introduce a
> > > > > versioning field to the specification in case we need new fields that
> > > > > may impact correctness when accessing a file.
> > > > >
> > > > > We would represent this in parquet.thrift
> > > > > <https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift>
> > > > > as:
> > > > >
> > > > > /**
> > > > >  * File logical type annotation
> > > > >  */
> > > > > struct FileType {
> > > > >   // Versioning specification of the File struct contents. Can be used
> > > > >   // if a new field is introduced to the struct representing the file,
> > > > >   // which may impact correctness when accessing the file.
> > > > >   1: optional i8 specification_version
> > > > > }
> > > > >
> > > > > We believe that by natively supporting File references in Parquet, it
> > > > > will become much simpler to build AI workloads on top of data stored
> > > > > in Parquet across table formats and data processing engines. Looking
> > > > > forward to your feedback!
> > > > >
> > > >
> > > >
> > > >
> > >
> >
>
