IMO Iceberg needs to be aware of Parquet files referencing others so that it can prune older snapshots correctly and avoid deleting Parquet files that are still referenced by others when expiring old snapshots. Whether the file-to-file reference is cross-table or within-table could make this more or less complicated.
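To make that expiry concern concrete, here is a minimal sketch (pyarrow-based; the function names and the catalog-provided file lists are hypothetical) of an expiry pass that only deletes files no surviving footer still points at via ColumnChunk.file_path. A real implementation would also resolve file_path values, which the spec defines as relative to the referencing file.

    import pyarrow.parquet as pq

    def externally_referenced(live_files):
        # Collect every file that a surviving footer points at via
        # ColumnChunk.file_path (unset/empty means "this same file").
        refs = set()
        for path in live_files:
            md = pq.read_metadata(path)
            for rg in range(md.num_row_groups):
                row_group = md.row_group(rg)
                for col in range(row_group.num_columns):
                    fp = row_group.column(col).file_path
                    if fp:
                        refs.add(fp)
        return refs

    def safe_to_delete(candidates, live_files):
        # Keep a candidate for deletion only if no surviving file
        # still references it.
        pinned = externally_referenced(live_files)
        return [p for p in candidates if p not in pinned]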
I could imagine starting with a simple implementation for the write path: "CREATE TABLE foo USING parquet OPTIONS (reference_original_columns = true) AS SELECT content, extract(content) AS metadata FROM bar". That would be constrained to simple plans where the scan and the output are in the same stage (map only), so that rows stay in the same order per file. Alternatively, "ALTER TABLE foo ADD COLUMN metadata OPTIONS (column_family = 'bar')" with a subsequent "UPDATE foo SET metadata = extract(content)" to create those files. (Just some random thoughts; I'm sure others have spent more time thinking about this.)

This doesn't seem that different from the mechanism that creates a deletion vector in Iceberg. It could also be seen as a view in Iceberg joining on _row_id. This can be a topic in the meeting tomorrow.
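As a rough illustration of the "view joining on _row_id" shape (at the application level, not using file_path itself), a pyarrow sketch; compute_scores and the file names are hypothetical. The positional zip at the end is only valid because both files preserve the original row order, which is exactly the map-only constraint above.

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Number of rows in the source, read from the footer only.
    n = pq.read_metadata("source.parquet").num_rows

    # Persist just the new column plus a row id; source.parquet is untouched.
    scores = pa.table({
        "_row_id": pa.array(range(n), type=pa.int64()),
        "score": compute_scores(),  # hypothetical: one score per source row
    })
    pq.write_table(scores, "scores.parquet")

    # Read path: zip the two files positionally.
    annotated = pq.read_table("source.parquet").append_column(
        "score", pq.read_table("scores.parquet")["score"])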
On Mon, Dec 8, 2025 at 9:02 AM Daniel Weeks <[email protected]> wrote:

> Thanks for the context, Kenny. That example is very similar to some of the cases that come up in the multi-modal scenarios.
>
> I agree that we're in a bit of a difficult situation due to the lack of existing support, which also leads to Micah's concern that it's a point of confusion for implementers.
>
> I would be in favor of adding some additional context to the description, because there are some basic things implementers should do (e.g. validate that the file path is either not set or set to the current file being read, if they don't support disaggregated column data). While older clients will likely break if they encounter files written this way, there's almost no risk of silent failures or corruption, as I suspect most implementations will read the ranges from the referencing file and not be able to interpret them.
>
> Adding a read path is relatively straightforward (at least in the Java implementation, for both stream and vectored IO reads), but the write path is where things get more complicated.
>
> I think we want to discuss some of these use cases in more detail and see if they are practical and reasonable. Some cases may make more sense at a higher level (like table metadata) while others may make sense to handle at the file level (like asymmetric column sizes).
>
> -Dan
>
> On Sun, Dec 7, 2025 at 12:35 PM Kenny Daniel <[email protected]> wrote:
>
> > Since I was the one who brought up file_path at the sync a couple of weeks ago, I'll share my thoughts:
> >
> > I am interested in the file_path field for column chunks because it would allow for some extremely efficient data engineering in specific cases like *adding a column to existing data*.
> >
> > My use case is LLM data. LLM data is often huge piles of text in Parquet format (see: all of Hugging Face, or any LLM request/response logs). If I have a 400 MB source.parquet file, how can I annotate each row with an added "score" column efficiently? I would prefer not to copy all 400 MB of data just to add a "score" column. It would be slick if I could make a new annotated.parquet file that points to source.parquet for the source columns and then includes only the new "score" column in the annotated.parquet file. The source.parquet would remain 400 MB; the annotated Parquet could be ~10 KB and incorporate the source data by reference.
> >
> > As the implementor of hyparquet, I have conflicting opinions on this feature. On the one hand, it's a cool capability, already built into Parquet. On the other hand, none of the Parquet implementations support it. Hyparquet has a branch for reading/writing file_path that I used for testing. It does work. But I don't want to ship it unless there's at least ONE other implementation that supports it (there isn't).
> >
> > I agree that this would be better implemented at the table format level (e.g. Iceberg). BUT... *Iceberg does not support my adding-a-column use case*! The problem is that, despite Parquet being a column-oriented format, Iceberg has no support for efficiently zipping a new column with existing data. The only option for "add column" in Iceberg would be to *add a column with default values and then rewrite every row* (including the heavy text data). So Iceberg fails to solve my problem at all.
> >
> > Anyway, I'm fine with deprecating, or not. But I did want to at least make the case that it could serve a purpose that I don't see any other good way of solving at the moment.
> >
> > Kenny
> >
> > On Fri, Dec 5, 2025 at 9:46 PM Micah Kornfield <[email protected]> wrote:
> >
> > > Hi Dan,
> > >
> > > > However, there are ongoing discussions around multi-modal cases where either separating large columns (e.g. inline blobs) or appending column data without rewriting existing data may leverage this.
> > >
> > > Do you have any design docs or mailing list discussions you can point to?
> > >
> > > > I don't feel that leaving this as-is for now while we explore those use cases would cause any additional confusion/complexity.
> > >
> > > Agreed, it isn't urgent to clean this up. But having a more concrete timeline would be helpful; this does seem to be a semi-regular source of confusion for folks, so it would be nice to tie up the loose end.
> > >
> > > Thanks,
> > > Micah
> > >
> > > On Fri, Dec 5, 2025 at 4:07 PM Daniel Weeks <[email protected]> wrote:
> > >
> > > > I'd actually prefer that we don't deprecate this field (at least not immediately).
> > > >
> > > > Recognizing that we've discussed separating column data into multiple files for over a decade without any concrete implementations, there are emerging use cases that may benefit from investing in this feature.
> > > >
> > > > Many of the past use cases were misaligned (e.g. separating column data for security/encryption), and better alternatives addressed those scenarios.
> > > >
> > > > However, there are ongoing discussions around multi-modal cases where either separating large columns (e.g. inline blobs) or appending column data without rewriting existing data may leverage this.
> > > >
> > > > I don't feel that leaving this as-is for now while we explore those use cases would cause any additional confusion/complexity.
> > > >
> > > > -Dan
> > > >
> > > > On Thu, Dec 4, 2025 at 9:04 AM Micah Kornfield <[email protected]> wrote:
> > > >
> > > > > > What does "deprecated" entail here? Do we plan to remove this field from the format? Otherwise, is it just documentation?
> > > > >
> > > > > I was imagining just documentation, since we don't want to break the "_metadata file" use case.
> > > > >
> > > > > On Thu, Dec 4, 2025 at 8:18 AM Antoine Pitrou <[email protected]> wrote:
> > > > >
> > > > > > What does "deprecated" entail here? Do we plan to remove this field from the format? Otherwise, is it just documentation?
> > > > > >
> > > > > > On Mon, 1 Dec 2025 12:09:18 -0800
> > > > > > Micah Kornfield <[email protected]> wrote:
> > > > > > > This has come up a few times in the sync and other forums. I wanted to start the conversation about deprecating file_path [1] in the Parquet footer.
> > > > > > >
> > > > > > > Outside of the "_metadata" file index use case, I don't think this is used or implemented in any reader (effectively a poor man's table format).
> > > > > > >
> > > > > > > With the rise of table formats, it seems like a reasonable design choice to push the complexity of referencing columns across files to the table level and keep Parquet focused on single-file storage (encodings, indexing, etc.).
> > > > > > >
> > > > > > > Implementing this at the file level can also be challenging in the context of knowing all the credentials one might need to read from different objects on object storage.
> > > > > > >
> > > > > > > Thoughts/Objections?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Micah
> > > > > > >
> > > > > > > [1] https://github.com/apache/parquet-format/blob/3ab52ff2e4e1cbe4c52a3e25c0512803e860c454/src/main/thrift/parquet.thrift#L962
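A note on the validation Dan suggests above for readers that do not support disaggregated column data: a minimal pyarrow-based sketch might look like the following. The function name is invented, and comparing paths verbatim is a simplification; file_path is relative to the referencing file, so real code would normalize both sides before comparing.

    import pyarrow.parquet as pq

    def reject_external_chunks(path):
        # Fail fast instead of misreading byte ranges from the wrong file.
        md = pq.read_metadata(path)
        for rg in range(md.num_row_groups):
            row_group = md.row_group(rg)
            for col in range(row_group.num_columns):
                fp = row_group.column(col).file_path
                if fp and fp != path:
                    raise NotImplementedError(
                        f"column chunk stored in {fp!r}; disaggregated "
                        "column data is not supported by this reader")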

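Finally, on Kenny's annotated.parquet example: once a writer produces such footers (today no mainstream implementation does), inspecting where each column chunk's bytes actually live is straightforward. annotated.parquet here is the hypothetical file from his message.

    import pyarrow.parquet as pq

    md = pq.read_metadata("annotated.parquet")  # hypothetical file
    for rg in range(md.num_row_groups):
        row_group = md.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            # Unset file_path means the bytes live in annotated.parquet
            # itself; "content" would resolve to source.parquet while
            # the new "score" column stays local.
            print(chunk.path_in_schema, "->",
                  chunk.file_path or "annotated.parquet")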