One data point from the fleet. While testing a reader that ignores path_in_schema, we found writers in the wild that do not follow the spec, and path_in_schema is what keeps their files readable. The example is a Parquet file with N leaf schema elements but only K column metadata entries per row group, where K < N. If one resolves columns by path_in_schema, the selected columns are found and read correctly. If one matches by schema element order, chaos ensues.
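The mismatch above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Leaf, ColumnMeta, resolve_by_path and resolve_by_index are hypothetical stand-ins, not the API of any Parquet reader. The file has N = 3 INT32 leaves and K = 2 column metadata entries because the writer dropped the entry for "b".

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    """Hypothetical stand-in for a leaf schema element."""
    path: tuple
    physical_type: str


@dataclass
class ColumnMeta:
    """Hypothetical stand-in for one column metadata entry in a row group."""
    path_in_schema: tuple
    physical_type: str
    values: list


def resolve_by_path(chunks, leaf):
    """Find the chunk whose path_in_schema equals the leaf's path."""
    for chunk in chunks:
        if chunk.path_in_schema == leaf.path:
            return chunk
    return None  # the column is visibly missing


def resolve_by_index(leaves, chunks, leaf):
    """Assume the i-th chunk belongs to the i-th leaf (ignores path_in_schema)."""
    i = leaves.index(leaf)
    if i >= len(chunks):
        return None
    chunk = chunks[i]
    if chunk.physical_type != leaf.physical_type:
        raise ValueError("physical type mismatch")
    return chunk  # the type check passed, but this may be the wrong column


# N = 3 leaves, all INT32.
leaves = [Leaf(("a",), "INT32"), Leaf(("b",), "INT32"), Leaf(("c",), "INT32")]

# K = 2 entries: the non-conforming writer omitted "b".
chunks = [
    ColumnMeta(("a",), "INT32", [1, 2]),
    ColumnMeta(("c",), "INT32", [5, 6]),
]

# Path-based resolution: "c" is found, "b" is reported as missing.
c_by_path = resolve_by_path(chunks, leaves[2])
b_by_path = resolve_by_path(chunks, leaves[1])

# Index-based resolution: "b" silently returns the data of "c",
# and "c" falls off the end of the list.
b_by_index = resolve_by_index(leaves, chunks, leaves[1])
c_by_index = resolve_by_index(leaves, chunks, leaves[2])
```

In this sketch the only guard available to the index-based reader is the physical type check, and it passes because every column is INT32.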
To err on the side of caution, we should not make this change lightly. We need a version change to drop this field; otherwise we risk failed reads and, even worse, data loss. Consider the case of many INT32 columns where one of them is missing from the column metadata. If index-based resolution lands on the wrong column but the type matches, the reader will happily read it even though it is the wrong column.

On Fri, May 29, 2026 at 6:14 AM Ed Seidl <[email protected]> wrote:

> Hi all,
>
> Quick update on this. A third PoC implementation in arrow-cpp has been
> created [1], and a file without the path_in_schema field (created with
> arrow-rs) has been submitted to parquet-testing [2]. I've confirmed that
> the java and cpp PoCs can properly read the file. I'll be proposing a
> vote on this proposal soon if no objections are raised here or in the
> PR [3].
>
> Cheers,
> Ed
>
> [1] https://github.com/apache/arrow/pull/49707
> [2] https://github.com/apache/parquet-testing/pull/108
> [3] https://github.com/apache/parquet-format/pull/564
>
> On 2026/04/22 20:58:46 Micah Kornfield wrote:
> > I need to review the implementations more carefully, but I think this
> > looks good. Maybe we should give people through next week for people
> > to review and then we can start a vote?
> >
> > On Wed, Apr 22, 2026 at 1:45 PM Steve Loughran <[email protected]> wrote:
> >
> > > following on from the discussion today
> > >
> > > 1. I can see the benefits in tagging it as optional
> > > 2. it would be a long time before the systems I field support calls
> > >    over would stop generating it because we don't know where data
> > >    would end up being used.
> > > 3. For those people who are encountering major problems here, it
> > >    would at least be possible to say "provided you intend to only
> > >    work with versions of <product> dated 2027 or newer, all is good.
> > >
> > > making the field optional as soon as possible would increase the
> > > time at which parquet releases can actually stop adding the field.
> > >
> > > Being able to tie it to a non-backwards-compatible database change
> > > (and I'm thinking Iceberg v4 tables) would provide a clear way to
> > > scope that incompatibility. Imagine if iceberg was set up to turn
> > > the feature of when generating files for v4 tables, knowing all
> > > applications which could read the tables wouldn't need
> > > path_in_schema. *regardless of the language of that implementation*
> > >
> > > steve
> > >
> > > On Mon, 20 Apr 2026 at 09:34, Gang Wu <[email protected]> wrote:
> > >
> > > > Thanks Ed for raising this!
> > > >
> > > > Overall I'm +1 to this. We need input from others since it is a
> > > > slight breaking change.
> > > >
> > > > Best,
> > > > Gang
> > > >
> > > > On Thu, Apr 9, 2026 at 9:41 PM Ed Seidl <[email protected]> wrote:
> > > >
> > > > > Hi All,
> > > > >
> > > > > Following a lively discussion on this list, I thought I’d take
> > > > > a stab at addressing one pain point in the Parquet footer. I’ve
> > > > > put up a proposal [1] and PR [2] to switch path_in_schema in
> > > > > the ColumnMetaData from “required” to “optional”. I’ve also
> > > > > whipped up PoCs in Rust [3] and Java [4].
> > > > >
> > > > > Please take a look and let’s discuss in the PR.
> > > > >
> > > > > Thanks,
> > > > > Ed
> > > > >
> > > > > [1] https://github.com/apache/parquet-format/issues/563
> > > > > [2] https://github.com/apache/parquet-format/pull/564
> > > > > [3] https://github.com/apache/arrow-rs/pull/9678
> > > > > [4] https://github.com/apache/parquet-java/pull/3470
