Just a note that I have preliminary example files up for geometry/geography at [1]. We are planning some GeoArrow integration tests in geoarrow-data [2] and will probably add Parquet files to those tests, and to whatever system the Parquet community comes up with.

Cheers,

-dewey

[1] https://github.com/apache/parquet-testing/pull/70
[2] https://github.com/geoarrow/geoarrow-data

On Sat, Feb 15, 2025 at 8:10 AM Andrew Lamb <[email protected]> wrote:

> I think getting something setup, initially focused on variant (or geometry)
> and then expanding it over time makes lots of sense to me
>
> Andrew
>
> On Fri, Feb 14, 2025 at 5:36 PM Bryce Mecum <[email protected]> wrote:
>
> > Hi Gang, that does seem like a good idea. Would there be any benefit
> > to trying that with the active spec changes like GEOMETRY/GEOGRAPHY or
> > VARIANT?
> >
> > On Wed, Feb 5, 2025 at 9:14 PM Gang Wu <[email protected]> wrote:
> > >
> > > As the troublemaker of the mentioned issue above, I'd say that
> > > a lesson learned is that we should publish example files for any
> > > new feature to the parquet-testing [1] repo for interoperability tests.
> > > Perhaps we need a staging repo/branch to store produced files
> > > during the active development. This may help catch common issues
> > > as early as possible.
> > >
> > > [1] https://github.com/apache/parquet-testing
> > >
> > > Best,
> > > Gang
> > >
> > > On Thu, Jan 30, 2025 at 6:55 PM Andrew Lamb <[email protected]> wrote:
> > >
> > > > This is a great idea. There is a previous discussion about a
> > > > similar idea here[1]
> > > >
> > > > Specifically, I think Alkis's sketch of the "carpenter" program
> > > > would have caught this situation.
> > > >
> > > > In my opinion, improving interoperability testing like this is a
> > > > key step towards being able to reliably evolve the Parquet
> > > > standard itself.
> > > >
> > > > Andrew
> > > >
> > > > [1]: https://github.com/apache/parquet-format/issues/441
> > > >
> > > > On Wed, Jan 29, 2025 at 3:49 PM Bryce Mecum <[email protected]> wrote:
> > > >
> > > > > Hello Parquet community,
> > > > >
> > > > > The Arrow project recently fixed a bug [1] in its C++ Parquet
> > > > > implementation that was causing compliant Parquet files written by
> > > > > recent versions of parquet-rs [2] to be unreadable by the C++
> > > > > implementation due to differences in the implementation of
> > > > > Parquet’s SizeStatistics feature [3]. This also affected the Arrow
> > > > > libraries that bind to the C++ implementation, including PyArrow.
> > > > > The C++ implementation has been patched [4] and a new Arrow
> > > > > release (19.0.1) is in the works.
> > > > >
> > > > > Given this, I wanted to start a discussion about what kind of
> > > > > cross-implementation testing facilities may already exist in any of
> > > > > the Parquet implementations and what kind of testing facilities
> > > > > might be created to help catch situations like these.
> > > > >
> > > > > I’ll start off with my thoughts and encourage people to jump in:
> > > > >
> > > > > 1. The specific integration test that could have been run to catch
> > > > > this bug would be a test that used the Arrow 19.0.0 release
> > > > > candidate to read any Parquet file written by parquet-rs >=53.0.
> > > > > This would have halted the release process. Should the Arrow
> > > > > project just add a CI job like this and move on?
> > > > > 2. Testing every combination of Parquet format versions, feature
> > > > > toggles, implementations, and implementation versions is clearly
> > > > > too large a problem to solve so it might be best to start off with
> > > > > a narrow scope.
> > > > >
> > > > > Please note that I've cross-posted this to the Apache Arrow mailing
> > > > > list. Please reply to the Apache Parquet post. I’m looking forward
> > > > > to hearing others’ thoughts and ideas.
> > > > >
> > > > > Thanks,
> > > > > Bryce
> > > > >
> > > > > [1] https://github.com/apache/arrow/issues/45283
> > > > > [2] https://github.com/apache/arrow-rs/tree/main/parquet
> > > > > [3] https://github.com/apache/parquet-format/pull/197
> > > > > [4] https://github.com/apache/arrow/pull/45285
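
For illustration, here is a minimal sketch of the kind of check described in point 1 of the original message quoted above: try to read every Parquet file in a directory and report which ones fail. The function names (`read_with_pyarrow`, `check_files`, `failures`) and the directory layout are hypothetical, and the only PyArrow call assumed is `pyarrow.parquet.read_table`. This is not the Arrow project's actual CI job, just one way such a test could look.

```python
from pathlib import Path
from typing import Callable, Dict, Optional


def read_with_pyarrow(path: Path) -> None:
    """Fully decode one Parquet file with PyArrow; raises if it cannot be read."""
    # Imported lazily so the harness can also be driven by another reader.
    import pyarrow.parquet as pq

    pq.read_table(str(path))


def check_files(
    directory: Path,
    reader: Callable[[Path], None] = read_with_pyarrow,
) -> Dict[str, Optional[str]]:
    """Try to read every *.parquet file under `directory`.

    Returns a mapping from file name to None (read succeeded) or to a
    short error description (read failed).
    """
    results: Dict[str, Optional[str]] = {}
    for path in sorted(directory.rglob("*.parquet")):
        try:
            reader(path)
            results[path.name] = None
        except Exception as exc:  # any read error counts as an interop failure
            results[path.name] = f"{type(exc).__name__}: {exc}"
    return results


def failures(results: Dict[str, Optional[str]]) -> Dict[str, str]:
    """Keep only the files that could not be read."""
    return {name: err for name, err in results.items() if err is not None}
```

In a CI job, the directory could hold files written by another implementation (for example, a checkout of parquet-testing), and the job would fail whenever `failures(check_files(...))` is non-empty.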
