Thanks a lot for your comments, and for the pointers. I think we could split this into three separate parts:
1. arrow implementation X (e.g. C++) reads an on-spec parquet file
2. arrow implementation X reads a parquet file written by arrow implementation Y
3. arrow implementation X reads a parquet file written by a (non-arrow) implementation Z (e.g. with spark x.y.z metadata)

For part 1, one idea is to declare a one-to-one mapping between parquet's .json files being discussed on the issue that Micah pointed to and arrow's .json files. This lets us leverage, and align with, parquet's work: they provide the "parquet -> easy-to-read json" step; we "manually" perform the "easy-to-read json -> arrow json" conversion for a set of files, which specs out how an arrow implementation should read a parquet file. We store those files in a submodule and use them to validate that what our implementations read from .parquet equals what we declared in our spec (the "manually" created arrow json files). Note that this outlines a contract for how to read a parquet file that has no arrow metadata. We may skip this altogether and leave it up to each implementation to decide how to handle it.

For part 2, I was thinking of something along the lines of: given a json file representing arrow's in-memory format, read it with arrow implementation X, write it to parquet, read that parquet with implementation Y, compare the result against the json, and permute among the implementations that have parquet IO capabilities (a rough sketch of such a harness follows below). This outlines the contract for how each implementation should read a parquet file _written by another implementation_, and how each implementation should write a parquet file. It may be considered an extension of the arrow spec, as it describes how the in-memory format should be represented in parquet. Note that it poses no requirement on whether non-arrow implementations can read these parquet files or not; e.g. here we could state that whatever C++ writes is right, regardless of what the parquet spec says.

Part 1 ensures that we can consistently read a parquet file without arrow metadata into memory; part 2 ensures that implementations can communicate consistently through parquet files. If we ensure both, won't they together form a necessary and sufficient condition for integration with parquet's ecosystem?

I won't comment on problem 3, as my understanding is that there are many degrees of freedom to consider and we should just approach it on a best-effort basis.
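To make the shape of parts 1 and 2 concrete, the harness could look roughly like the sketch below, in the spirit of the existing archery integration tests. To be clear, the `arrow-integration-<impl>` commands, their `json-to-parquet` / `parquet-to-json` subcommands, and the file layout are hypothetical placeholders; each implementation would have to expose something equivalent, and the comparison would need to be a semantic check of the arrow JSON integration format rather than textual equality.

```python
import itertools
import json
import subprocess
from pathlib import Path

# Hypothetical per-implementation helper binaries; each implementation would
# need to expose two commands:
#   arrow-integration-<impl> json-to-parquet <golden.json> <out.parquet>
#   arrow-integration-<impl> parquet-to-json <in.parquet> <out.json>
IMPLEMENTATIONS = ["cpp", "rust"]
GOLDEN_DIR = Path("golden")      # curated arrow .json files (the "spec")
WORK_DIR = Path("/tmp/parquet-integration")


def run(impl: str, command: str, src: Path, dst: Path) -> None:
    subprocess.run(
        [f"arrow-integration-{impl}", command, str(src), str(dst)], check=True
    )


def json_equal(a: Path, b: Path) -> bool:
    # In practice this should be a semantic comparison of the arrow JSON
    # integration format (schema, nullability, values), not byte equality.
    return json.loads(a.read_text()) == json.loads(b.read_text())


WORK_DIR.mkdir(parents=True, exist_ok=True)
for golden in sorted(GOLDEN_DIR.glob("*.json")):
    # Part 1: a curated parquet file without arrow metadata, stored next to
    # its golden json, must be read by every implementation as that json.
    external = golden.with_suffix(".parquet")
    if external.exists():
        for reader in IMPLEMENTATIONS:
            produced = WORK_DIR / f"{golden.stem}.external.{reader}.json"
            run(reader, "parquet-to-json", external, produced)
            assert json_equal(golden, produced), f"part 1, {reader}: {golden.name}"

    # Part 2: every (writer, reader) pair of arrow implementations must
    # round-trip the golden data through parquet.
    for writer, reader in itertools.product(IMPLEMENTATIONS, repeat=2):
        parquet = WORK_DIR / f"{golden.stem}.{writer}.parquet"
        produced = WORK_DIR / f"{golden.stem}.{writer}.{reader}.json"
        run(writer, "json-to-parquet", golden, parquet)
        run(reader, "parquet-to-json", parquet, produced)
        assert json_equal(golden, produced), f"part 2, {writer} -> {reader}: {golden.name}"
```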
Would something like this make sense?

Best,
Jorge

On Tue, Mar 9, 2021 at 12:00 AM Ian Cook <i...@ursacomputing.com> wrote:

> I had some similar thoughts recently and wrote a bit of code to automate the process of running PySpark jobs that write sample Parquet files with multiple versions of Spark: https://github.com/ursa-labs/parquet-lot
>
> That repo provides some lightweight scaffolding and includes two simple example tasks (and instructions describing how to create more). Each task defines a small JSON dataset and schema, reads the data into a Spark DataFrame, and writes it out as a Parquet file. It also writes out the original JSON for reference. You can run each task across multiple versions of Spark in several compression formats. Everything runs in GitHub Actions with manual triggers.
>
> The initial goal of this was simply to generate some sets of static Parquet files (with JSON reference files) that could be added to the Arrow test suite. Alternatively, it could be spun into something more ambitious, or the test harness code (a part of which is in the separate repo at https://github.com/ursa-labs/polyspark) could be adapted for use in the Arrow test suite.
>
> Ian
>
> On Sun, Mar 7, 2021 at 4:16 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > I think we're definitely overdue in having Parquet integration tests, not just in Arrow but data ecosystem wide. It's a bit unfortunate that the extent of integration testing ends up being "let's see if it works" and "if it breaks we'll fix it". Some implementations may write files which can't be consistently or correctly read by all Parquet implementations, which defeats the purpose of having a standard. Part of the reason I've tried to discourage non-trivial third party Arrow implementations is to help Arrow avoid the fate of Parquet, and I think we've largely succeeded so far at this.
> >
> > It would be great if we could at least achieve some internal consistency within Arrow-friendly Parquet implementations like we have here. I suspect we could reasonably easily create a testing harness with Spark to ingest JSON and output Parquet (and vice versa) given an expected schema. Perhaps we can organize a parent Jira around Parquet integration testing and create a bunch of issues enumerating what we would like to do?
> >
> > On Fri, Mar 5, 2021 at 2:48 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >
> > > Some planning has started around this in PARQUET-1985 [1]. It seems it would be relatively easy in the short term for Rust and C++ to reuse archery for this purpose. I agree it is a good thing.
> > >
> > > [1] https://issues.apache.org/jira/browse/PARQUET-1985
> > >
> > > On Fri, Mar 5, 2021 at 12:42 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > To run integration with IPC, we have a set of `.arrow` and `.stream` files that different implementations consume and can compare against "golden" .json files that contain the corresponding in-memory representation.
> > > >
> > > > I wonder if we should not have equivalent json files corresponding to a curated set of parquet files so that implementations can validate parity with the C++ implementation with respect to how parquet should be converted from and to arrow.
> > > >
> > > > Any opinions?
> > > >
> > > > Best,
> > > > Jorge
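As a concrete point of reference for the Spark-based harness Ian and Wes describe above, a single parquet-lot-style task boils down to roughly the following PySpark snippet. The dataset, schema, and output paths here are invented for illustration and are not taken from the actual repo; the surrounding harness would matrix this over Spark versions and compression codecs.

```python
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("parquet-integration-example").getOrCreate()

# A small hand-written dataset and an explicit schema (illustrative only).
records = [{"id": 1, "name": "a"}, {"id": 2, "name": None}]
schema = StructType([
    StructField("id", LongType(), nullable=True),
    StructField("name", StringType(), nullable=True),
])

# Keep the original JSON around as the reference file (newline-delimited,
# which is what Spark's JSON reader expects).
with open("example.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the JSON back through Spark with the declared schema and write the
# Parquet file under test.
df = spark.read.schema(schema).json("example.json")
df.write.mode("overwrite").option("compression", "snappy").parquet("example.parquet")
```

Running that across Spark versions would give a set of static parquet + json pairs that could feed directly into the golden-file checks from parts 1 and 3.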