I had some similar thoughts recently and wrote a bit of code to automate
the process of running PySpark jobs that write sample Parquet files with
multiple versions of Spark:
https://github.com/ursa-labs/parquet-lot

That repo provides some lightweight scaffolding and includes two simple
example tasks (and instructions describing how to create more). Each task
defines a small JSON dataset and schema, reads the data into a Spark
DataFrame, and writes it out as a Parquet file. It also writes out the
original JSON for reference. You can run each task across multiple versions
of Spark and with several compression codecs. Everything runs in GitHub
Actions with manual triggers.
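
To give a rough idea, a task boils down to something like the PySpark
sketch below. The dataset, schema, output paths, and codec here are made
up for illustration and don't correspond to the actual tasks in the repo:

    # Minimal sketch of a task: define a tiny dataset and schema, load it
    # into a Spark DataFrame, then write Parquet plus the reference JSON.
    # All names and values are illustrative only.
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType, StringType

    spark = SparkSession.builder.appName("example-task").getOrCreate()

    schema = StructType([
        StructField("id", LongType(), nullable=False),
        StructField("name", StringType(), nullable=True),
    ])
    rows = [(1, "alpha"), (2, None), (3, "gamma")]

    df = spark.createDataFrame(rows, schema=schema)

    # Write the Parquet output with a chosen compression codec ...
    df.write.mode("overwrite").option("compression", "snappy").parquet("out/example.parquet")
    # ... and the original data as JSON for reference.
    df.write.mode("overwrite").json("out/example.json")

    spark.stop()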

The initial goal of this was simply to generate some sets of static Parquet
files (with JSON reference files) that could be added to the Arrow test
suite. Alternatively, it could be spun into something more ambitious, or
the test harness code (part of which lives in the separate repo at
https://github.com/ursa-labs/polyspark) could be adapted for use there.

Ian

On Sun, Mar 7, 2021 at 4:16 PM Wes McKinney <wesmck...@gmail.com> wrote:

> I think we're definitely overdue in having Parquet integration tests,
> not just in Arrow but data ecosystem wide. It's a bit unfortunate that
> the extent of integration testing ends up being "let's see if it
> works" and "if it breaks we'll fix it". Some implementations may write
> files which can't be consistently or correctly read by all Parquet
> implementations, which defeats the purpose of having a standard. Part
> of the reason I've tried to discourage non-trivial third party Arrow
> implementations is to help Arrow avoid the fate of Parquet, and I
> think we've largely succeeded so far at this.
>
> It would be great if we could at least achieve some internal
> consistency within Arrow-friendly Parquet implementations like we have
> here. I suspect we could reasonably easily create a testing harness
> with Spark to ingest JSON and output Parquet (and vice versa) given an
> expected schema. Perhaps we can organize a parent Jira around Parquet
> integration testing and create a bunch of issues enumerating what we
> would like to do?
>
> On Fri, Mar 5, 2021 at 2:48 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
> >
> > Some planning has started around this in PARQUET-1985 [1].  It seems it
> > would be relatively easy in the short term for Rust and C++ to reuse
> > archery for this purpose.  I agree it is a good thing.
> >
> >
> > [1] https://issues.apache.org/jira/browse/PARQUET-1985
> >
> > On Fri, Mar 5, 2021 at 12:42 PM Jorge Cardoso Leitão <
> > jorgecarlei...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > To run integration with IPC, we have a set of `.arrow` and `.stream`
> > > files that different implementations consume and can compare against
> > > "golden" .json files that contain the corresponding in-memory
> > > representation.
> > >
> > > I wonder whether we should have equivalent JSON files corresponding to
> > > a curated set of Parquet files, so that implementations can validate
> > > parity with the C++ implementation with respect to how Parquet should
> > > be converted to and from Arrow.
> > >
> > > Any opinions?
> > >
> > > Best,
> > > Jorge
> > >
>
