Thanks a lot for your comments, and for the pointers. I think we could split this into three separate parts:
1. arrow implementation X (e.g. C++) reads an on-spec parquet file
2. arrow implementation X reads a parquet file written by arrow implementation Y
3. arrow implementation X reads a parquet file written by a (non-arrow) implementation Z (e.g. with spark x.y.z metadata)

For part 1, one idea is to declare a one-to-one mapping between parquet's .json files being discussed on the issue that Micah pointed to and arrow's .json files. This lets us leverage, and align with, parquet's work: they provide the "parquet -> easy-to-read json" step; we "manually" perform the "easy-to-read json -> arrow json" conversion for a set of files, which specs out how an arrow implementation should read a parquet file. We store those files in a submodule and use them to validate that what our implementations read from .parquet equals what we declared in our spec (the "manually" created arrow json files). Note that this outlines a contract for how to read a parquet file that has no arrow metadata. We may skip this altogether and leave it up to each implementation to decide how to handle it.

For part 2, I was thinking of something along the lines of: given a json file representing arrow's in-memory format, read it with arrow implementation X, write it to parquet, read that parquet with implementation Y, compare the result against the json, and permute among the implementations that have parquet IO capabilities (a rough sketch of such a harness follows below). This outlines the contract for how each implementation should read a parquet file _written by another implementation_, and how each implementation should write a parquet file. It may be considered an extension of the arrow spec, as it describes how the in-memory format should be represented in parquet. Note that it poses no requirement on whether non-arrow implementations can read these parquet files or not; e.g. here we could state that whatever C++ writes is right, regardless of what the parquet spec says.

Part 1 ensures that we can consistently read a parquet file without arrow metadata into memory; part 2 ensures that implementations can communicate consistently through parquet files. If we ensure both, won't they together form a necessary and sufficient condition for integration with parquet's ecosystem?

I won't comment on problem 3, as my understanding is that there are many degrees of freedom to consider and we should just approach it on a best-effort basis.
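To make the shape of parts 1 and 2 concrete, the harness could look roughly like the sketch below, in the spirit of the existing archery integration tests. To be clear, the `arrow-integration-<impl>` commands, their `json-to-parquet` / `parquet-to-json` subcommands, and the file layout are hypothetical placeholders; each implementation would have to expose something equivalent, and the comparison would need to be a semantic check of the arrow JSON integration format rather than textual equality.

```python
import itertools
import json
import subprocess
from pathlib import Path

# Hypothetical per-implementation helper binaries; each implementation would
# need to expose two commands:
#   arrow-integration-<impl> json-to-parquet <golden.json> <out.parquet>
#   arrow-integration-<impl> parquet-to-json <in.parquet> <out.json>
IMPLEMENTATIONS = ["cpp", "rust"]
GOLDEN_DIR = Path("golden")      # curated arrow .json files (the "spec")
WORK_DIR = Path("/tmp/parquet-integration")


def run(impl: str, command: str, src: Path, dst: Path) -> None:
    subprocess.run(
        [f"arrow-integration-{impl}", command, str(src), str(dst)], check=True
    )


def json_equal(a: Path, b: Path) -> bool:
    # In practice this should be a semantic comparison of the arrow JSON
    # integration format (schema, nullability, values), not byte equality.
    return json.loads(a.read_text()) == json.loads(b.read_text())


WORK_DIR.mkdir(parents=True, exist_ok=True)
for golden in sorted(GOLDEN_DIR.glob("*.json")):
    # Part 1: a curated parquet file without arrow metadata, stored next to
    # its golden json, must be read by every implementation as that json.
    external = golden.with_suffix(".parquet")
    if external.exists():
        for reader in IMPLEMENTATIONS:
            produced = WORK_DIR / f"{golden.stem}.external.{reader}.json"
            run(reader, "parquet-to-json", external, produced)
            assert json_equal(golden, produced), f"part 1, {reader}: {golden.name}"

    # Part 2: every (writer, reader) pair of arrow implementations must
    # round-trip the golden data through parquet.
    for writer, reader in itertools.product(IMPLEMENTATIONS, repeat=2):
        parquet = WORK_DIR / f"{golden.stem}.{writer}.parquet"
        produced = WORK_DIR / f"{golden.stem}.{writer}.{reader}.json"
        run(writer, "json-to-parquet", golden, parquet)
        run(reader, "parquet-to-json", parquet, produced)
        assert json_equal(golden, produced), f"part 2, {writer} -> {reader}: {golden.name}"
```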
Would something like this make sense?

Best,
Jorge

On Tue, Mar 9, 2021 at 12:00 AM Ian Cook <i...@ursacomputing.com> wrote:

> I had some similar thoughts recently and wrote a bit of code to automate the process of running PySpark jobs that write sample Parquet files with multiple versions of Spark: https://github.com/ursa-labs/parquet-lot
>
> That repo provides some lightweight scaffolding and includes two simple example tasks (and instructions describing how to create more). Each task defines a small JSON dataset and schema, reads the data into a Spark DataFrame, and writes it out as a Parquet file. It also writes out the original JSON for reference. You can run each task across multiple versions of Spark in several compression formats. Everything runs in GitHub Actions with manual triggers.
>
> The initial goal of this was simply to generate some sets of static Parquet files (with JSON reference files) that could be added to the Arrow test suite. Alternatively, it could be spun into something more ambitious, or the test harness code (a part of which is in the separate repo at https://github.com/ursa-labs/polyspark) could be adapted for use in the Arrow test suite.
>
> Ian
>
> On Sun, Mar 7, 2021 at 4:16 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > I think we're definitely overdue in having Parquet integration tests, not just in Arrow but data ecosystem wide. It's a bit unfortunate that the extent of integration testing ends up being "let's see if it works" and "if it breaks we'll fix it". Some implementations may write files which can't be consistently or correctly read by all Parquet implementations, which defeats the purpose of having a standard. Part of the reason I've tried to discourage non-trivial third party Arrow implementations is to help Arrow avoid the fate of Parquet, and I think we've largely succeeded so far at this.
> >
> > It would be great if we could at least achieve some internal consistency within Arrow-friendly Parquet implementations like we have here. I suspect we could reasonably easily create a testing harness with Spark to ingest JSON and output Parquet (and vice versa) given an expected schema. Perhaps we can organize a parent Jira around Parquet integration testing and create a bunch of issues enumerating what we would like to do?
> >
> > On Fri, Mar 5, 2021 at 2:48 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > >
> > > Some planning has started around this in PARQUET-1985 [1]. It seems it would be relatively easy in the short term for Rust and C++ to reuse archery for this purpose. I agree it is a good thing.
> > >
> > > [1] https://issues.apache.org/jira/browse/PARQUET-1985
> > >
> > > On Fri, Mar 5, 2021 at 12:42 PM Jorge Cardoso Leitão <jorgecarlei...@gmail.com> wrote:
> > > >
> > > > Hi,
> > > >
> > > > To run integration with IPC, we have a set of `.arrow` and `.stream` files that different implementations consume and can compare against "golden" .json files that contain the corresponding in-memory representation.
> > > >
> > > > I wonder if we should not have equivalent json files corresponding to a curated set of parquet files so that implementations can validate parity with the C++ implementation with respect to how parquet should be converted from and to arrow.
> > > >
> > > > Any opinions?
> > > >
> > > > Best,
> > > > Jorge
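As a concrete point of reference for the Spark-based harness Ian and Wes describe above, a single parquet-lot-style task boils down to roughly the following PySpark snippet. The dataset, schema, and output paths here are invented for illustration and are not taken from the actual repo; the surrounding harness would matrix this over Spark versions and compression codecs.

```python
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("parquet-integration-example").getOrCreate()

# A small hand-written dataset and an explicit schema (illustrative only).
records = [{"id": 1, "name": "a"}, {"id": 2, "name": None}]
schema = StructType([
    StructField("id", LongType(), nullable=True),
    StructField("name", StringType(), nullable=True),
])

# Keep the original JSON around as the reference file (newline-delimited,
# which is what Spark's JSON reader expects).
with open("example.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the JSON back through Spark with the declared schema and write the
# Parquet file under test.
df = spark.read.schema(schema).json("example.json")
df.write.mode("overwrite").option("compression", "snappy").parquet("example.parquet")
```

Running that across Spark versions would give a set of static parquet + json pairs that could feed directly into the golden-file checks from parts 1 and 3.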