IIUC, the only difference between 1 and 2 is whether to use the serialized
schema from the parquet file in reconstruction? Maybe we can pass that as
an additional flag through archery in the short term, so we can reuse the
JSON descriptions we already have for arrow files?

On Tue, Mar 9, 2021 at 3:50 AM Jorge Cardoso Leitão <
jorgecarlei...@gmail.com> wrote:

> Thanks a lot for your comments, and for the pointers.
>
> I think we could separate this into 3 parts:
>
> 1. arrow implementation X (e.g. C++) reads an on-spec parquet file
> 2. arrow implementation X reads a parquet file written by arrow
> implementation Y
> 3. arrow implementation X reads a parquet file from a (non-arrow)
> implementation Z (e.g. with spark x.y.z metadata)
>
> For part 1, one idea is to declare a one-to-one mapping between parquet's
> .json being discussed on the issue that Micah pointed to and arrow's
> .json. This allows us to leverage, and align with, parquet's work: they
> create "parquet -> easy-to-read json"; we "manually" perform an
> "easy-to-read json -> arrow's json" conversion for a set of files, which
> specs out how an arrow implementation should read a parquet file. We store
> those in a submodule and use them to validate that what our implementations
> read from .parquet equals what we declared in our spec (the "manually"
> created arrow json files).
>
> Note that this outlines a contract for how to read a parquet file that has
> no arrow metadata. We may skip this altogether and leave it to each
> implementation to decide how to do this.
>
> For part 2, I was thinking something along the lines of
>
> > given a json file representing arrow's in-memory format, read it with
> > arrow implementation X, write to parquet, read the parquet with
> > implementation Y, compare against the json
>
> and permute among implementations that have parquet IO capabilities.
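>
> A minimal sketch of such a harness in Python (the per-implementation
> command names and flags below are hypothetical; in practice this would
> hook into archery like the IPC integration tests do):
>
>   import itertools
>   import os
>   import subprocess
>   import tempfile
>
>   # Hypothetical list of implementations with parquet IO capabilities.
>   IMPLEMENTATIONS = ["cpp", "rust"]
>
>   def json_to_parquet(impl, json_path, parquet_path):
>       # Ask implementation `impl` to read the arrow golden JSON file and
>       # write it out as a parquet file.
>       subprocess.check_call(
>           [f"arrow-{impl}-integration", "--json-to-parquet",
>            json_path, parquet_path])
>
>   def validate_parquet(impl, parquet_path, json_path):
>       # Ask implementation `impl` to read the parquet file back and
>       # compare it against the arrow golden JSON file.
>       subprocess.check_call(
>           [f"arrow-{impl}-integration", "--validate-parquet",
>            parquet_path, json_path])
>
>   def run(json_path):
>       # Permute over every (writer, reader) pair of implementations.
>       for writer, reader in itertools.product(IMPLEMENTATIONS, repeat=2):
>           with tempfile.TemporaryDirectory() as tmp:
>               parquet_path = os.path.join(tmp, "roundtrip.parquet")
>               json_to_parquet(writer, json_path, parquet_path)
>               validate_parquet(reader, parquet_path, json_path)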
>
> This outlines the contract for how each implementation should read a
> parquet file _written by another implementation_, and how each
> implementation should write a parquet file. This may be considered an
> extension of the arrow spec (as it describes how the in-memory format
> should be represented in parquet). Note how this poses no requirement on
> whether non-arrow implementations can read these parquet files. E.g. here
> we could state that what C++ writes is right regardless of what the
> parquet spec says.
>
> Part 1 ensures that we can consistently read a parquet file without arrow
> metadata into memory; part 2 ensures that implementations can communicate
> consistently via parquet files. If we ensure both, won't they form a
> necessary and sufficient condition for integration with parquet's
> ecosystem?
>
> I won't comment on problem 3, as my understanding is that there are many
> degrees of freedom to consider and we should just try to do it on a
> best-effort basis.
>
> Would something like this make sense?
>
> Best,
> Jorge
>
> On Tue, Mar 9, 2021 at 12:00 AM Ian Cook <i...@ursacomputing.com> wrote:
>
> > I had some similar thoughts recently and wrote a bit of code to automate
> > the process of running PySpark jobs that write sample Parquet files with
> > multiple versions of Spark:
> > https://github.com/ursa-labs/parquet-lot
> >
> > That repo provides some lightweight scaffolding and includes two simple
> > example tasks (and instructions describing how to create more). Each task
> > defines a small JSON dataset and schema, reads the data into a Spark
> > DataFrame, and writes it out as a Parquet file. It also writes out the
> > original JSON for reference. You can run each task across multiple
> > versions of Spark in several compression formats. Everything runs in
> > GitHub Actions with manual triggers.
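> >
> > The core of each task is roughly the following PySpark snippet (the
> > schema, paths, and compression codec here are just an example; the
> > actual task definitions live in the repo):
> >
> >   from pyspark.sql import SparkSession
> >
> >   spark = SparkSession.builder.appName("parquet-lot").getOrCreate()
> >
> >   # Example schema for the small JSON dataset, expressed as a DDL string.
> >   schema = "id INT, name STRING, value DOUBLE"
> >
> >   # Read the JSON data with the explicit schema ...
> >   df = spark.read.schema(schema).json("tasks/example/data.json")
> >
> >   # ... and write it out as Parquet with the requested compression codec.
> >   df.write.mode("overwrite").parquet("out/example.parquet",
> >                                      compression="snappy")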
> >
> > The initial goal of this was simply to generate some sets of static
> > Parquet files (with JSON reference files) that could be added to the
> > Arrow test suite. Alternatively, it could be spun into something more
> > ambitious, or the test harness code (a part of which is in the separate
> > repo at https://github.com/ursa-labs/polyspark) could be adapted for use
> > in the Arrow test suite.
> >
> > Ian
> >
> > On Sun, Mar 7, 2021 at 4:16 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > I think we're definitely overdue in having Parquet integration tests,
> > > not just in Arrow but data ecosystem wide. It's a bit unfortunate that
> > > the extent of integration testing ends up being "let's see if it
> > > works" and "if it breaks we'll fix it". Some implementations may write
> > > files which can't be consistently or correctly read by all Parquet
> > > implementations, which defeats the purpose of having a standard. Part
> > > of the reason I've tried to discourage non-trivial third party Arrow
> > > implementations is to help Arrow avoid the fate of Parquet, and I
> > > think we've largely succeeded so far at this.
> > >
> > > It would be great if we could at least achieve some internal
> > > consistency within Arrow-friendly Parquet implementations like we have
> > > here. I suspect we could reasonably easily create a testing harness
> > > with Spark to ingest JSON and output Parquet (and vice versa) given an
> > > expected schema. Perhaps we can organize a parent Jira around Parquet
> > > integration testing and create a bunch of issues enumerating what we
> > > would like to do?
> > >
> > > On Fri, Mar 5, 2021 at 2:48 PM Micah Kornfield <emkornfi...@gmail.com>
> > > wrote:
> > > >
> > > > Some planning has started around this in PARQUET-1985 [1]. It seems it
> > > > would be relatively easy in the short term for Rust and C++ to reuse
> > > > archery for this purpose. I agree it is a good thing.
> > > >
> > > >
> > > > [1] https://issues.apache.org/jira/browse/PARQUET-1985
> > > >
> > > > On Fri, Mar 5, 2021 at 12:42 PM Jorge Cardoso Leitão <
> > > > jorgecarlei...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > To run integration with IPC, we have a set of `.arrow` and `.stream`
> > > > > files that different implementations consume and can compare against
> > > > > "golden" .json files that contain the corresponding in-memory
> > > > > representation.
> > > > >
> > > > > I wonder if we should not have equivalent json files corresponding
> > > > > to a curated set of parquet files so that implementations can
> > > > > validate parity with the C++ implementation with respect to how
> > > > > parquet should be converted from and to arrow.
> > > > >
> > > > > Any opinions?
> > > > >
> > > > > Best,
> > > > > Jorge
> > > > >
> > >
> >
>
