I think we're definitely overdue for Parquet integration tests, not
just in Arrow but across the data ecosystem. It's a bit unfortunate
that the extent of integration testing so far has amounted to "let's
see if it works" and "if it breaks, we'll fix it". Some
implementations may write files that can't be consistently or
correctly read by all Parquet implementations, which defeats the
purpose of having a standard. Part of the reason I've tried to
discourage non-trivial third-party Arrow implementations is to help
Arrow avoid the fate of Parquet, and I think we've largely succeeded
at this so far.

It would be great if we could at least achieve some internal
consistency among Arrow-friendly Parquet implementations, as we have
here. I suspect we could fairly easily create a testing harness with
Spark that ingests JSON and outputs Parquet (and vice versa) given an
expected schema. Perhaps we can organize a parent Jira around Parquet
integration testing and create a bunch of issues enumerating what we
would like to do?
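
As a rough illustration (an untested sketch; the file paths and the
example schema below are placeholders, not an agreed format), the
Spark side of such a harness could be as simple as:

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, LongType, StringType

    spark = SparkSession.builder.appName("parquet-integration").getOrCreate()

    # Supply the expected schema up front rather than inferring it, so
    # every implementation under test agrees on the types.
    schema = StructType([
        StructField("id", LongType()),
        StructField("name", StringType()),
    ])

    # JSON -> Parquet: ingest the golden JSON and write Parquet for
    # other implementations to consume.
    df = spark.read.schema(schema).json("golden/example.json")
    df.write.mode("overwrite").parquet("out/example.parquet")

    # Parquet -> JSON: read Parquet produced elsewhere and round-trip
    # it back to JSON for comparison against the golden file.
    spark.read.parquet("out/example.parquet") \
        .write.mode("overwrite").json("out/example.json")

A real harness would also need to normalize the JSON output before
comparing, but the read/write plumbing is essentially this.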

On Fri, Mar 5, 2021 at 2:48 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Some planning has started around this in PARQUET-1985 [1].  It seems it
> would be relatively easy in the short term for Rust and C++ to reuse
> archery for this purpose.  I agree it is a good thing.
>
>
> [1] https://issues.apache.org/jira/browse/PARQUET-1985
>
> On Fri, Mar 5, 2021 at 12:42 PM Jorge Cardoso Leitão <
> jorgecarlei...@gmail.com> wrote:
>
> > Hi,
> >
> > To run integration tests for IPC, we have a set of `.arrow` and
> > `.stream` files that different implementations consume and compare
> > against "golden" `.json` files containing the corresponding
> > in-memory representation.
> >
> > I wonder whether we should have equivalent JSON files corresponding
> > to a curated set of Parquet files, so that implementations can
> > validate parity with the C++ implementation with respect to how
> > Parquet should be converted to and from Arrow.
> >
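> > Concretely (a hypothetical sketch, assuming pyarrow and a made-up
> > golden-file layout where the JSON holds a column-name-to-values
> > mapping), a consumer-side check could be as small as:
> >
> >     import json
> >     import pyarrow.parquet as pq
> >
> >     # Read the curated Parquet file and its golden JSON counterpart.
> >     table = pq.read_table("data/simple.parquet")
> >     with open("data/simple.json") as f:
> >         golden = json.load(f)
> >
> >     # Compare the in-memory representation column by column.
> >     assert table.to_pydict() == golden
> >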
> > Any opinions?
> >
> > Best,
> > Jorge
> >
