Re: [DISCUSS] Proposal about integration test of arrow parquet reader

Andy Grove Sat, 12 Oct 2019 09:13:14 -0700

I also think that there are valid use cases for checking in binary files,
but we have to be careful not to abuse this. For example, we might want to
check in a Parquet file created by a particular version of Apache Spark to
ensure that Arrow implementations can read it successfully (hypothetical
example).


It would also be good to have a small set of Parquet files using every
possible data type that all implementations can use in their tests. I
suppose we might want one set per Arrow format version as well.

The problem we have now, in my opinion, is that we're proposing adding
files on a pretty ad-hoc basis, driven by the needs of individual
contributors in one language implementation, and this is perhaps happening
because we don't already have a good set of standard test files.

Renjie - perhaps you could comment on this. If we had these standard files
covering all data types, would that have worked for you in this instance?

Thanks,

Andy.

On Sat, Oct 12, 2019 at 12:03 AM Micah Kornfield <[email protected]>
wrote:

> Hi Wes,
> >
> > I additionally would prefer generating the test corpus at test time
> > rather than checking in binary files.
>
>
> Can you elaborate on this? I think both generated on the fly and example
> files are useful.
>
> The checked in files catch regressions even when readers/writers can read
> their own data but they have either incorrect or undefined behavior in
> regards to the specification (for example I would imagine checking in a
> file as part of the fix for ARROW-6844
> <https://issues.apache.org/jira/browse/ARROW-6844>).
>
> Thanks,
> Micah
>
> On Thu, Oct 10, 2019 at 5:30 PM Renjie Liu <[email protected]>
> wrote:
>
> > Thanks wes. Sure I'll fix it.
> >
> > Wes McKinney <[email protected]> 于 2019年10月11日周五 上午6:10写道：
> >
> > > I just merged the PR https://github.com/apache/arrow-testing/pull/11
> > >
> > > Various aspects of this make me uncomfortable so I hope they can be
> > > addressed in follow up work
> > >
> > > On Thu, Oct 10, 2019 at 5:41 AM Renjie Liu <[email protected]>
> > > wrote:
> > > >
> > > > I've create ticket to track here:
> > > > https://issues.apache.org/jira/browse/ARROW-6845
> > > >
> > > > For this moment, can we check in those pregenerated data to unblock
> > rust
> > > > version's arrow reader?
> > > >
> > > > On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu <[email protected]>
> > > wrote:
> > > >
> > > > > It would be fine in that case.
> > > > >
> > > > > Wes McKinney <[email protected]> 于 2019年10月10日周四 下午12:58写道：
> > > > >
> > > > >> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu <
> [email protected]
> > >
> > > > >> wrote:
> > > > >> >
> > > > >> > 1. There already exists a low level parquet writer which can
> > produce
> > > > >> > parquet file, so unit test should be fine. But writer from arrow
> > to
> > > > >> parquet
> > > > >> > doesn't exist yet, and it may take some period of time to finish
> > it.
> > > > >> > 2. In fact my data are randomly generated and it's definitely
> > > > >> reproducible.
> > > > >> > However, I don't think it would be good idea to randomly
> generate
> > > data
> > > > >> > everytime we run ci because it would be difficult to debug. For
> > > example
> > > > >> PR
> > > > >> > a introduced a bug, which is triggerred in other PR's build it
> > > would be
> > > > >> > confusing for contributors.
> > > > >>
> > > > >> Presumably any random data generation would use a fixed seed
> > precisely
> > > > >> to be reproducible.
> > > > >>
> > > > >> > 3. I think it would be good idea to spend effort on integration
> > test
> > > > >> with
> > > > >> > parquet because it's an important use case of arrow. Also
> similar
> > > > >> approach
> > > > >> > could be extended to other language and other file format(avro,
> > > orc).
> > > > >> >
> > > > >> >
> > > > >> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <
> [email protected]
> > >
> > > > >> wrote:
> > > > >> >
> > > > >> > > There are a number of issues worth discussion.
> > > > >> > >
> > > > >> > > 1. What is the timeline/plan for Rust implementing a Parquet
> > > _writer_?
> > > > >> > > It's OK to be reliant on other libraries in the short term to
> > > produce
> > > > >> > > files to test against, but does not strike me as a sustainable
> > > > >> > > long-term plan. Fixing bugs can be a lot more difficult than
> it
> > > needs
> > > > >> > > to be if you can't write targeted "endogenous" unit tests
> > > > >> > >
> > > > >> > > 2. Reproducible data generation
> > > > >> > >
> > > > >> > > I think if you're going to test against a pre-generated
> corpus,
> > > you
> > > > >> > > should make sure that generating the corpus is reproducible
> for
> > > other
> > > > >> > > developers (i.e. with a Dockerfile), and can be extended by
> > > adding new
> > > > >> > > files or random data generation.
> > > > >> > >
> > > > >> > > I additionally would prefer generating the test corpus at test
> > > time
> > > > >> > > rather than checking in binary files. If this isn't viable
> right
> > > now
> > > > >> > > we can create an "arrow-rust-crutch" git repository for you to
> > > stash
> > > > >> > > binary files until some of these testing scalability issues
> are
> > > > >> > > addressed.
> > > > >> > >
> > > > >> > > If we're going to spend energy on Parquet integration testing
> > with
> > > > >> > > Java, this would be a good opportunity to do the work in a way
> > > where
> > > > >> > > the C++ Parquet library can also participate (since we ought
> to
> > be
> > > > >> > > doing integration tests with Java, and we can also read JSON
> > > files to
> > > > >> > > Arrow).
> > > > >> > >
> > > > >> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu <
> > > [email protected]>
> > > > >> > > wrote:
> > > > >> > > >
> > > > >> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove <
> > > [email protected]>
> > > > >> > > wrote:
> > > > >> > > >
> > > > >> > > > > I'm very interested in helping to find a solution to this
> > > because
> > > > >> we
> > > > >> > > really
> > > > >> > > > > do need integration tests for Rust to make sure we're
> > > compatible
> > > > >> with
> > > > >> > > other
> > > > >> > > > > implementations... there is also the ongoing CI
> > dockerization
> > > work
> > > > >> > > that I
> > > > >> > > > > feel is related.
> > > > >> > > > >
> > > > >> > > > > I haven't looked at the current integration tests yet and
> > > would
> > > > >> > > appreciate
> > > > >> > > > > some pointers on how all of this works (do we have docs?)
> or
> > > > >> where to
> > > > >> > > start
> > > > >> > > > > looking.
> > > > >> > > > >
> > > > >> > > > I have a test in my latest PR:
> > > > >> https://github.com/apache/arrow/pull/5523
> > > > >> > > > And here is the generated data:
> > > > >> > > > https://github.com/apache/arrow-testing/pull/11
> > > > >> > > > As with program to generate these data, it's just a simple
> > java
> > > > >> program.
> > > > >> > > > I'm not sure whether we need to integrate it into arrow.
> > > > >> > > >
> > > > >> > > > >
> > > > >> > > > > I imagine the integration test could follow the approach
> > that
> > > > >> Renjie is
> > > > >> > > > > outlining where we call Java to generate some files and
> then
> > > call
> > > > >> Rust
> > > > >> > > to
> > > > >> > > > > parse them?
> > > > >> > > > >
> > > > >> > > > > Thanks,
> > > > >> > > > >
> > > > >> > > > > Andy.
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu <
> > > > >> [email protected]>
> > > > >> > > wrote:
> > > > >> > > > >
> > > > >> > > > > > Hi:
> > > > >> > > > > >
> > > > >> > > > > > I'm developing rust version of reader which reads
> parquet
> > > into
> > > > >> arrow
> > > > >> > > > > array.
> > > > >> > > > > > To verify the correct of this reader, I use the
> following
> > > > >> approach:
> > > > >> > > > > >
> > > > >> > > > > >
> > > > >> > > > > >    1. Define schema with protobuf.
> > > > >> > > > > >    2. Generate json data of this schema using other
> > language
> > > > >> with
> > > > >> > > more
> > > > >> > > > > >    sophisticated implementation (e.g. java)
> > > > >> > > > > >    3. Generate parquet data of this schema using other
> > > language
> > > > >> with
> > > > >> > > more
> > > > >> > > > > >    sophisticated implementation (e.g. java)
> > > > >> > > > > >    4. Write tests to read json file, and parquet file
> into
> > > > >> memory
> > > > >> > > (arrow
> > > > >> > > > > >    array), then compare json data with arrow data.
> > > > >> > > > > >
> > > > >> > > > > >  I think with this method we can guarantee the
> correctness
> > > of
> > > > >> arrow
> > > > >> > > > > reader
> > > > >> > > > > > because json format is ubiquitous and their
> implementation
> > > are
> > > > >> more
> > > > >> > > > > stable.
> > > > >> > > > > >
> > > > >> > > > > > Any comment is appreciated.
> > > > >> > > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > --
> > > > >> > > > Renjie Liu
> > > > >> > > > Software Engineer, MVAD
> > > > >> > >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Renjie Liu
> > > > >> > Software Engineer, MVAD
> > > > >>
> > > > >
> > > >
> > > > --
> > > > Renjie Liu
> > > > Software Engineer, MVAD
> > >
> >
>

Re: [DISCUSS] Proposal about integration test of arrow parquet reader

Reply via email to