Hi Ryan,

Thanks for the quick feedback! We also have an integration test version of
the contract validation, which uses in-memory data and validates the
min/max/boundary order, as well as the correctness of the filtering - this
would have been the next installment, so *spoiler alert* :)

https://github.com/apache/parquet-mr/blob/e7db9e20f52c925a207ea62d6dda6dc4e870294e/parquet-column/src/test/java/org/apache/parquet/internal/column/columnindex/TestColumnIndexBuilder.java

The test cases cover both the contract validation and the filtering for all
types - a more extensive test suite than the tool.
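
For a flavour of what the contract checks amount to, here is a minimal,
self-contained sketch of the per-page part (hypothetical names and plain int
values for brevity - not the actual TestColumnIndexBuilder code, which covers
all types and the filtering as well):

import java.util.List;

public class PageStatsContractSketch {

  // pageValues: the values of one page as written; null entries are NULLs.
  // indexMin/indexMax/indexNullCount/markedNullPage: what the column index
  // records for that page (min/max assumed present when non-null values exist).
  static void checkPage(List<Integer> pageValues, Integer indexMin,
                        Integer indexMax, long indexNullCount,
                        boolean markedNullPage) {
    long nulls = pageValues.stream().filter(v -> v == null).count();

    // Only pages consisting entirely of NULLs may be marked as null pages.
    if (markedNullPage && nulls != pageValues.size()) {
      throw new AssertionError("non-null values in a page marked as null page");
    }
    // The null count stored in the index must equal the actual number of nulls.
    if (indexNullCount != nulls) {
      throw new AssertionError("null count mismatch: index=" + indexNullCount
          + " actual=" + nulls);
    }
    // The stored min must be <= and the stored max >= every non-null value.
    for (Integer v : pageValues) {
      if (v != null && (v < indexMin || v > indexMax)) {
        throw new AssertionError("value " + v + " outside [" + indexMin + ", "
            + indexMax + "]");
      }
    }
  }
}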

For the tool/e2e case, most of the comments were about improvements and
extensions, not a lack of functionality - please correct me if I'm wrong.
While I agree that improvements can and will be made, are any of these
blockers?

Thanks,
Anna

On Tue, Feb 5, 2019 at 7:17 PM Ryan Blue <rb...@netflix.com.invalid> wrote:

> Thanks, Anna and Zoltan.
>
> Overall, I think the test does validate most of the contract, but needs to
> be improved. I found a few problems and commented on the commit. I would
> also like to see this as a real integration test in Parquet that is run to
> test the build. Why make this an external test?
>
> rb
>
> On Mon, Feb 4, 2019 at 8:42 AM Anna Szonyi <szo...@cloudera.com.invalid>
> wrote:
>
> > Hi dev@,
> >
> > As part of acquiring the final sign-offs for the parquet release, we had
> > another discussion with Ryan about the testing of column indexes.
> > He asked us to break down our testing into "fun size" test bites and send
> > them out incrementally to the community, so everyone would have time to
> > digest and focus on the testing case by case.
> >
> > We also agreed that we will be focusing on the write path, since that is
> > the critical part that we have to support for the long haul.
> >
> > I will be sending out the related tests; please feel free to ask
> > clarifying questions, and if there is a particular test case you want to
> > see next, please let me know.
> >
> > As a first "taste", Zoltan created the following tool to *validate the
> > contract* for the indices with randomly generated data:
> >
> > - The min value stored in the index must be less than or equal to all
> >   values.
> > - The max value stored in the index must be greater than or equal to all
> >   values.
> > - The null count stored in the index must be equal to the number of
> >   nulls.
> > - Only pages consisting entirely of NULLs can be marked as null pages
> >   in the index.
> > - According to the ASCENDING boundary order, the min value for a page
> >   must be greater than or equal to the min value of the previous page.
> > - According to the ASCENDING boundary order, the max value for a page
> >   must be greater than or equal to the max value of the previous page.
> > - According to the DESCENDING boundary order, the min value for a page
> >   must be less than or equal to the min value of the previous page.
> > - According to the DESCENDING boundary order, the max value for a page
> >   must be less than or equal to the max value of the previous page.
> >
> > https://github.com/zivanfi/parquet-mr/commit/0e74b4207daa0b53cf4bd0774d2632c388845cb9
> > The code itself is unpolished, but fully functional and complete.
> > Once the release is signed off, we plan to refactor this and offer it as
> > part of parquet-tools or parquet-cli; however, it is perfectly fine for
> > validation as-is. A sketch of the boundary order checks follows for
> > illustration.
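> >
> > (Hypothetical standalone code, not the tool itself: names are made up and
> > plain ints stand in for real values; the actual tool validates Parquet
> > files written from randomly generated data.)
> >
> > import java.util.List;
> >
> > public class BoundaryOrderSketch {
> >
> >   enum BoundaryOrder { ASCENDING, DESCENDING, UNORDERED }
> >
> >   // pageMins/pageMaxes: per-page min/max as recorded in the index,
> >   // with null pages already skipped.
> >   static void checkBoundaryOrder(BoundaryOrder order,
> >                                  List<Integer> pageMins,
> >                                  List<Integer> pageMaxes) {
> >     if (order == BoundaryOrder.UNORDERED) {
> >       return; // UNORDERED makes no promise, so there is nothing to verify
> >     }
> >     for (int i = 1; i < pageMins.size(); i++) {
> >       boolean asc = order == BoundaryOrder.ASCENDING;
> >       // ASCENDING: mins and maxes must not decrease from page to page;
> >       // DESCENDING: they must not increase.
> >       boolean minOk = asc ? pageMins.get(i) >= pageMins.get(i - 1)
> >                           : pageMins.get(i) <= pageMins.get(i - 1);
> >       boolean maxOk = asc ? pageMaxes.get(i) >= pageMaxes.get(i - 1)
> >                           : pageMaxes.get(i) <= pageMaxes.get(i - 1);
> >       if (!minOk || !maxOk) {
> >         throw new AssertionError("boundary order " + order
> >             + " violated at page " + i);
> >       }
> >     }
> >   }
> > }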
> >
> > Best,
> > Anna
> >
>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
