Hi Wes, > > I additionally would prefer generating the test corpus at test time > rather than checking in binary files.
Can you elaborate on this? I think both generated on the fly and example files are useful. The checked in files catch regressions even when readers/writers can read their own data but they have either incorrect or undefined behavior in regards to the specification (for example I would imagine checking in a file as part of the fix for ARROW-6844 <https://issues.apache.org/jira/browse/ARROW-6844>). Thanks, Micah On Thu, Oct 10, 2019 at 5:30 PM Renjie Liu <liurenjie2...@gmail.com> wrote: > Thanks wes. Sure I'll fix it. > > Wes McKinney <wesmck...@gmail.com> 于 2019年10月11日周五 上午6:10写道: > > > I just merged the PR https://github.com/apache/arrow-testing/pull/11 > > > > Various aspects of this make me uncomfortable so I hope they can be > > addressed in follow up work > > > > On Thu, Oct 10, 2019 at 5:41 AM Renjie Liu <liurenjie2...@gmail.com> > > wrote: > > > > > > I've create ticket to track here: > > > https://issues.apache.org/jira/browse/ARROW-6845 > > > > > > For this moment, can we check in those pregenerated data to unblock > rust > > > version's arrow reader? > > > > > > On Thu, Oct 10, 2019 at 1:20 PM Renjie Liu <liurenjie2...@gmail.com> > > wrote: > > > > > > > It would be fine in that case. > > > > > > > > Wes McKinney <wesmck...@gmail.com> 于 2019年10月10日周四 下午12:58写道: > > > > > > > >> On Wed, Oct 9, 2019 at 10:16 PM Renjie Liu <liurenjie2...@gmail.com > > > > > >> wrote: > > > >> > > > > >> > 1. There already exists a low level parquet writer which can > produce > > > >> > parquet file, so unit test should be fine. But writer from arrow > to > > > >> parquet > > > >> > doesn't exist yet, and it may take some period of time to finish > it. > > > >> > 2. In fact my data are randomly generated and it's definitely > > > >> reproducible. > > > >> > However, I don't think it would be good idea to randomly generate > > data > > > >> > everytime we run ci because it would be difficult to debug. For > > example > > > >> PR > > > >> > a introduced a bug, which is triggerred in other PR's build it > > would be > > > >> > confusing for contributors. > > > >> > > > >> Presumably any random data generation would use a fixed seed > precisely > > > >> to be reproducible. > > > >> > > > >> > 3. I think it would be good idea to spend effort on integration > test > > > >> with > > > >> > parquet because it's an important use case of arrow. Also similar > > > >> approach > > > >> > could be extended to other language and other file format(avro, > > orc). > > > >> > > > > >> > > > > >> > On Wed, Oct 9, 2019 at 11:08 PM Wes McKinney <wesmck...@gmail.com > > > > > >> wrote: > > > >> > > > > >> > > There are a number of issues worth discussion. > > > >> > > > > > >> > > 1. What is the timeline/plan for Rust implementing a Parquet > > _writer_? > > > >> > > It's OK to be reliant on other libraries in the short term to > > produce > > > >> > > files to test against, but does not strike me as a sustainable > > > >> > > long-term plan. Fixing bugs can be a lot more difficult than it > > needs > > > >> > > to be if you can't write targeted "endogenous" unit tests > > > >> > > > > > >> > > 2. Reproducible data generation > > > >> > > > > > >> > > I think if you're going to test against a pre-generated corpus, > > you > > > >> > > should make sure that generating the corpus is reproducible for > > other > > > >> > > developers (i.e. with a Dockerfile), and can be extended by > > adding new > > > >> > > files or random data generation. > > > >> > > > > > >> > > I additionally would prefer generating the test corpus at test > > time > > > >> > > rather than checking in binary files. If this isn't viable right > > now > > > >> > > we can create an "arrow-rust-crutch" git repository for you to > > stash > > > >> > > binary files until some of these testing scalability issues are > > > >> > > addressed. > > > >> > > > > > >> > > If we're going to spend energy on Parquet integration testing > with > > > >> > > Java, this would be a good opportunity to do the work in a way > > where > > > >> > > the C++ Parquet library can also participate (since we ought to > be > > > >> > > doing integration tests with Java, and we can also read JSON > > files to > > > >> > > Arrow). > > > >> > > > > > >> > > On Tue, Oct 8, 2019 at 11:54 PM Renjie Liu < > > liurenjie2...@gmail.com> > > > >> > > wrote: > > > >> > > > > > > >> > > > On Wed, Oct 9, 2019 at 12:11 PM Andy Grove < > > andygrov...@gmail.com> > > > >> > > wrote: > > > >> > > > > > > >> > > > > I'm very interested in helping to find a solution to this > > because > > > >> we > > > >> > > really > > > >> > > > > do need integration tests for Rust to make sure we're > > compatible > > > >> with > > > >> > > other > > > >> > > > > implementations... there is also the ongoing CI > dockerization > > work > > > >> > > that I > > > >> > > > > feel is related. > > > >> > > > > > > > >> > > > > I haven't looked at the current integration tests yet and > > would > > > >> > > appreciate > > > >> > > > > some pointers on how all of this works (do we have docs?) or > > > >> where to > > > >> > > start > > > >> > > > > looking. > > > >> > > > > > > > >> > > > I have a test in my latest PR: > > > >> https://github.com/apache/arrow/pull/5523 > > > >> > > > And here is the generated data: > > > >> > > > https://github.com/apache/arrow-testing/pull/11 > > > >> > > > As with program to generate these data, it's just a simple > java > > > >> program. > > > >> > > > I'm not sure whether we need to integrate it into arrow. > > > >> > > > > > > >> > > > > > > > >> > > > > I imagine the integration test could follow the approach > that > > > >> Renjie is > > > >> > > > > outlining where we call Java to generate some files and then > > call > > > >> Rust > > > >> > > to > > > >> > > > > parse them? > > > >> > > > > > > > >> > > > > Thanks, > > > >> > > > > > > > >> > > > > Andy. > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > On Tue, Oct 8, 2019 at 9:48 PM Renjie Liu < > > > >> liurenjie2...@gmail.com> > > > >> > > wrote: > > > >> > > > > > > > >> > > > > > Hi: > > > >> > > > > > > > > >> > > > > > I'm developing rust version of reader which reads parquet > > into > > > >> arrow > > > >> > > > > array. > > > >> > > > > > To verify the correct of this reader, I use the following > > > >> approach: > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > 1. Define schema with protobuf. > > > >> > > > > > 2. Generate json data of this schema using other > language > > > >> with > > > >> > > more > > > >> > > > > > sophisticated implementation (e.g. java) > > > >> > > > > > 3. Generate parquet data of this schema using other > > language > > > >> with > > > >> > > more > > > >> > > > > > sophisticated implementation (e.g. java) > > > >> > > > > > 4. Write tests to read json file, and parquet file into > > > >> memory > > > >> > > (arrow > > > >> > > > > > array), then compare json data with arrow data. > > > >> > > > > > > > > >> > > > > > I think with this method we can guarantee the correctness > > of > > > >> arrow > > > >> > > > > reader > > > >> > > > > > because json format is ubiquitous and their implementation > > are > > > >> more > > > >> > > > > stable. > > > >> > > > > > > > > >> > > > > > Any comment is appreciated. > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > > >> > > > -- > > > >> > > > Renjie Liu > > > >> > > > Software Engineer, MVAD > > > >> > > > > > >> > > > > >> > > > > >> > -- > > > >> > Renjie Liu > > > >> > Software Engineer, MVAD > > > >> > > > > > > > > > > -- > > > Renjie Liu > > > Software Engineer, MVAD > > >