Could this work if each module were configured as a git sub-repo? A top-level build tool would go into each sub-repo and pick the correct release version to test. The Python tests would depend on the cpp sub-repo to ensure the API still passes.
This should be the best of both worlds, if sub-repos are a supported option.

--Donald E. Foss

On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <majeti.dee...@gmail.com> wrote:

> I dislike the current build system complications as well.
>
> However, in my opinion, combining the code bases will severely impact the
> progress of the parquet-cpp project and implicitly the progress of the
> entire parquet project.
> Combining would have made much more sense if parquet-cpp were a mature
> project and codebase. But parquet-cpp (and the entire parquet project) is
> evolving continuously with new features being added including bloom
> filters, column encryption, and indexes.
>
> If the two code bases were merged, it would be much more difficult to
> contribute to the parquet-cpp project since Arrow bindings would then have
> to be supported as well. Please correct me if I am wrong here.
>
> Of the two evils, I think handling the build system and packaging
> duplication is much more manageable since they are quite stable at this
> point.
>
> Regarding "* API changes cause awkward release coordination issues between
> Arrow and Parquet". Can we make minor releases for parquet-cpp (with API
> changes needed) as and when Arrow is released?
>
> Regarding "we maintain Arrow conversion code in parquet-cpp for
> converting between Arrow columnar memory format and Parquet". Can this be
> moved to the Arrow project, with parquet-cpp exposing the more stable
> low-level APIs?
>
> I am also curious if the Arrow and Parquet Java implementations have
> similar API compatibility issues.
>
>
> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi folks,
> >
> > We've been struggling for quite some time with the development
> > workflow between the Arrow and Parquet C++ (and Python) codebases.
> >
> > To explain the root issues:
> >
> > * parquet-cpp depends on "platform code" in Apache Arrow; this
> > includes file interfaces, memory management, miscellaneous algorithms
> > (e.g. dictionary encoding), etc. Note that before this "platform"
> > dependency was introduced, there was significant duplicated code
> > between these codebases and incompatible abstract interfaces for
> > things like files
> >
> > * we maintain Arrow conversion code in parquet-cpp for converting
> > between Arrow columnar memory format and Parquet
> >
> > * we maintain Python bindings for parquet-cpp + Arrow interop in
> > Apache Arrow. This introduces a circular dependency into our CI.
> >
> > * Substantial portions of our CMake build system and related tooling
> > are duplicated between the Arrow and Parquet repos
> >
> > * API changes cause awkward release coordination issues between Arrow
> > and Parquet
> >
> > I believe the best way to remedy the situation is to adopt a
> > "Community over Code" approach and find a way for the Parquet and
> > Arrow C++ development communities to operate out of the same code
> > repository, i.e. the apache/arrow git repository.
> >
> > This would bring major benefits:
> >
> > * Shared CMake build infrastructure, developer tools, and CI
> > infrastructure (Parquet is already being built as a dependency in
> > Arrow's CI systems)
> >
> > * Shared packaging and release management infrastructure
> >
> > * Reduce / eliminate problems due to API changes (where we currently
> > introduce breakage into our CI workflow when there is a breaking /
> > incompatible change)
> >
> > * Arrow releases would include a coordinated snapshot of the Parquet
> > implementation as it stands
> >
> > Continuing with the status quo has become unsatisfactory to me and as
> > a result I've become less motivated to work on the parquet-cpp
> > codebase.
> >
> > The only Parquet C++ committer who is not an Arrow committer is Deepak
> > Majeti. I think the issue of commit privileges could be resolved
> > without too much difficulty or time.
> >
> > I also think that the Apache Parquet community could create release
> > scripts to cut a minimal versioned Apache Parquet C++ release if that
> > is deemed truly necessary.
> >
> > I know that some people are wary of monorepos and megaprojects, but as
> > an example TensorFlow is at least 10 times as large a project in
> > terms of LOC and number of different platform components, and it
> > seems to be getting along just fine. I think we should be able to work
> > together as a community to function just as well.
> >
> > Interested in the opinions of others, and any other ideas for
> > practical solutions to the above problems.
> >
> > Thanks,
> > Wes
> >
>
>
> --
> regards,
> Deepak Majeti
>