@Wes My observation is that most of the parquet-cpp contributors you listed who overlap with the Arrow community mainly contribute to the Arrow bindings (the parquet::arrow layer) and platform API changes in the parquet-cpp repo. Very few of them review or contribute patches to the parquet-cpp core.
I believe improvements to the parquet-cpp core will be negatively impacted, since merging the parquet-cpp and arrow-cpp repos will raise the barrier to entry for new contributors interested in the parquet-cpp core. The current extensions to the parquet-cpp core, bloom filters and column encryption, are all being done by first-time contributors.

If you believe there will be new interest in the parquet-cpp core with the mono-repo approach, I am all for it.

On Mon, Jul 30, 2018 at 12:18 AM Philipp Moritz <pcmor...@gmail.com> wrote:

> I do not claim to have insight into parquet-cpp development. However, from
> our experience developing Ray, I can say that the monorepo approach (for
> Ray) has improved things a lot. Before, we tried various schemes to split
> the project into multiple repos, but the duplicated build system and test
> infrastructure and the overhead of synchronizing changes slowed
> development down significantly (and fixing bugs that touch the subrepos and
> the main repo is inconvenient).
>
> Also, the decision to put arrow and parquet-cpp into a common repo is
> independent of how tightly coupled the two projects are (and there could be
> a matrix entry in travis which tests that PRs keep them decoupled, or
> rather that they both just depend on a small common "base"). Google and
> Facebook demonstrate such independence by having many, many projects in the
> same repo, of course. I think it would be great if the open source
> community moved more in this direction too.
>
> Best,
> Philipp.
>
> On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Donald,
> >
> > This would make things worse, not better. Code changes routinely
> > involve changes to the build system, and so you could be talking about
> > having to make changes to 2 or 3 git repositories as the result of a
> > single new feature or bug fix. There isn't really a cross-repo CI
> > solution available.
> >
> > I've seen some approaches to the monorepo problem using multiple git
> > repositories, such as
> >
> > https://github.com/twosigma/git-meta
> >
> > Until something like this has first-class support from the GitHub
> > platform and its CI services (Travis CI, Appveyor), I don't think it
> > will work for us.
> >
> > - Wes
> >
> > On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <donald.f...@gmail.com>
> > wrote:
> > > Could this work if each module gets configured as a git sub-repo? The
> > > top-level build tool goes into each sub-repo and picks the correct
> > > release version to test. Tests in Python depend on the cpp sub-repo to
> > > ensure the API still passes.
> > >
> > > This should be the best of both worlds, if sub-repos are a supported
> > > option.
> > >
> > > --Donald E. Foss
> > >
> > > On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <majeti.dee...@gmail.com>
> > > wrote:
> > >
> > >> I dislike the current build system complications as well.
> > >>
> > >> However, in my opinion, combining the code bases will severely impact
> > >> the progress of the parquet-cpp project and, implicitly, the progress
> > >> of the entire parquet project.
> > >> Combining would have made much more sense if parquet-cpp were a mature
> > >> project and codebase. But parquet-cpp (and the entire parquet project)
> > >> is evolving continuously, with new features being added including
> > >> bloom filters, column encryption, and indexes.
> > >>
> > >> If the two code bases are merged, it will be much more difficult to
> > >> contribute to the parquet-cpp project since Arrow bindings will now
> > >> have to be supported as well. Please correct me if I am wrong here.
> > >>
> > >> Of the two evils, I think handling the build system and packaging
> > >> duplication is much more manageable, since they are quite stable at
> > >> this point.
> > >>
> > >> Regarding "* API changes cause awkward release coordination issues
> > >> between Arrow and Parquet". Can we make minor releases for parquet-cpp
> > >> (with the API changes needed) as and when Arrow is released?
> > >>
> > >> Regarding "we maintain Arrow conversion code in parquet-cpp for
> > >> converting between Arrow columnar memory format and Parquet". Can this
> > >> be moved to the Arrow project, with parquet-cpp exposing the more
> > >> stable low-level APIs?
> > >>
> > >> I am also curious if the Arrow and Parquet Java implementations have
> > >> similar API compatibility issues.
> > >>
> > >>
> > >> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <wesmck...@gmail.com>
> > >> wrote:
> > >>
> > >> > hi folks,
> > >> >
> > >> > We've been struggling for quite some time with the development
> > >> > workflow between the Arrow and Parquet C++ (and Python) codebases.
> > >> >
> > >> > To explain the root issues:
> > >> >
> > >> > * parquet-cpp depends on "platform code" in Apache Arrow; this
> > >> > includes file interfaces, memory management, miscellaneous
> > >> > algorithms (e.g. dictionary encoding), etc. Note that before this
> > >> > "platform" dependency was introduced, there was significant
> > >> > duplicated code between these codebases and incompatible abstract
> > >> > interfaces for things like files
> > >> >
> > >> > * we maintain Arrow conversion code in parquet-cpp for converting
> > >> > between Arrow columnar memory format and Parquet
> > >> >
> > >> > * we maintain Python bindings for parquet-cpp + Arrow interop in
> > >> > Apache Arrow. This introduces a circular dependency into our CI.
> > >> >
> > >> > * Substantial portions of our CMake build system and related tooling
> > >> > are duplicated between the Arrow and Parquet repos
> > >> >
> > >> > * API changes cause awkward release coordination issues between
> > >> > Arrow and Parquet
> > >> >
> > >> > I believe the best way to remedy the situation is to adopt a
> > >> > "Community over Code" approach and find a way for the Parquet and
> > >> > Arrow C++ development communities to operate out of the same code
> > >> > repository, i.e. the apache/arrow git repository.
> > >> >
> > >> > This would bring major benefits:
> > >> >
> > >> > * Shared CMake build infrastructure, developer tools, and CI
> > >> > infrastructure (Parquet is already being built as a dependency in
> > >> > Arrow's CI systems)
> > >> >
> > >> > * Shared packaging and release management infrastructure
> > >> >
> > >> > * Reduce / eliminate problems due to API changes (where we currently
> > >> > introduce breakage into our CI workflow when there is a breaking /
> > >> > incompatible change)
> > >> >
> > >> > * Arrow releases would include a coordinated snapshot of the Parquet
> > >> > implementation as it stands
> > >> >
> > >> > Continuing with the status quo has become unsatisfactory to me, and
> > >> > as a result I've become less motivated to work on the parquet-cpp
> > >> > codebase.
> > >> >
> > >> > The only Parquet C++ committer who is not an Arrow committer is
> > >> > Deepak Majeti. I think the issue of commit privileges could be
> > >> > resolved without too much difficulty or time.
> > >> >
> > >> > I also think the Apache Parquet community could create release
> > >> > scripts to cut a minimal versioned Apache Parquet C++ release if
> > >> > that is deemed truly necessary.
> > >> >
> > >> > I know that some people are wary of monorepos and megaprojects, but
> > >> > as an example TensorFlow is at least 10 times as large a project in
> > >> > terms of LOC and number of different platform components, and it
> > >> > seems to be getting along just fine. I think we should be able to
> > >> > work together as a community to function just as well.
> > >> >
> > >> > Interested in the opinions of others, and any other ideas for
> > >> > practical solutions to the above problems.
> > >> >
> > >> > Thanks,
> > >> > Wes
> > >> >
> > >>
> > >>
> > >> --
> > >> regards,
> > >> Deepak Majeti
> > >>
> >
>

--
regards,
Deepak Majeti