Your point about the constraints of the ASF release process is well taken, and as a developer who's trying to work in the current environment, I would be much happier if the codebases were merged. The main issues I worry about when you put codebases like these together are:
1. The delineation of APIs becomes blurred and the code becomes too coupled.
2. Releases of artifacts that are lower in the dependency tree are delayed by artifacts higher in the dependency tree.

If the project/release management is structured well and someone keeps an eye on the coupling, then I don't have any concerns.

I would like to point out that Arrow's use of ORC is a great example of how it would be possible to manage parquet-cpp as a separate codebase. That gives me hope that the projects could be managed separately some day.

On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Josh,
>
> > I can imagine use cases for parquet that don't involve arrow and tying them together seems like the wrong choice.
>
> Apache is "Community over Code"; right now it's the same people building these projects -- my argument (which I think you agree with?) is that we should work more closely together until the community grows large enough to support larger-scope process than we have now. As you've seen, our process isn't serving developers of these projects.
>
> > I also think build tooling should be pulled into its own codebase.
>
> I don't see how this can possibly be practical taking into consideration the constraints imposed by the combination of the GitHub platform and the ASF release process. I'm all for being idealistic, but right now we need to be practical. Unless we can devise a practical procedure that can accommodate at least 1 patch per day which may touch both code and build system simultaneously without being a hindrance to contributor or maintainer, I don't see how we can move forward.
>
> > That being said, I think it makes sense to merge the codebases in the short term with the express purpose of separating them in the near term.
>
> I would agree but only if separation can be demonstrated to be practical and result in net improvements in productivity and community growth.
> I think experience has clearly demonstrated that the current separation is impractical, and is causing problems.
>
> Per Julian's and Ted's comments, I think we need to consider development process and ASF releases separately. My argument is as follows:
>
> * Monorepo for development (for practicality)
> * Releases structured according to the desires of the PMCs
>
> - Wes
>
> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <joshuasto...@gmail.com> wrote:
> > I recently worked on an issue that had to be implemented in parquet-cpp (ARROW-1644, ARROW-1599) but required changes in arrow (ARROW-2585, ARROW-2586). I found the circular dependencies confusing and hard to work with. For example, I still have a PR open in parquet-cpp (created on May 10) because of a PR that it depended on in arrow that was recently merged. I couldn't even address any CI issues in the PR because the change in arrow was not yet in master. In a separate PR, I changed the run_clang_format.py script in the arrow project only to find out later that there was an exact copy of it in parquet-cpp.
> >
> > However, I don't think merging the codebases makes sense in the long term. I can imagine use cases for parquet that don't involve arrow and tying them together seems like the wrong choice. There will be other formats that arrow needs to support that will be kept separate (e.g. - Orc), so I don't see why parquet should be special. I also think build tooling should be pulled into its own codebase. GNU has had a long history of developing open source C/C++ projects that way and made projects like autoconf/automake/make to support them. I don't think CI is a good counter-example since there have been lots of successful open source projects that have used nightly build systems that pinned versions of dependent software.
> >
> > That being said, I think it makes sense to merge the codebases in the short term with the express purpose of separating them in the near term. My reasoning is as follows. By putting the codebases together, you can more easily delineate the boundaries between the API's with a single PR. Second, it will force the build tooling to converge instead of diverge, which has already happened. Once the boundaries and tooling have been sorted out, it should be easy to separate them back into their own codebases.
> >
> > If the codebases are merged, I would ask that the C++ codebases for arrow be separated from other languages. Looking at it from the perspective of a parquet-cpp library user, having a dependency on Java is a large tax to pay if you don't need it. For example, there were 25 JIRA's in the 0.10.0 release of arrow, many of which were holding up the release. I hope that seems like a reasonable compromise, and I think it will help reduce the complexity of the build/release tooling.
> >
> >
> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >> >
> >> > > The community will be less willing to accept large changes that require multiple rounds of patches for stability and API convergence. Our contributions to Libhdfs++ in the HDFS community took a significantly long time for the very same reason.
> >> >
> >> > Please don't use bad experiences from another open source community as leverage in this discussion. I'm sorry that things didn't go the way you wanted in Apache Hadoop but this is a distinct community which happens to operate under a similar open governance model.
> >>
> >>
> >> There are some more radical and community building options as well. Take the subversion project as a precedent.
> >> With subversion, any Apache committer can request and receive a commit bit on some large fraction of subversion.
> >>
> >> So why not take this a bit further and give every parquet committer a commit bit in Arrow? Or even make them be first class committers in Arrow? Possibly even make it policy that every Parquet committer who asks will be given committer status in Arrow.
> >>
> >> That relieves a lot of the social anxiety here. Parquet committers can't be worried at that point whether their patches will get merged; they can just merge them. Arrow shouldn't worry much about inviting in the Parquet committers. After all, Arrow already depends a lot on parquet so why not invite them in?
> >> >