@Antoine
> By the way, one concern with the monorepo approach: it would slightly
> increase Arrow CI times (which are already too large).

A typical CI run in Arrow takes about 45 minutes:
https://travis-ci.org/apache/arrow/builds/410119750

A Parquet run takes about 28 minutes:
https://travis-ci.org/apache/parquet-cpp/builds/410147208

Inevitably we will need to create some kind of bot to run certain
builds on demand, based on commit / PR metadata or on request.

The slowest build in Arrow (the Arrow C++/Python one) could be made
substantially shorter by moving some of the slower parts (like the
Python ASV benchmarks) from being tested on every commit to nightly or
on demand. Using ASAN instead of valgrind in Travis would also improve
build times (the valgrind build could be moved to a nightly exhaustive
test run).

- Wes

On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> I would like to point out that arrow's use of orc is a great example of how
>> it would be possible to manage parquet-cpp as a separate codebase. That
>> gives me hope that the projects could be managed separately some day.
>
> Well, I don't know that ORC is the best example. The ORC C++ codebase
> features several areas of duplicated logic which could be replaced by
> components from the Arrow platform for better platform-wide
> interoperability:
>
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>
> ORC's use of symbols from Protocol Buffers was actually a cause of
> bugs that we had to fix in Arrow's build system to prevent them from
> leaking to third party linkers when statically linked (ORC is only
> available for static linking at the moment AFAIK).
>
> I question whether it's worth the community's time long term to wear
> ourselves out defining custom "ports" / virtual interfaces in each
> library to plug components together rather than utilizing common
> platform APIs.
>
> - Wes
>
> On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <joshuasto...@gmail.com>
> wrote:
>> Your point about the constraints of the ASF release process is well
>> taken, and as a developer who's trying to work in the current environment I
>> would be much happier if the codebases were merged. The main issues I worry
>> about when you put codebases like these together are:
>>
>> 1. The delineation of APIs becomes blurred and the code becomes too coupled
>> 2. Releases of artifacts that are lower in the dependency tree are delayed
>>    by artifacts higher in the dependency tree
>>
>> If the project/release management is structured well and someone keeps an
>> eye on the coupling, then I don't have any concerns.
>>
>> I would like to point out that arrow's use of orc is a great example of how
>> it would be possible to manage parquet-cpp as a separate codebase. That
>> gives me hope that the projects could be managed separately some day.
>>
>> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>> hi Josh,
>>>
>>> > I can imagine use cases for parquet that don't involve arrow and tying
>>> > them together seems like the wrong choice.
>>>
>>> Apache is "Community over Code"; right now it's the same people
>>> building these projects -- my argument (which I think you agree with?)
>>> is that we should work more closely together until the community grows
>>> large enough to support larger-scope process than we have now. As
>>> you've seen, our process isn't serving developers of these projects.
>>>
>>> > I also think build tooling should be pulled into its own codebase.
>>>
>>> I don't see how this can possibly be practical taking into
>>> consideration the constraints imposed by the combination of the GitHub
>>> platform and the ASF release process. I'm all for being idealistic,
>>> but right now we need to be practical. Unless we can devise a
>>> practical procedure that can accommodate at least 1 patch per day
>>> which may touch both code and build system simultaneously without
>>> being a hindrance to contributor or maintainer, I don't see how we can
>>> move forward.
>>>
>>> > That being said, I think it makes sense to merge the codebases in the
>>> > short term with the express purpose of separating them in the near term.
>>>
>>> I would agree but only if separation can be demonstrated to be
>>> practical and result in net improvements in productivity and community
>>> growth. I think experience has clearly demonstrated that the current
>>> separation is impractical, and is causing problems.
>>>
>>> Per Julian's and Ted's comments, I think we need to consider
>>> development process and ASF releases separately. My argument is as
>>> follows:
>>>
>>> * Monorepo for development (for practicality)
>>> * Releases structured according to the desires of the PMCs
>>>
>>> - Wes
>>>
>>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <joshuasto...@gmail.com>
>>> wrote:
>>> > I recently worked on an issue that had to be implemented in parquet-cpp
>>> > (ARROW-1644, ARROW-1599) but required changes in arrow (ARROW-2585,
>>> > ARROW-2586). I found the circular dependencies confusing and hard to work
>>> > with. For example, I still have a PR open in parquet-cpp (created on May
>>> > 10) because of a PR that it depended on in arrow that was recently merged.
>>> > I couldn't even address any CI issues in the PR because the change in arrow
>>> > was not yet in master.
>>> > In a separate PR, I changed the run_clang_format.py
>>> > script in the arrow project only to find out later that there was an exact
>>> > copy of it in parquet-cpp.
>>> >
>>> > However, I don't think merging the codebases makes sense in the long term.
>>> > I can imagine use cases for parquet that don't involve arrow and tying them
>>> > together seems like the wrong choice. There will be other formats that
>>> > arrow needs to support that will be kept separate (e.g., ORC), so I don't
>>> > see why parquet should be special. I also think build tooling should be
>>> > pulled into its own codebase. GNU has had a long history of developing open
>>> > source C/C++ projects that way and made projects like
>>> > autoconf/automake/make to support them. I don't think CI is a good
>>> > counter-example since there have been lots of successful open source
>>> > projects that have used nightly build systems that pinned versions of
>>> > dependent software.
>>> >
>>> > That being said, I think it makes sense to merge the codebases in the short
>>> > term with the express purpose of separating them in the near term. My
>>> > reasoning is as follows. First, by putting the codebases together, you can
>>> > more easily delineate the boundaries between the APIs with a single PR.
>>> > Second, it will force the build tooling to converge instead of diverge,
>>> > which has already happened. Once the boundaries and tooling have been
>>> > sorted out, it should be easy to separate them back into their own
>>> > codebases.
>>> >
>>> > If the codebases are merged, I would ask that the C++ codebases for arrow
>>> > be separated from other languages. Looking at it from the perspective of a
>>> > parquet-cpp library user, having a dependency on Java is a large tax to pay
>>> > if you don't need it. For example, there were 25 JIRAs in the 0.10.0
>>> > release of arrow, many of which were holding up the release.
>>> > I hope that seems like a reasonable compromise, and I think it will help
>>> > reduce the complexity of the build/release tooling.
>>> >
>>> >
>>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
>>> >
>>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>> >>
>>> >> >
>>> >> > > The community will be less willing to accept large
>>> >> > > changes that require multiple rounds of patches for stability and API
>>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS community took a
>>> >> > > significantly long time for the very same reason.
>>> >> >
>>> >> > Please don't use bad experiences from another open source community as
>>> >> > leverage in this discussion. I'm sorry that things didn't go the way
>>> >> > you wanted in Apache Hadoop but this is a distinct community which
>>> >> > happens to operate under a similar open governance model.
>>> >>
>>> >>
>>> >> There are some more radical and community building options as well. Take
>>> >> the subversion project as a precedent. With subversion, any Apache
>>> >> committer can request and receive a commit bit on some large fraction of
>>> >> subversion.
>>> >>
>>> >> So why not take this a bit further and give every parquet committer a
>>> >> commit bit in Arrow? Or even make them be first class committers in Arrow?
>>> >> Possibly even make it policy that every Parquet committer who asks will be
>>> >> given committer status in Arrow.
>>> >>
>>> >> That relieves a lot of the social anxiety here. Parquet committers can't be
>>> >> worried at that point whether their patches will get merged; they can just
>>> >> merge them. Arrow shouldn't worry much about inviting in the Parquet
>>> >> committers. After all, Arrow already depends a lot on parquet so why not
>>> >> invite them in?
>>> >>
>>>