Thanks Tim. Indeed, it's not very simple. Just today Antoine cleaned up some platform code intending to improve the performance of bit-packing in Parquet writes, and we ended up with 2 interdependent PRs
* https://github.com/apache/parquet-cpp/pull/483
* https://github.com/apache/arrow/pull/2355

Changes that impact the Python interface to Parquet are even more complex.

Adding options to Arrow's CMake build system to build only Parquet-related
code and dependencies (in a monorepo framework) would not be difficult, and
would amount to writing "make parquet".

See e.g. https://stackoverflow.com/a/17201375. The desired commands to build
and install the Parquet core libraries and their dependencies would be:

    ninja parquet && ninja install

- Wes

On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong <tarmstr...@cloudera.com.invalid> wrote:
> I don't have a direct stake in this beyond wanting to see Parquet be
> successful, but I thought I'd give my two cents.
>
> For me, the thing that makes the biggest difference in contributing to a
> new codebase is the number of steps in the workflow for writing, testing,
> posting and iterating on a commit and also the number of opportunities for
> missteps. The size of the repo and build/test times matter but are
> secondary so long as the workflow is simple and reliable.
>
> I don't really know what the current state of things is, but it sounds like
> it's not as simple as check out -> build -> test if you're doing a
> cross-repo change. Circular dependencies are a real headache.
>
> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi,
>>
>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <majeti.dee...@gmail.com> wrote:
>> > I think the circular dependency can be broken if we build a new library for
>> > the platform code. This will also make it easy for other projects such as
>> > ORC to use it.
>> > I also remember your proposal a while ago of having a separate project for
>> > the platform code. That project can live in the arrow repo. However, one
>> > has to clone the entire apache arrow repo but can just build the platform
>> > code. This will be temporary until we can find a new home for it.
>> >
>> > The dependency will look like:
>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
>> > libplatform(platform api)
>> >
>> > CI workflow will clone the arrow project twice, once for the platform
>> > library and once for the arrow-core/bindings library.
>>
>> This seems like an interesting proposal; the best place to work toward
>> this goal (if it is even possible; the build system interactions and
>> ASF release management are the hard problems) is to have all of the
>> code in a single repository. ORC could already be using Arrow if it
>> wanted, but the ORC contributors aren't active in Arrow.
>>
>> >
>> > There is no doubt that the collaborations between the Arrow and Parquet
>> > communities so far have been very successful.
>> > The reason to maintain this relationship moving forward is to continue to
>> > reap the mutual benefits.
>> > We should continue to take advantage of sharing code as well. However, I
>> > don't see any code sharing opportunities between arrow-core and the
>> > parquet-core. Both have different functions.
>>
>> I think you mean the Arrow columnar format. The Arrow columnar format
>> is only one part of a project that has become quite large already
>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
>>
>> >
>> > We are at a point where the parquet-cpp public API is pretty stable. We
>> > already passed that difficult stage. My take at arrow and parquet is to
>> > keep them nimble since we can.
>>
>> I believe that parquet-core has progress to make yet ahead of it. We
>> have done little work in asynchronous IO and concurrency which would
>> yield both improved read and write throughput. This aligns well with
>> other concurrency and async-IO work planned in the Arrow platform.
I >> believe that more development will happen on parquet-core once the >> development process issues are resolved by having a single codebase, >> single build system, and a single CI framework. >> >> I have some gripes about design decisions made early in parquet-cpp, >> like the use of C++ exceptions. So while "stability" is a reasonable >> goal I think we should still be open to making significant changes in >> the interest of long term progress. >> >> Having now worked on these projects for more than 2 and a half years >> and the most frequent contributor to both codebases, I'm sadly far >> past the "breaking point" and not willing to continue contributing in >> a significant way to parquet-cpp if the projects remained structured >> as they are now. It's hampering progress and not serving the >> community. >> >> - Wes >> >> > >> > >> > >> > >> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmck...@gmail.com> >> wrote: >> > >> >> > The current Arrow adaptor code for parquet should live in the arrow >> >> repo. That will remove a majority of the dependency issues. Joshua's >> work >> >> would not have been blocked in parquet-cpp if that adapter was in the >> arrow >> >> repo. This will be similar to the ORC adaptor. >> >> >> >> This has been suggested before, but I don't see how it would alleviate >> >> any issues because of the significant dependencies on other parts of >> >> the Arrow codebase. What you are proposing is: >> >> >> >> - (Arrow) arrow platform >> >> - (Parquet) parquet core >> >> - (Arrow) arrow columnar-parquet adapter interface >> >> - (Arrow) Python bindings >> >> >> >> To make this work, somehow Arrow core / libarrow would have to be >> >> built before invoking the Parquet core part of the build system. You >> >> would need to pass dependent targets across different CMake build >> >> systems; I don't know if it's possible (I spent some time looking into >> >> it earlier this year). 
This is what I meant by the lack of a "concrete >> >> and actionable plan". The only thing that would really work would be >> >> for the Parquet core to be "included" in the Arrow build system >> >> somehow rather than using ExternalProject. Currently Parquet builds >> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow build >> >> system because it's only depended upon by the Python bindings. >> >> >> >> And even if a solution could be devised, it would not wholly resolve >> >> the CI workflow issues. >> >> >> >> You could make Parquet completely independent of the Arrow codebase, >> >> but at that point there is little reason to maintain a relationship >> >> between the projects or their communities. We have spent a great deal >> >> of effort refactoring the two projects to enable as much code sharing >> >> as there is now. >> >> >> >> - Wes >> >> >> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <wesmck...@gmail.com> >> wrote: >> >> >> If you still strongly feel that the only way forward is to clone the >> >> parquet-cpp repo and part ways, I will withdraw my concern. Having two >> >> parquet-cpp repos is no way a better approach. >> >> > >> >> > Yes, indeed. In my view, the next best option after a monorepo is to >> >> > fork. That would obviously be a bad outcome for the community. >> >> > >> >> > It doesn't look like I will be able to convince you that a monorepo is >> >> > a good idea; what I would ask instead is that you be willing to give >> >> > it a shot, and if it turns out in the way you're describing (which I >> >> > don't think it will) then I suggest that we fork at that point. >> >> > >> >> > - Wes >> >> > >> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti < >> majeti.dee...@gmail.com> >> >> wrote: >> >> >> Wes, >> >> >> >> >> >> Unfortunately, I cannot show you any practical fact-based problems >> of a >> >> >> non-existent Arrow-Parquet mono-repo. 
>> >> >> Bringing in related Apache community experiences are more meaningful >> >> than >> >> >> how mono-repos work at Google and other big organizations. >> >> >> We solely depend on volunteers and cannot hire full-time developers. >> >> >> You are very well aware of how difficult it has been to find more >> >> >> contributors and maintainers for Arrow. parquet-cpp already has a low >> >> >> contribution rate to its core components. >> >> >> >> >> >> We should target to ensure that new volunteers who want to contribute >> >> >> bug-fixes/features should spend the least amount of time in figuring >> out >> >> >> the project repo. We can never come up with an automated build system >> >> that >> >> >> caters to every possible environment. >> >> >> My only concern is if the mono-repo will make it harder for new >> >> developers >> >> >> to work on parquet-cpp core just due to the additional code, build >> and >> >> test >> >> >> dependencies. >> >> >> I am not saying that the Arrow community/committers will be less >> >> >> co-operative. >> >> >> I just don't think the mono-repo structure model will be sustainable >> in >> >> an >> >> >> open source community unless there are long-term vested interests. We >> >> can't >> >> >> predict that. >> >> >> >> >> >> The current circular dependency problems between Arrow and Parquet >> is a >> >> >> major problem for the community and it is important. >> >> >> >> >> >> The current Arrow adaptor code for parquet should live in the arrow >> >> repo. >> >> >> That will remove a majority of the dependency issues. >> >> >> Joshua's work would not have been blocked in parquet-cpp if that >> adapter >> >> >> was in the arrow repo. This will be similar to the ORC adaptor. >> >> >> >> >> >> The platform API code is pretty stable at this point. Minor changes >> in >> >> the >> >> >> future to this code should not be the main reason to combine the >> arrow >> >> >> parquet repos. 
>> >> >>
>> >> >> "*I question whether it's worth the community's time long term to wear
>> >> >> ourselves out defining custom "ports" / virtual interfaces in each library
>> >> >> to plug components together rather than utilizing common platform APIs.*"
>> >> >>
>> >> >> My answer to your question below would be "Yes". Modularity/separation is
>> >> >> very important in an open source community where priorities of contributors
>> >> >> are often short term.
>> >> >> The retention is low and therefore the acquisition costs should be low as
>> >> >> well. This is the community over code approach according to me. Minor code
>> >> >> duplication is not a deal breaker.
>> >> >> ORC, Parquet, Arrow, etc. are all different components in the big data
>> >> >> space serving their own functions.
>> >> >>
>> >> >> If you still strongly feel that the only way forward is to clone the
>> >> >> parquet-cpp repo and part ways, I will withdraw my concern. Having two
>> >> >> parquet-cpp repos is in no way a better approach.
>> >> >>
>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <wesmck...@gmail.com> wrote:
>> >> >>
>> >> >>> @Antoine
>> >> >>>
>> >> >>> > By the way, one concern with the monorepo approach: it would slightly
>> >> >>> > increase Arrow CI times (which are already too large).
>> >> >>>
>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
>> >> >>>
>> >> >>> A Parquet run takes about 28 minutes:
>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>> >> >>>
>> >> >>> Inevitably we will need to create some kind of bot to run certain
>> >> >>> builds on-demand based on commit / PR metadata or on request.
>> >> >>>
>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) could be
>> >> >>> made substantially shorter by moving some of the slower parts (like
>> >> >>> the Python ASV benchmarks) from being tested every-commit to nightly
>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would also
>> >> >>> improve build times (valgrind build could be moved to a nightly
>> >> >>> exhaustive test run)
>> >> >>>
>> >> >>> - Wes
>> >> >>>
>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> >> >>> >> I would like to point out that arrow's use of orc is a great example of
>> >> >>> >> how it would be possible to manage parquet-cpp as a separate codebase. That
>> >> >>> >> gives me hope that the projects could be managed separately some day.
>> >> >>> >
>> >> >>> > Well, I don't know that ORC is the best example. The ORC C++ codebase
>> >> >>> > features several areas of duplicated logic which could be replaced by
>> >> >>> > components from the Arrow platform for better platform-wide
>> >> >>> > interoperability:
>> >> >>> >
>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>> >> >>> >
>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a cause of
>> >> >>> > bugs that we had to fix in Arrow's build system to prevent them from
>> >> >>> > leaking to third party linkers when statically linked (ORC is only
>> >> >>> > available for static linking at the moment AFAIK).
>> >> >>> >
>> >> >>> > I question whether it's worth the community's time long term to wear
>> >> >>> > ourselves out defining custom "ports" / virtual interfaces in each
>> >> >>> > library to plug components together rather than utilizing common
>> >> >>> > platform APIs.
>> >> >>> >
>> >> >>> > - Wes
>> >> >>> >
>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <joshuasto...@gmail.com> wrote:
>> >> >>> >> Your point about the constraints of the ASF release process is well
>> >> >>> >> taken, and as a developer who's trying to work in the current environment I
>> >> >>> >> would be much happier if the codebases were merged. The main issues I worry
>> >> >>> >> about when you put codebases like these together are:
>> >> >>> >>
>> >> >>> >> 1. The delineation of APIs becomes blurred and the code becomes too coupled
>> >> >>> >> 2. Release of artifacts that are lower in the dependency tree is delayed
>> >> >>> >> by artifacts higher in the dependency tree
>> >> >>> >>
>> >> >>> >> If the project/release management is structured well and someone keeps an
>> >> >>> >> eye on the coupling, then I don't have any concerns.
>> >> >>> >>
>> >> >>> >> I would like to point out that arrow's use of orc is a great example of how
>> >> >>> >> it would be possible to manage parquet-cpp as a separate codebase. That
>> >> >>> >> gives me hope that the projects could be managed separately some day.
>> >> >>> >>
>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> >> >>> >>
>> >> >>> >>> hi Josh,
>> >> >>> >>>
>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow and tying
>> >> >>> >>> > them together seems like the wrong choice.
>> >> >>> >>> >> >> >>> >>> Apache is "Community over Code"; right now it's the same people >> >> >>> >>> building these projects -- my argument (which I think you agree >> >> with?) >> >> >>> >>> is that we should work more closely together until the community >> >> grows >> >> >>> >>> large enough to support larger-scope process than we have now. >> As >> >> >>> >>> you've seen, our process isn't serving developers of these >> >> projects. >> >> >>> >>> >> >> >>> >>> > I also think build tooling should be pulled into its own >> >> codebase. >> >> >>> >>> >> >> >>> >>> I don't see how this can possibly be practical taking into >> >> >>> >>> consideration the constraints imposed by the combination of the >> >> GitHub >> >> >>> >>> platform and the ASF release process. I'm all for being >> idealistic, >> >> >>> >>> but right now we need to be practical. Unless we can devise a >> >> >>> >>> practical procedure that can accommodate at least 1 patch per >> day >> >> >>> >>> which may touch both code and build system simultaneously >> without >> >> >>> >>> being a hindrance to contributor or maintainer, I don't see how >> we >> >> can >> >> >>> >>> move forward. >> >> >>> >>> >> >> >>> >>> > That being said, I think it makes sense to merge the codebases >> >> in the >> >> >>> >>> short term with the express purpose of separating them in the >> near >> >> >>> term. >> >> >>> >>> >> >> >>> >>> I would agree but only if separation can be demonstrated to be >> >> >>> >>> practical and result in net improvements in productivity and >> >> community >> >> >>> >>> growth. I think experience has clearly demonstrated that the >> >> current >> >> >>> >>> separation is impractical, and is causing problems. >> >> >>> >>> >> >> >>> >>> Per Julian's and Ted's comments, I think we need to consider >> >> >>> >>> development process and ASF releases separately. 
My argument is >> as >> >> >>> >>> follows: >> >> >>> >>> >> >> >>> >>> * Monorepo for development (for practicality) >> >> >>> >>> * Releases structured according to the desires of the PMCs >> >> >>> >>> >> >> >>> >>> - Wes >> >> >>> >>> >> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck < >> >> joshuasto...@gmail.com >> >> >>> > >> >> >>> >>> wrote: >> >> >>> >>> > I recently worked on an issue that had to be implemented in >> >> >>> parquet-cpp >> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow >> >> (ARROW-2585, >> >> >>> >>> > ARROW-2586). I found the circular dependencies confusing and >> >> hard to >> >> >>> work >> >> >>> >>> > with. For example, I still have a PR open in parquet-cpp >> >> (created on >> >> >>> May >> >> >>> >>> > 10) because of a PR that it depended on in arrow that was >> >> recently >> >> >>> >>> merged. >> >> >>> >>> > I couldn't even address any CI issues in the PR because the >> >> change in >> >> >>> >>> arrow >> >> >>> >>> > was not yet in master. In a separate PR, I changed the >> >> >>> >>> run_clang_format.py >> >> >>> >>> > script in the arrow project only to find out later that there >> >> was an >> >> >>> >>> exact >> >> >>> >>> > copy of it in parquet-cpp. >> >> >>> >>> > >> >> >>> >>> > However, I don't think merging the codebases makes sense in >> the >> >> long >> >> >>> >>> term. >> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow >> and >> >> >>> tying >> >> >>> >>> them >> >> >>> >>> > together seems like the wrong choice. There will be other >> formats >> >> >>> that >> >> >>> >>> > arrow needs to support that will be kept separate (e.g. - >> Orc), >> >> so I >> >> >>> >>> don't >> >> >>> >>> > see why parquet should be special. I also think build tooling >> >> should >> >> >>> be >> >> >>> >>> > pulled into its own codebase. 
GNU has had a long history of >> >> >>> developing >> >> >>> >>> open >> >> >>> >>> > source C/C++ projects that way and made projects like >> >> >>> >>> > autoconf/automake/make to support them. I don't think CI is a >> >> good >> >> >>> >>> > counter-example since there have been lots of successful open >> >> source >> >> >>> >>> > projects that have used nightly build systems that pinned >> >> versions of >> >> >>> >>> > dependent software. >> >> >>> >>> > >> >> >>> >>> > That being said, I think it makes sense to merge the codebases >> >> in the >> >> >>> >>> short >> >> >>> >>> > term with the express purpose of separating them in the near >> >> term. >> >> >>> My >> >> >>> >>> > reasoning is as follows. By putting the codebases together, >> you >> >> can >> >> >>> more >> >> >>> >>> > easily delineate the boundaries between the API's with a >> single >> >> PR. >> >> >>> >>> Second, >> >> >>> >>> > it will force the build tooling to converge instead of >> diverge, >> >> >>> which has >> >> >>> >>> > already happened. Once the boundaries and tooling have been >> >> sorted >> >> >>> out, >> >> >>> >>> it >> >> >>> >>> > should be easy to separate them back into their own codebases. >> >> >>> >>> > >> >> >>> >>> > If the codebases are merged, I would ask that the C++ >> codebases >> >> for >> >> >>> arrow >> >> >>> >>> > be separated from other languages. Looking at it from the >> >> >>> perspective of >> >> >>> >>> a >> >> >>> >>> > parquet-cpp library user, having a dependency on Java is a >> large >> >> tax >> >> >>> to >> >> >>> >>> pay >> >> >>> >>> > if you don't need it. For example, there were 25 JIRA's in the >> >> 0.10.0 >> >> >>> >>> > release of arrow, many of which were holding up the release. I >> >> hope >> >> >>> that >> >> >>> >>> > seems like a reasonable compromise, and I think it will help >> >> reduce >> >> >>> the >> >> >>> >>> > complexity of the build/release tooling. 
>> >> >>> >>> > >> >> >>> >>> > >> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning < >> >> ted.dunn...@gmail.com> >> >> >>> >>> wrote: >> >> >>> >>> > >> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney < >> >> wesmck...@gmail.com> >> >> >>> >>> wrote: >> >> >>> >>> >> >> >> >>> >>> >> > >> >> >>> >>> >> > > The community will be less willing to accept large >> >> >>> >>> >> > > changes that require multiple rounds of patches for >> >> stability >> >> >>> and >> >> >>> >>> API >> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS >> >> >>> community >> >> >>> >>> took >> >> >>> >>> >> a >> >> >>> >>> >> > > significantly long time for the very same reason. >> >> >>> >>> >> > >> >> >>> >>> >> > Please don't use bad experiences from another open source >> >> >>> community as >> >> >>> >>> >> > leverage in this discussion. I'm sorry that things didn't >> go >> >> the >> >> >>> way >> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct >> community >> >> which >> >> >>> >>> >> > happens to operate under a similar open governance model. >> >> >>> >>> >> >> >> >>> >>> >> >> >> >>> >>> >> There are some more radical and community building options as >> >> well. >> >> >>> Take >> >> >>> >>> >> the subversion project as a precedent. With subversion, any >> >> Apache >> >> >>> >>> >> committer can request and receive a commit bit on some large >> >> >>> fraction of >> >> >>> >>> >> subversion. >> >> >>> >>> >> >> >> >>> >>> >> So why not take this a bit further and give every parquet >> >> committer >> >> >>> a >> >> >>> >>> >> commit bit in Arrow? Or even make them be first class >> >> committers in >> >> >>> >>> Arrow? >> >> >>> >>> >> Possibly even make it policy that every Parquet committer who >> >> asks >> >> >>> will >> >> >>> >>> be >> >> >>> >>> >> given committer status in Arrow. >> >> >>> >>> >> >> >> >>> >>> >> That relieves a lot of the social anxiety here. 
Parquet >> >> committers >> >> >>> >>> can't be >> >> >>> >>> >> worried at that point whether their patches will get merged; >> >> they >> >> >>> can >> >> >>> >>> just >> >> >>> >>> >> merge them. Arrow shouldn't worry much about inviting in the >> >> >>> Parquet >> >> >>> >>> >> committers. After all, Arrow already depends a lot on >> parquet so >> >> >>> why not >> >> >>> >>> >> invite them in? >> >> >>> >>> >> >> >> >>> >>> >> >> >>> >> >> >> >> >> >> >> >> >> -- >> >> >> regards, >> >> >> Deepak Majeti >> >> >> > >> > >> > -- >> > regards, >> > Deepak Majeti >>