Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Wes McKinney Tue, 31 Jul 2018 07:29:05 -0700

@Antoine

> By the way, one concern with the monorepo approach: it would slightly 
> increase Arrow CI times (which are already too large).


A typical CI run in Arrow is taking about 45 minutes:
https://travis-ci.org/apache/arrow/builds/410119750

A Parquet run takes about 28 minutes:
https://travis-ci.org/apache/parquet-cpp/builds/410147208

Inevitably we will need to create some kind of bot to run certain
builds on-demand based on commit / PR metadata or on request.

The slowest build in Arrow (the Arrow C++/Python one) could be
made substantially shorter by moving some of the slower parts (like
the Python ASV benchmarks) from being tested on every commit to nightly
or on demand. Using ASAN instead of valgrind in Travis would also
improve build times (the valgrind build could be moved to a nightly
exhaustive test run).
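
As a rough sketch of the selection logic such a bot might apply: the
directory names, build names, and the select_builds function below are all
hypothetical illustrations, not an existing Arrow tool or its real CI
configuration. Per-commit runs build only what the changed paths touch, while
the slow jobs are left to the nightly run or an explicit request.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from top-level directory to the CI builds it affects.
# These names are illustrative, not the project's actual configuration.
BUILDS_BY_DIR = {
    "cpp": {"cpp", "python"},
    "python": {"python"},
    "java": {"java"},
}

# Slow jobs that would run only nightly or on request (assumed names).
NIGHTLY_ONLY = {"valgrind", "asv-benchmarks"}


def select_builds(changed_paths, trigger="commit", requested=()):
    """Return the set of builds to run for a change."""
    builds = set()
    for path in changed_paths:
        parts = PurePosixPath(path).parts
        top = parts[0] if parts else ""
        builds |= BUILDS_BY_DIR.get(top, set())
    if trigger == "nightly":
        # The nightly run is exhaustive: every build plus the slow jobs.
        builds = set().union(*BUILDS_BY_DIR.values()) | NIGHTLY_ONLY
    # Builds asked for explicitly (e.g. in a PR comment) are always added.
    return builds | set(requested)
```

In this sketch, select_builds(["cpp/src/arrow/array.cc"]) selects the C++ and
Python builds, and passing requested=["valgrind"] adds the valgrind job to a
single PR without making it part of every commit.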

- Wes

On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>> I would like to point out that arrow's use of orc is a great example of how 
>> it would be possible to manage parquet-cpp as a separate codebase. That 
>> gives me hope that the projects could be managed separately some day.
>
> Well, I don't know that ORC is the best example. The ORC C++ codebase
> features several areas of duplicated logic which could be replaced by
> components from the Arrow platform for better platform-wide
> interoperability:
>
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
> https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
> https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>
> ORC's use of symbols from Protocol Buffers was actually a cause of
> bugs that we had to fix in Arrow's build system to prevent them from
> leaking to third party linkers when statically linked (ORC is only
> available for static linking at the moment AFAIK).
>
> I question whether it's worth the community's time long term to wear
> ourselves out defining custom "ports" / virtual interfaces in each
> library to plug components together rather than utilizing common
> platform APIs.
>
> - Wes
>
> On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <joshuasto...@gmail.com> 
> wrote:
>> Your point about the constraints of the ASF release process is well
>> taken, and as a developer who's trying to work in the current environment I
>> would be much happier if the codebases were merged. The main issues I worry
>> about when you put codebases like these together are:
>>
>> 1. The delineation of APIs becomes blurred and the code becomes too coupled
>> 2. Release of artifacts that are lower in the dependency tree is delayed
>> by artifacts higher in the dependency tree
>>
>> If the project/release management is structured well and someone keeps an
>> eye on the coupling, then I don't have any concerns.
>>
>> I would like to point out that arrow's use of orc is a great example of how
>> it would be possible to manage parquet-cpp as a separate codebase. That
>> gives me hope that the projects could be managed separately some day.
>>
>> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>>> hi Josh,
>>>
>>> > I can imagine use cases for parquet that don't involve arrow and tying
>>> > them together seems like the wrong choice.
>>>
>>> Apache is "Community over Code"; right now it's the same people
>>> building these projects -- my argument (which I think you agree with?)
>>> is that we should work more closely together until the community grows
>>> large enough to support larger-scope process than we have now. As
>>> you've seen, our process isn't serving developers of these projects.
>>>
>>> > I also think build tooling should be pulled into its own codebase.
>>>
>>> I don't see how this can possibly be practical taking into
>>> consideration the constraints imposed by the combination of the GitHub
>>> platform and the ASF release process. I'm all for being idealistic,
>>> but right now we need to be practical. Unless we can devise a
>>> practical procedure that can accommodate at least 1 patch per day
>>> which may touch both code and build system simultaneously without
>>> being a hindrance to contributor or maintainer, I don't see how we can
>>> move forward.
>>>
>>> > That being said, I think it makes sense to merge the codebases in the
>>> > short term with the express purpose of separating them in the near term.
>>>
>>> I would agree but only if separation can be demonstrated to be
>>> practical and result in net improvements in productivity and community
>>> growth. I think experience has clearly demonstrated that the current
>>> separation is impractical, and is causing problems.
>>>
>>> Per Julian's and Ted's comments, I think we need to consider
>>> development process and ASF releases separately. My argument is as
>>> follows:
>>>
>>> * Monorepo for development (for practicality)
>>> * Releases structured according to the desires of the PMCs
>>>
>>> - Wes
>>>
>>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <joshuasto...@gmail.com>
>>> wrote:
>>> > I recently worked on an issue that had to be implemented in parquet-cpp
>>> > (ARROW-1644, ARROW-1599) but required changes in arrow (ARROW-2585,
>>> > ARROW-2586). I found the circular dependencies confusing and hard to work
>>> > with. For example, I still have a PR open in parquet-cpp (created on May
>>> > 10) because of a PR that it depended on in arrow that was recently merged.
>>> > I couldn't even address any CI issues in the PR because the change in arrow
>>> > was not yet in master. In a separate PR, I changed the run_clang_format.py
>>> > script in the arrow project only to find out later that there was an exact
>>> > copy of it in parquet-cpp.
>>> >
>>> > However, I don't think merging the codebases makes sense in the long term.
>>> > I can imagine use cases for parquet that don't involve arrow and tying them
>>> > together seems like the wrong choice. There will be other formats that
>>> > arrow needs to support that will be kept separate (e.g. ORC), so I don't
>>> > see why parquet should be special. I also think build tooling should be
>>> > pulled into its own codebase. GNU has had a long history of developing open
>>> > source C/C++ projects that way and made projects like
>>> > autoconf/automake/make to support them. I don't think CI is a good
>>> > counter-example since there have been lots of successful open source
>>> > projects that have used nightly build systems that pinned versions of
>>> > dependent software.
>>> >
>>> > That being said, I think it makes sense to merge the codebases in the short
>>> > term with the express purpose of separating them in the near term. My
>>> > reasoning is as follows. First, by putting the codebases together, you can
>>> > more easily delineate the boundaries between the APIs with a single PR.
>>> > Second, it will force the build tooling to converge rather than continue
>>> > the divergence that has already happened. Once the boundaries and tooling
>>> > have been sorted out, it should be easy to separate them back into their
>>> > own codebases.
>>> >
>>> > If the codebases are merged, I would ask that the C++ codebases for arrow
>>> > be separated from other languages. Looking at it from the perspective of a
>>> > parquet-cpp library user, having a dependency on Java is a large tax to pay
>>> > if you don't need it. For example, there were 25 JIRAs in the 0.10.0
>>> > release of arrow, many of which were holding up the release. I hope that
>>> > seems like a reasonable compromise, and I think it will help reduce the
>>> > complexity of the build/release tooling.
>>> >
>>> >
>>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
>>> >
>>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>> >>
>>> >> >
>>> >> > > The community will be less willing to accept large
>>> >> > > changes that require multiple rounds of patches for stability and API
>>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS community took a
>>> >> > > significantly long time for the very same reason.
>>> >> >
>>> >> > Please don't use bad experiences from another open source community as
>>> >> > leverage in this discussion. I'm sorry that things didn't go the way
>>> >> > you wanted in Apache Hadoop but this is a distinct community which
>>> >> > happens to operate under a similar open governance model.
>>> >>
>>> >>
>>> >> There are some more radical and community building options as well. Take
>>> >> the subversion project as a precedent. With subversion, any Apache
>>> >> committer can request and receive a commit bit on some large fraction of
>>> >> subversion.
>>> >>
>>> >> So why not take this a bit further and give every parquet committer a
>>> >> commit bit in Arrow? Or even make them be first class committers in Arrow?
>>> >> Possibly even make it policy that every Parquet committer who asks will be
>>> >> given committer status in Arrow.
>>> >>
>>> >> That relieves a lot of the social anxiety here. Parquet committers can't be
>>> >> worried at that point whether their patches will get merged; they can just
>>> >> merge them. Arrow shouldn't worry much about inviting in the Parquet
>>> >> committers. After all, Arrow already depends a lot on parquet so why not
>>> >> invite them in?
>>> >>
>>>
