Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Wes McKinney Wed, 01 Aug 2018 12:40:58 -0700

Thanks Tim.

Indeed, it's not very simple. Just today Antoine cleaned up some
platform code intending to improve the performance of bit-packing in
Parquet writes, and we ended up with 2 interdependent PRs:


* https://github.com/apache/parquet-cpp/pull/483
* https://github.com/apache/arrow/pull/2355

Changes that impact the Python interface to Parquet are even more complex.

Adding options to Arrow's CMake build system to only build
Parquet-related code and dependencies (in a monorepo framework) would
not be difficult, and would amount to writing "make parquet".

See e.g. https://stackoverflow.com/a/17201375. The desired commands to
build and install the Parquet core libraries and their dependencies
would be:

ninja parquet && ninja install
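
To make the idea concrete, here is a minimal sketch (in Python, not
CMake) of what a "parquet" target would mean in a monorepo: building it
pulls in only its transitive dependencies and nothing else. The target
names and the graph are hypothetical, loosely following the layering
discussed in this thread; they are not the actual Arrow CMake targets.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical target graph: each target maps to the targets it depends on.
# The names are illustrative only, not real Arrow or Parquet CMake targets.
TARGETS = {
    "arrow_platform": set(),
    "arrow_columnar": {"arrow_platform"},
    "parquet": {"arrow_platform"},
    "parquet_arrow": {"parquet", "arrow_columnar"},
    "python_bindings": {"parquet_arrow"},
}


def build_order(goal, targets):
    """Return the goal and its transitive dependencies in build order."""
    needed = {}
    stack = [goal]
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed[name] = targets[name]
        stack.extend(targets[name])
    # static_order() yields dependencies before the targets that need them
    # and raises graphlib.CycleError if the graph contains a cycle.
    return list(TopologicalSorter(needed).static_order())


print(build_order("parquet", TARGETS))  # ['arrow_platform', 'parquet']
```

With this graph, asking for "parquet" yields only the platform code and
the Parquet core; the adapter and the Python bindings are left out.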

- Wes

On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong
<[email protected]> wrote:
> I don't have a direct stake in this beyond wanting to see Parquet be
> successful, but I thought I'd give my two cents.
>
> For me, the thing that makes the biggest difference in contributing to a
> new codebase is the number of steps in the workflow for writing, testing,
> posting and iterating on a commit and also the number of opportunities for
> missteps. The size of the repo and build/test times matter but are
> secondary so long as the workflow is simple and reliable.
>
> I don't really know what the current state of things is, but it sounds like
> it's not as simple as check out -> build -> test if you're doing a
> cross-repo change. Circular dependencies are a real headache.
>
> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <[email protected]> wrote:
>
>> hi,
>>
>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti <[email protected]>
>> wrote:
>> > I think the circular dependency can be broken if we build a new library
>> for
>> > the platform code. This will also make it easy for other projects such as
>> > ORC to use it.
>> > I also remember your proposal a while ago of having a separate project
>> for
>> > the platform code.  That project can live in the arrow repo. However, one
>> > has to clone the entire apache arrow repo but can just build the platform
>> > code. This will be temporary until we can find a new home for it.
>> >
>> > The dependency will look like:
>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <-
>> > libplatform(platform api)
>> >
>> > CI workflow will clone the arrow project twice, once for the platform
>> > library and once for the arrow-core/bindings library.
>>
>> This seems like an interesting proposal; the best place to work toward
>> this goal (if it is even possible; the build system interactions and
>> ASF release management are the hard problems) is to have all of the
>> code in a single repository. ORC could already be using Arrow if it
>> wanted, but the ORC contributors aren't active in Arrow.
>>
>> >
>> > There is no doubt that the collaborations between the Arrow and Parquet
>> > communities so far have been very successful.
>> > The reason to maintain this relationship moving forward is to continue to
>> > reap the mutual benefits.
>> > We should continue to take advantage of sharing code as well. However, I
>> > don't see any code sharing opportunities between arrow-core and the
>> > parquet-core. Both have different functions.
>>
>> I think you mean the Arrow columnar format. The Arrow columnar format
>> is only one part of a project that has become quite large already
>> (https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development-platform-for-inmemory-data-105427919).
>>
>> >
>> > We are at a point where the parquet-cpp public API is pretty stable. We
>> > already passed that difficult stage. My take at arrow and parquet is to
>> > keep them nimble since we can.
>>
>> I believe that parquet-core has progress to make yet ahead of it. We
>> have done little work in asynchronous IO and concurrency which would
>> yield both improved read and write throughput. This aligns well with
>> other concurrency and async-IO work planned in the Arrow platform. I
>> believe that more development will happen on parquet-core once the
>> development process issues are resolved by having a single codebase,
>> single build system, and a single CI framework.
>>
>> I have some gripes about design decisions made early in parquet-cpp,
>> like the use of C++ exceptions. So while "stability" is a reasonable
>> goal I think we should still be open to making significant changes in
>> the interest of long term progress.
>>
>> Having now worked on these projects for more than 2 and a half years
>> as the most frequent contributor to both codebases, I'm sadly far
>> past the "breaking point" and not willing to continue contributing in
>> a significant way to parquet-cpp if the projects remain structured
>> as they are now. It's hampering progress and not serving the
>> community.
>>
>> - Wes
>>
>> >
>> >
>> >
>> >
>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <[email protected]>
>> wrote:
>> >
>> >> > The current Arrow adaptor code for parquet should live in the arrow
>> >> repo. That will remove a majority of the dependency issues. Joshua's
>> work
>> >> would not have been blocked in parquet-cpp if that adapter was in the
>> arrow
>> >> repo.  This will be similar to the ORC adaptor.
>> >>
>> >> This has been suggested before, but I don't see how it would alleviate
>> >> any issues because of the significant dependencies on other parts of
>> >> the Arrow codebase. What you are proposing is:
>> >>
>> >> - (Arrow) arrow platform
>> >> - (Parquet) parquet core
>> >> - (Arrow) arrow columnar-parquet adapter interface
>> >> - (Arrow) Python bindings
>> >>
>> >> To make this work, somehow Arrow core / libarrow would have to be
>> >> built before invoking the Parquet core part of the build system. You
>> >> would need to pass dependent targets across different CMake build
>> >> systems; I don't know if it's possible (I spent some time looking into
>> >> it earlier this year). This is what I meant by the lack of a "concrete
>> >> and actionable plan". The only thing that would really work would be
>> >> for the Parquet core to be "included" in the Arrow build system
>> >> somehow rather than using ExternalProject. Currently Parquet builds
>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow build
>> >> system because it's only depended upon by the Python bindings.
>> >>
>> >> And even if a solution could be devised, it would not wholly resolve
>> >> the CI workflow issues.
>> >>
>> >> You could make Parquet completely independent of the Arrow codebase,
>> >> but at that point there is little reason to maintain a relationship
>> >> between the projects or their communities. We have spent a great deal
>> >> of effort refactoring the two projects to enable as much code sharing
>> >> as there is now.
>> >>
>> >> - Wes
>> >>
>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <[email protected]>
>> wrote:
>> >> >> If you still strongly feel that the only way forward is to clone the
>> >> parquet-cpp repo and part ways, I will withdraw my concern. Having two
>> >> parquet-cpp repos is in no way a better approach.
>> >> >
>> >> > Yes, indeed. In my view, the next best option after a monorepo is to
>> >> > fork. That would obviously be a bad outcome for the community.
>> >> >
>> >> > It doesn't look like I will be able to convince you that a monorepo is
>> >> > a good idea; what I would ask instead is that you be willing to give
>> >> > it a shot, and if it turns out in the way you're describing (which I
>> >> > don't think it will) then I suggest that we fork at that point.
>> >> >
>> >> > - Wes
>> >> >
>> >> > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti <
>> [email protected]>
>> >> wrote:
>> >> >> Wes,
>> >> >>
>> >> >> Unfortunately, I cannot show you any practical fact-based problems
>> of a
>> >> >> non-existent Arrow-Parquet mono-repo.
>> >> >> Bringing in related Apache community experiences is more meaningful
>> >> than
>> >> >> how mono-repos work at Google and other big organizations.
>> >> >> We solely depend on volunteers and cannot hire full-time developers.
>> >> >> You are very well aware of how difficult it has been to find more
>> >> >> contributors and maintainers for Arrow. parquet-cpp already has a low
>> >> >> contribution rate to its core components.
>> >> >>
>> >> >> We should target to ensure that new volunteers who want to contribute
>> >> >> bug-fixes/features should spend the least amount of time in figuring
>> out
>> >> >> the project repo. We can never come up with an automated build system
>> >> that
>> >> >> caters to every possible environment.
>> >> >> My only concern is if the mono-repo will make it harder for new
>> >> developers
>> >> >> to work on parquet-cpp core just due to the additional code, build
>> and
>> >> test
>> >> >> dependencies.
>> >> >> I am not saying that the Arrow community/committers will be less
>> >> >> co-operative.
>> >> >> I just don't think the mono-repo structure model will be sustainable
>> in
>> >> an
>> >> >> open source community unless there are long-term vested interests. We
>> >> can't
>> >> >> predict that.
>> >> >>
>> >> >> The current circular dependency problems between Arrow and Parquet
>> are a
>> >> >> major problem for the community and it is important.
>> >> >>
>> >> >> The current Arrow adaptor code for parquet should live in the arrow
>> >> repo.
>> >> >> That will remove a majority of the dependency issues.
>> >> >> Joshua's work would not have been blocked in parquet-cpp if that
>> adapter
>> >> >> was in the arrow repo.  This will be similar to the ORC adaptor.
>> >> >>
>> >> >> The platform API code is pretty stable at this point. Minor changes
>> in
>> >> the
>> >> >> future to this code should not be the main reason to combine the
>> arrow
>> >> >> parquet repos.
>> >> >>
>> >> >> "
>> >> >> *I question whether it's worth the community's time long term to
>> wear*
>> >> >>
>> >> >>
>> >> >> *ourselves out defining custom "ports" / virtual interfaces in
>> >> each library
>> >> >> to plug components together rather than utilizing common platform
>> APIs.*"
>> >> >>
>> >> >> My answer to your question below would be "Yes".
>> Modularity/separation
>> >> is
>> >> >> very important in an open source community where priorities of
>> >> contributors
>> >> >> are often short term.
>> >> >> The retention is low and therefore the acquisition costs should be
>> low
>> >> as
>> >> >> well. This is the community over code approach according to me. Minor
>> >> code
>> >> >> duplication is not a deal breaker.
>> >> >> ORC, Parquet, Arrow, etc. are all different components in the big
>> data
>> >> >> space serving their own functions.
>> >> >>
>> >> >> If you still strongly feel that the only way forward is to clone the
>> >> >> parquet-cpp repo and part ways, I will withdraw my concern. Having
>> two
>> >> >> parquet-cpp repos is in no way a better approach.
>> >> >>
>> >> >>
>> >> >>
>> >> >>
>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney <[email protected]>
>> >> wrote:
>> >> >>
>> >> >>> @Antoine
>> >> >>>
>> >> >>> > By the way, one concern with the monorepo approach: it would
>> slightly
>> >> >>> increase Arrow CI times (which are already too large).
>> >> >>>
>> >> >>> A typical CI run in Arrow is taking about 45 minutes:
>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750
>> >> >>>
>> >> >>> A Parquet run takes about 28 minutes:
>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208
>> >> >>>
>> >> >>> Inevitably we will need to create some kind of bot to run certain
>> >> >>> builds on-demand based on commit / PR metadata or on request.
>> >> >>>
>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) could be
>> >> >>> made substantially shorter by moving some of the slower parts (like
>> >> >>> the Python ASV benchmarks) from being tested every-commit to nightly
>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would also
>> >> >>> improve build times (valgrind build could be moved to a nightly
>> >> >>> exhaustive test run)
>> >> >>>
>> >> >>> - Wes
>> >> >>>
>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney <[email protected]
>> >
>> >> >>> wrote:
>> >> >>> >> I would like to point out that arrow's use of orc is a great
>> >> example of
>> >> >>> how it would be possible to manage parquet-cpp as a separate
>> codebase.
>> >> That
>> >> >>> gives me hope that the projects could be managed separately some
>> day.
>> >> >>> >
>> >> >>> > Well, I don't know that ORC is the best example. The ORC C++
>> codebase
>> >> >>> > features several areas of duplicated logic which could be
>> replaced by
>> >> >>> > components from the Arrow platform for better platform-wide
>> >> >>> > interoperability:
>> >> >>> >
>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37
>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh
>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh
>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh
>> >> >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh
>> >> >>> >
>> >> >>> > ORC's use of symbols from Protocol Buffers was actually a cause of
>> >> >>> > bugs that we had to fix in Arrow's build system to prevent them
>> from
>> >> >>> > leaking to third party linkers when statically linked (ORC is only
>> >> >>> > available for static linking at the moment AFAIK).
>> >> >>> >
>> >> >>> > I question whether it's worth the community's time long term to
>> wear
>> >> >>> > ourselves out defining custom "ports" / virtual interfaces in each
>> >> >>> > library to plug components together rather than utilizing common
>> >> >>> > platform APIs.
>> >> >>> >
>> >> >>> > - Wes
>> >> >>> >
>> >> >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck <
>> >> [email protected]>
>> >> >>> wrote:
>> >> >>> >> Your point about the constraints of the ASF release process is
>> >> well
>> >> >>> >> taken and as a developer who's trying to work in the current
>> >> >>> environment I
>> >> >>> >> would be much happier if the codebases were merged. The main
>> issues
>> >> I
>> >> >>> worry
>> >> >>> >> about when you put codebases like these together are:
>> >> >>> >>
>> >> >>> >> 1. The delineation of APIs becomes blurred and the code becomes
>> too
>> >> >>> coupled
>> >> >>> >> 2. Release of artifacts that are lower in the dependency tree are
>> >> >>> delayed
>> >> >>> >> by artifacts higher in the dependency tree
>> >> >>> >>
>> >> >>> >> If the project/release management is structured well and someone
>> >> keeps
>> >> >>> an
>> >> >>> >> eye on the coupling, then I don't have any concerns.
>> >> >>> >>
>> >> >>> >> I would like to point out that arrow's use of orc is a great
>> >> example of
>> >> >>> how
>> >> >>> >> it would be possible to manage parquet-cpp as a separate
>> codebase.
>> >> That
>> >> >>> >> gives me hope that the projects could be managed separately some
>> >> day.
>> >> >>> >>
>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney <
>> [email protected]>
>> >> >>> wrote:
>> >> >>> >>
>> >> >>> >>> hi Josh,
>> >> >>> >>>
>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow
>> and
>> >> >>> tying
>> >> >>> >>> them together seems like the wrong choice.
>> >> >>> >>>
>> >> >>> >>> Apache is "Community over Code"; right now it's the same people
>> >> >>> >>> building these projects -- my argument (which I think you agree
>> >> with?)
>> >> >>> >>> is that we should work more closely together until the community
>> >> grows
>> >> >>> >>> large enough to support larger-scope process than we have now.
>> As
>> >> >>> >>> you've seen, our process isn't serving developers of these
>> >> projects.
>> >> >>> >>>
>> >> >>> >>> > I also think build tooling should be pulled into its own
>> >> codebase.
>> >> >>> >>>
>> >> >>> >>> I don't see how this can possibly be practical taking into
>> >> >>> >>> consideration the constraints imposed by the combination of the
>> >> GitHub
>> >> >>> >>> platform and the ASF release process. I'm all for being
>> idealistic,
>> >> >>> >>> but right now we need to be practical. Unless we can devise a
>> >> >>> >>> practical procedure that can accommodate at least 1 patch per
>> day
>> >> >>> >>> which may touch both code and build system simultaneously
>> without
>> >> >>> >>> being a hindrance to contributor or maintainer, I don't see how
>> we
>> >> can
>> >> >>> >>> move forward.
>> >> >>> >>>
>> >> >>> >>> > That being said, I think it makes sense to merge the codebases
>> >> in the
>> >> >>> >>> short term with the express purpose of separating them in the
>> near
>> >> >>> term.
>> >> >>> >>>
>> >> >>> >>> I would agree but only if separation can be demonstrated to be
>> >> >>> >>> practical and result in net improvements in productivity and
>> >> community
>> >> >>> >>> growth. I think experience has clearly demonstrated that the
>> >> current
>> >> >>> >>> separation is impractical, and is causing problems.
>> >> >>> >>>
>> >> >>> >>> Per Julian's and Ted's comments, I think we need to consider
>> >> >>> >>> development process and ASF releases separately. My argument is
>> as
>> >> >>> >>> follows:
>> >> >>> >>>
>> >> >>> >>> * Monorepo for development (for practicality)
>> >> >>> >>> * Releases structured according to the desires of the PMCs
>> >> >>> >>>
>> >> >>> >>> - Wes
>> >> >>> >>>
>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck <
>> >> [email protected]
>> >> >>> >
>> >> >>> >>> wrote:
>> >> >>> >>> > I recently worked on an issue that had to be implemented in
>> >> >>> parquet-cpp
>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow
>> >> (ARROW-2585,
>> >> >>> >>> > ARROW-2586). I found the circular dependencies confusing and
>> >> hard to
>> >> >>> work
>> >> >>> >>> > with. For example, I still have a PR open in parquet-cpp
>> >> (created on
>> >> >>> May
>> >> >>> >>> > 10) because of a PR that it depended on in arrow that was
>> >> recently
>> >> >>> >>> merged.
>> >> >>> >>> > I couldn't even address any CI issues in the PR because the
>> >> change in
>> >> >>> >>> arrow
>> >> >>> >>> > was not yet in master. In a separate PR, I changed the
>> >> >>> >>> run_clang_format.py
>> >> >>> >>> > script in the arrow project only to find out later that there
>> >> was an
>> >> >>> >>> exact
>> >> >>> >>> > copy of it in parquet-cpp.
>> >> >>> >>> >
>> >> >>> >>> > However, I don't think merging the codebases makes sense in
>> the
>> >> long
>> >> >>> >>> term.
>> >> >>> >>> > I can imagine use cases for parquet that don't involve arrow
>> and
>> >> >>> tying
>> >> >>> >>> them
>> >> >>> >>> > together seems like the wrong choice. There will be other
>> formats
>> >> >>> that
>> >> >>> >>> > arrow needs to support that will be kept separate (e.g. -
>> Orc),
>> >> so I
>> >> >>> >>> don't
>> >> >>> >>> > see why parquet should be special. I also think build tooling
>> >> should
>> >> >>> be
>> >> >>> >>> > pulled into its own codebase. GNU has had a long history of
>> >> >>> developing
>> >> >>> >>> open
>> >> >>> >>> > source C/C++ projects that way and made projects like
>> >> >>> >>> > autoconf/automake/make to support them. I don't think CI is a
>> >> good
>> >> >>> >>> > counter-example since there have been lots of successful open
>> >> source
>> >> >>> >>> > projects that have used nightly build systems that pinned
>> >> versions of
>> >> >>> >>> > dependent software.
>> >> >>> >>> >
>> >> >>> >>> > That being said, I think it makes sense to merge the codebases
>> >> in the
>> >> >>> >>> short
>> >> >>> >>> > term with the express purpose of separating them in the near
>> >> term.
>> >> >>> My
>> >> >>> >>> > reasoning is as follows. By putting the codebases together,
>> you
>> >> can
>> >> >>> more
>> >> >>> >>> > easily delineate the boundaries between the API's with a
>> single
>> >> PR.
>> >> >>> >>> Second,
>> >> >>> >>> > it will force the build tooling to converge instead of
>> diverge,
>> >> >>> which has
>> >> >>> >>> > already happened. Once the boundaries and tooling have been
>> >> sorted
>> >> >>> out,
>> >> >>> >>> it
>> >> >>> >>> > should be easy to separate them back into their own codebases.
>> >> >>> >>> >
>> >> >>> >>> > If the codebases are merged, I would ask that the C++
>> codebases
>> >> for
>> >> >>> arrow
>> >> >>> >>> > be separated from other languages. Looking at it from the
>> >> >>> perspective of
>> >> >>> >>> a
>> >> >>> >>> > parquet-cpp library user, having a dependency on Java is a
>> large
>> >> tax
>> >> >>> to
>> >> >>> >>> pay
>> >> >>> >>> > if you don't need it. For example, there were 25 JIRA's in the
>> >> 0.10.0
>> >> >>> >>> > release of arrow, many of which were holding up the release. I
>> >> hope
>> >> >>> that
>> >> >>> >>> > seems like a reasonable compromise, and I think it will help
>> >> reduce
>> >> >>> the
>> >> >>> >>> > complexity of the build/release tooling.
>> >> >>> >>> >
>> >> >>> >>> >
>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning <
>> >> [email protected]>
>> >> >>> >>> wrote:
>> >> >>> >>> >
>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney <
>> >> [email protected]>
>> >> >>> >>> wrote:
>> >> >>> >>> >>
>> >> >>> >>> >> >
>> >> >>> >>> >> > > The community will be less willing to accept large
>> >> >>> >>> >> > > changes that require multiple rounds of patches for
>> >> stability
>> >> >>> and
>> >> >>> >>> API
>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the HDFS
>> >> >>> community
>> >> >>> >>> took
>> >> >>> >>> >> a
>> >> >>> >>> >> > > significantly long time for the very same reason.
>> >> >>> >>> >> >
>> >> >>> >>> >> > Please don't use bad experiences from another open source
>> >> >>> community as
>> >> >>> >>> >> > leverage in this discussion. I'm sorry that things didn't
>> go
>> >> the
>> >> >>> way
>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct
>> community
>> >> which
>> >> >>> >>> >> > happens to operate under a similar open governance model.
>> >> >>> >>> >>
>> >> >>> >>> >>
>> >> >>> >>> >> There are some more radical and community building options as
>> >> well.
>> >> >>> Take
>> >> >>> >>> >> the subversion project as a precedent. With subversion, any
>> >> Apache
>> >> >>> >>> >> committer can request and receive a commit bit on some large
>> >> >>> fraction of
>> >> >>> >>> >> subversion.
>> >> >>> >>> >>
>> >> >>> >>> >> So why not take this a bit further and give every parquet
>> >> committer
>> >> >>> a
>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class
>> >> committers in
>> >> >>> >>> Arrow?
>> >> >>> >>> >> Possibly even make it policy that every Parquet committer who
>> >> asks
>> >> >>> will
>> >> >>> >>> be
>> >> >>> >>> >> given committer status in Arrow.
>> >> >>> >>> >>
>> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet
>> >> committers
>> >> >>> >>> can't be
>> >> >>> >>> >> worried at that point whether their patches will get merged;
>> >> they
>> >> >>> can
>> >> >>> >>> just
>> >> >>> >>> >> merge them.  Arrow shouldn't worry much about inviting in the
>> >> >>> Parquet
>> >> >>> >>> >> committers. After all, Arrow already depends a lot on
>> parquet so
>> >> >>> why not
>> >> >>> >>> >> invite them in?
>> >> >>> >>> >>
>> >> >>> >>>
>> >> >>>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> regards,
>> >> >> Deepak Majeti
>> >>
>> >
>> >
>> > --
>> > regards,
>> > Deepak Majeti
>>
