Re: [DISCUSS] Solutions for improving the Arrow-Parquet C++ development morass

Deepak Majeti Mon, 30 Jul 2018 14:19:20 -0700

@Wes
My observation is that most of the parquet-cpp contributors you listed who
overlap with the Arrow community mainly contribute to the Arrow bindings
(the parquet::arrow layer) and platform API changes in the parquet-cpp
repo. Very few of them review or contribute patches to the parquet-cpp core.


I believe improvements to the parquet-cpp core will be negatively impacted,
since merging the parquet-cpp and arrow-cpp repos will raise the barrier
to entry for new contributors interested in the parquet-cpp core. The
current extensions to the parquet-cpp core related to Bloom filters and
column encryption are all being done by first-time contributors.

If you believe there will be new interest in the parquet-cpp core with the
mono-repo approach, I am all for it.


On Mon, Jul 30, 2018 at 12:18 AM Philipp Moritz <pcmor...@gmail.com> wrote:

> I do not claim to have insight into parquet-cpp development. However, from
> our experience developing Ray, I can say that the monorepo approach (for
> Ray) has improved things a lot. Before, we tried various schemes to split
> the project into multiple repos, but the build system and test
> infrastructure duplications and overhead from synchronizing changes slowed
> development down significantly (and fixing bugs that touch the subrepos and
> the main repo is inconvenient).
>
> Also the decision to put arrow and parquet-cpp into a common repo is
> independent of how tightly coupled the two projects are (and there could be
> a matrix entry in travis which tests that PRs keep them decoupled, or
> rather that they both just depend on a small common "base"). Google and
> Facebook demonstrate such independence by having many many projects in the
> same repo of course. It would be great if the open source community would
> move more into this direction too I think.
>
> Best,
> Philipp.
>
> On Sun, Jul 29, 2018 at 8:54 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Donald,
> >
> > This would make things worse, not better. Code changes routinely
> > involve changes to the build system, and so you could be talking about
> > having to make changes to 2 or 3 git repositories as the result of a
> > single new feature or bug fix. There isn't really a cross-repo CI
> > solution available.
> >
> > I've seen some approaches to the monorepo problem using multiple git
> > repositories, such as
> >
> > https://github.com/twosigma/git-meta
> >
> > Until something like this has first class support by the GitHub
> > platform and its CI services (Travis CI, Appveyor), I don't think it
> > will work for us.
> >
> > - Wes
> >
> > On Sun, Jul 29, 2018 at 10:54 PM, Donald E. Foss <donald.f...@gmail.com>
> > wrote:
> > > Could this work if each module were configured as a git sub-repo? A
> > > top-level build tool would go into each sub-repo and pick the correct
> > > release version to test. The tests in Python depend on the cpp
> > > sub-repo, to ensure the API still passes.
> > >
> > > This should be the best of both worlds, if sub-repos are a supported
> > > option.
> > >
> > > --Donald E. Foss
> > >
> > > On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <majeti.dee...@gmail.com>
> > > wrote:
> > >
> > >> I dislike the current build system complications as well.
> > >>
> > >> However, in my opinion, combining the code bases will severely impact
> > >> the progress of the parquet-cpp project and, implicitly, the progress
> > >> of the entire parquet project.
> > >> Combining would have made much more sense if parquet-cpp were a mature
> > >> project and codebase. But parquet-cpp (and the entire parquet project)
> > >> is evolving continuously, with new features being added including bloom
> > >> filters, column encryption, and indexes.
> > >>
> > >> If the two code bases are merged, it will be much more difficult to
> > >> contribute to the parquet-cpp project, since the Arrow bindings will
> > >> have to be supported as well. Please correct me if I am wrong here.
> > >>
> > >> Of the two evils, I think handling the build system and packaging
> > >> duplication is much more manageable, since they are quite stable at
> > >> this point.
> > >>
> > >> Regarding "* API changes cause awkward release coordination issues
> > >> between Arrow and Parquet": can we make minor releases of parquet-cpp
> > >> (with the API changes needed) as and when Arrow is released?
> > >>
> > >> Regarding "we maintain Arrow conversion code in parquet-cpp for
> > >> converting between Arrow columnar memory format and Parquet": can this
> > >> be moved to the Arrow project, with parquet-cpp exposing the more
> > >> stable low-level APIs?
> > >>
> > >> I am also curious if the Arrow and Parquet Java implementations have
> > >> similar API compatibility issues.
> > >>
> > >>
> > >> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > >>
> > >> > hi folks,
> > >> >
> > >> > We've been struggling for quite some time with the development
> > >> > workflow between the Arrow and Parquet C++ (and Python) codebases.
> > >> >
> > >> > To explain the root issues:
> > >> >
> > >> > * parquet-cpp depends on "platform code" in Apache Arrow; this
> > >> > includes file interfaces, memory management, miscellaneous
> > >> > algorithms (e.g. dictionary encoding), etc. Note that before this
> > >> > "platform" dependency was introduced, there was significant
> > >> > duplicated code between these codebases and incompatible abstract
> > >> > interfaces for things like files
> > >> >
> > >> > * we maintain Arrow conversion code in parquet-cpp for converting
> > >> > between the Arrow columnar memory format and Parquet
> > >> >
> > >> > * we maintain Python bindings for parquet-cpp + Arrow interop in
> > >> > Apache Arrow. This introduces a circular dependency into our CI.
> > >> >
> > >> > * Substantial portions of our CMake build system and related tooling
> > >> > are duplicated between the Arrow and Parquet repos
> > >> >
> > >> > * API changes cause awkward release coordination issues between
> > >> > Arrow and Parquet
> > >> >
> > >> > I believe the best way to remedy the situation is to adopt a
> > >> > "Community over Code" approach and find a way for the Parquet and
> > >> > Arrow C++ development communities to operate out of the same code
> > >> > repository, i.e. the apache/arrow git repository.
> > >> >
> > >> > This would bring major benefits:
> > >> >
> > >> > * Shared CMake build infrastructure, developer tools, and CI
> > >> > infrastructure (Parquet is already being built as a dependency in
> > >> > Arrow's CI systems)
> > >> >
> > >> > * Share packaging and release management infrastructure
> > >> >
> > >> > * Reduce / eliminate problems due to API changes (where we currently
> > >> > introduce breakage into our CI workflow when there is a breaking /
> > >> > incompatible change)
> > >> >
> > >> > * Arrow releases would include a coordinated snapshot of the Parquet
> > >> > implementation as it stands
> > >> >
> > >> > Continuing with the status quo has become unsatisfactory to me and,
> > >> > as a result, I've become less motivated to work on the parquet-cpp
> > >> > codebase.
> > >> >
> > >> > The only Parquet C++ committer who is not an Arrow committer is
> > >> > Deepak Majeti. I think the issue of commit privileges could be
> > >> > resolved without too much difficulty or time.
> > >> >
> > >> > I also think that, if it is deemed truly necessary, the Apache
> > >> > Parquet community could create release scripts to cut a minimal
> > >> > versioned Apache Parquet C++ release.
> > >> >
> > >> > I know that some people are wary of monorepos and megaprojects, but
> > >> > as an example TensorFlow is at least 10 times as large a project in
> > >> > terms of LOCs and number of different platform components, and it
> > >> > seems to be getting along just fine. I think we should be able to
> > >> > work together as a community to function just as well.
> > >> >
> > >> > Interested in the opinions of others, and any other ideas for
> > >> > practical solutions to the above problems.
> > >> >
> > >> > Thanks,
> > >> > Wes
> > >> >
> > >>
> > >>
> > >> --
> > >> regards,
> > >> Deepak Majeti
> > >>
> >
>


-- 
regards,
Deepak Majeti
