Could this work if each module were configured as a git sub-repo? The
top-level build tool would go into each sub-repo and pick the correct release
version to test. The Python tests would depend on the cpp sub-repo to ensure
the API still passes.

This should be the best of both worlds, if sub-repos are a supported option.
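
A minimal sketch of what the top-level build tool might do, assuming the
sub-repos are plain git submodules. The paths, release tags, and the
checkout_commands helper are hypothetical and only for illustration; the
script builds the git commands and prints them rather than running them.

```python
import shlex

# Hypothetical pins: sub-repo path -> release tag to test against.
PINNED_RELEASES = {
    "cpp": "v1.0.0",
    "python": "v1.0.0",
}


def checkout_commands(pins):
    """Return the git commands a top-level build tool could run, in order."""
    # Make sure every sub-repo is present before pinning versions.
    commands = [["git", "submodule", "update", "--init", "--recursive"]]
    for path, tag in sorted(pins.items()):
        # Fetch release tags, then check out the pinned one in that sub-repo.
        commands.append(["git", "-C", path, "fetch", "--tags"])
        commands.append(["git", "-C", path, "checkout", tag])
    return commands


if __name__ == "__main__":
    for cmd in checkout_commands(PINNED_RELEASES):
        print(shlex.join(cmd))
```

The Python tests would then run against whatever release of the cpp sub-repo
is pinned in that mapping.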

--Donald E. Foss

On Sun, Jul 29, 2018, 10:44 PM Deepak Majeti <majeti.dee...@gmail.com>
wrote:

> I dislike the current build system complications as well.
>
> However, in my opinion, combining the code bases will severely impact the
> progress of the parquet-cpp project and implicitly the progress of the
> entire parquet project.
> Combining would have made much more sense if parquet-cpp were a mature
> project and codebase. But parquet-cpp (and the entire parquet project) is
> evolving continuously, with new features being added including bloom
> filters, column encryption, and indexes.
>
> If the two code bases are merged, it will be much more difficult to contribute
> to the parquet-cpp project since now Arrow bindings have to be supported as
> well. Please correct me if I am wrong here.
>
> Of the two evils, I think handling the build system and packaging
> duplication is much more manageable, since they are quite stable at this
> point.
>
> Regarding "* API changes cause awkward release coordination issues between
> Arrow and Parquet". Can we make minor releases for parquet-cpp (with API
> changes needed) as and when Arrow is released?
>
> Regarding "we maintain Arrow conversion code in parquet-cpp for
> converting between Arrow columnar memory format and Parquet". Can this be
> moved to the Arrow project, leaving parquet-cpp to expose the more stable
> low-level APIs?
>
> I am also curious if the Arrow and Parquet Java implementations have
> similar API compatibility issues.
>
>
> On Sat, Jul 28, 2018 at 7:45 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi folks,
> >
> > We've been struggling for quite some time with the development
> > workflow between the Arrow and Parquet C++ (and Python) codebases.
> >
> > To explain the root issues:
> >
> > * parquet-cpp depends on "platform code" in Apache Arrow; this
> > includes file interfaces, memory management, miscellaneous algorithms
> > (e.g. dictionary encoding), etc. Note that before this "platform"
> > dependency was introduced, there was significant duplicated code
> > between these codebases and incompatible abstract interfaces for
> > things like files
> >
> > * we maintain Arrow conversion code in parquet-cpp for converting
> > between Arrow columnar memory format and Parquet
> >
> > * we maintain Python bindings for parquet-cpp + Arrow interop in
> > Apache Arrow. This introduces a circular dependency into our CI.
> >
> > * Substantial portions of our CMake build system and related tooling
> > are duplicated between the Arrow and Parquet repos
> >
> > * API changes cause awkward release coordination issues between Arrow
> > and Parquet
> >
> > I believe the best way to remedy the situation is to adopt a
> > "Community over Code" approach and find a way for the Parquet and
> > Arrow C++ development communities to operate out of the same code
> > repository, i.e. the apache/arrow git repository.
> >
> > This would bring major benefits:
> >
> > * Shared CMake build infrastructure, developer tools, and CI
> > infrastructure (Parquet is already being built as a dependency in
> > Arrow's CI systems)
> >
> > * Share packaging and release management infrastructure
> >
> > * Reduce / eliminate problems due to API changes (where we currently
> > introduce breakage into our CI workflow when there is a breaking /
> > incompatible change)
> >
> > * Arrow releases would include a coordinated snapshot of the Parquet
> > implementation as it stands
> >
> > Continuing with the status quo has become unsatisfactory to me and as
> > a result I've become less motivated to work on the parquet-cpp
> > codebase.
> >
> > The only Parquet C++ committer who is not an Arrow committer is Deepak
> > Majeti. I think the issue of commit privileges could be resolved
> > without too much difficulty or time.
> >
> > I also think that the Apache Parquet community could create release
> > scripts to cut a minimal versioned Apache Parquet C++ release, if that
> > is deemed truly necessary.
> >
> > I know that some people are wary of monorepos and megaprojects, but as
> > an example TensorFlow is at least 10 times as large a project in
> > terms of LOC and number of different platform components, and it
> > seems to be getting along just fine. I think we should be able to work
> > together as a community to function just as well.
> >
> > Interested in the opinions of others, and any other ideas for
> > practical solutions to the above problems.
> >
> > Thanks,
> > Wes
> >
>
>
> --
> regards,
> Deepak Majeti
>
