Re: [C++] Parquet and Arrow overlap

Micah Kornfield Thu, 16 May 2024 01:00:51 -0700

From my perspective, I agree: I don't think there is a benefit to moving
Parquet C++ out of Arrow given what it would actually cost to make clean
boundaries.  I also don't think it will hurt iteration speed.


I think the main challenge could be in compatibility testing, but Arrow has
solved this between implementations that live in different repositories so
I think the same solutions could apply for Parquet.

On Thu, May 16, 2024 at 12:57 AM Antoine Pitrou <anto...@python.org> wrote:

> On Tue, 14 May 2024 10:22:37 -0700
> Julien Le Dem <jul...@apache.org> wrote:
> > 1. I think we should make it easy for people contributing to the C++
> > codebase. (which is why I voted for the move at the time)
> > 2. If merging repos removes the need to deal with the circular dependency
> > between repos issue for the C++ code bases, it does it at the expense of
> > making it easy to evolve the parquet spec and the java and c++
> > implementations together.
>
> Hmm... I'm not sure I understand your point here. The Parquet spec and
> the Java implementation are already living in distinct repos and have
> distinct versioning schemes. The main thing that they share in common is
> the JIRA instance (while the C++ Parquet implementation mostly relies on
> Arrow's GH issue tracker), but is that really important?
>
> > parquet-cpp depends only on arrow-core that does not have to depend on
> > parquet-cpp.
>
> That is true.
>
> > Other components like
> > arrow-dataset and pyarrow can depend on parquet-cpp just like they depend
> > on orc externally.
>
> Ideally yes. In practice there are two problems:
> 1) it creates a circular dependency between *repositories*.
> 2) the C++ Arrow Datasets component is not built independently, it is an
> optional component when building Arrow C++. So we would also have a
> chicken-and-egg problem when building Arrow C++ and Parquet C++.
>
> > I realize that would be work to make it happen, but the current location of
> > the parquet-cpp codebase is a big trade-off of prioritizing quick iteration
> > on the C++ implementations over iteration on the format.
>
> Having recently worked on a format addition and its respective
> implementations (in Java and C++), I haven't found the current setup
> more difficult to work with for Parquet C++ than it was for Parquet
> Java. Admittedly I'm biased, being a heavy contributor to Arrow C++,
> but I'm curious why the current situation would be detrimental to
> iteration on the format.
>
> Regards
>
> Antoine.
