From my perspective, I agree: I don't think there is a benefit to moving Parquet C++ out of Arrow, given what it would actually cost to make clean boundaries. I also don't think it will hurt iteration speed.
I think the main challenge could be in compatibility testing, but Arrow has solved this between implementations that live in different repositories, so I think the same solutions could apply for Parquet.

On Thu, May 16, 2024 at 12:57 AM Antoine Pitrou <anto...@python.org> wrote:

> On Tue, 14 May 2024 10:22:37 -0700
> Julien Le Dem <jul...@apache.org> wrote:
> > 1. I think we should make it easy for people contributing to the C++
> > codebase. (which is why I voted for the move at the time)
> > 2. If merging repos removes the need to deal with the circular dependency
> > between repos issue for the C++ code bases, it does it at the expense of
> > making it easy to evolve the parquet spec and the java and c++
> > implementations together.
>
> Hmm... I'm not sure I understand your point here. The Parquet spec and
> the Java implementation are already living in distinct repos and have
> distinct versioning schemes. The main thing that they share in common is
> the JIRA instance (while the C++ Parquet implementation mostly relies on
> Arrow's GH issue tracker), but is that really important?
>
> > parquet-cpp depends only on arrow-core that does not have to depend on
> > parquet-cpp.
>
> That is true.
>
> > Other components like
> > arrow-dataset and pyarrow can depend on parquet-cpp just like they depend
> > on orc externally.
>
> Ideally yes. In practice there are two problems:
> 1) it creates a circular dependency between *repositories*.
> 2) the C++ Arrow Datasets component is not built independently, it is an
> optional component when building Arrow C++. So we would also have a
> chicken-and-egg problem when building Arrow C++ and Parquet C++.
>
> > I realize that would be work to make it happen, but the current location of
> > the parquet-cpp codebase is a big trade-off of prioritizing quick iteration
> > on the C++ implementations over iteration on the format.
>
> Having recently worked on a format addition and its respective
> implementations (in Java and C++), I haven't found the current setup
> more difficult to work with for Parquet C++ than it was for Parquet
> Java. Admittedly I'm biased, being a heavy contributor to Arrow C++,
> but I'm curious why the current situation would be detrimental to
> iteration on the format.
>
> Regards
>
> Antoine.