I have a few more logistical questions to add. It will be difficult to track parquet-cpp changes if they get mixed with Arrow changes. Will we establish some guidelines for filing Parquet JIRAs? Can we enforce that parquet-cpp changes will not be committed without a corresponding Parquet JIRA?
I would also like to keep parquet-cpp changes in separate commits to simplify forking later (if needed) and to be able to maintain the commit history. I don't know if it's possible to squash parquet-cpp commits and arrow commits separately before merging. On Tue, Aug 7, 2018 at 8:57 AM Wes McKinney <wesmck...@gmail.com> wrote: > Do other people have opinions? I would like to undertake this work in the near future (the next 8-10 weeks); I would be OK with taking responsibility for the primary codebase surgery. > > Some logistical questions: > > * We have a handful of pull requests in flight in parquet-cpp that would need to be resolved / merged > * We should probably cut a status-quo cpp-1.5.0 release, with future releases cut out of the new structure > * Management of shared commit rights (I can discuss with the Arrow PMC; I believe that approving any committer who has actively maintained parquet-cpp should be a reasonable approach per Ted's comments) > > If working more closely together proves not to be working out after some period of time, I will be fully supportive of a fork or something like it. > > Thanks, > Wes > > On Wed, Aug 1, 2018 at 3:39 PM, Wes McKinney <wesmck...@gmail.com> wrote: > > Thanks Tim. > > > > Indeed, it's not very simple. Just today Antoine cleaned up some platform code intending to improve the performance of bit-packing in Parquet writes, and we ended up with two interdependent PRs: > > > > * https://github.com/apache/parquet-cpp/pull/483 > > * https://github.com/apache/arrow/pull/2355 > > > > Changes that impact the Python interface to Parquet are even more complex. > > > > Adding options to Arrow's CMake build system to only build Parquet-related code and dependencies (in a monorepo framework) would not be difficult, and would amount to writing "make parquet". > > > > See e.g. https://stackoverflow.com/a/17201375. 
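On the question of keeping parquet-cpp's commit history intact through a merge (and a possible later fork): git can carry both histories into one repository. Below is a minimal sketch using two throwaway stand-in repos; the repo names and commit messages are invented for illustration, and a real import would additionally move parquet-cpp's files under a subdirectory (e.g. with git-subtree or git-filter-repo).

```shell
set -e
work=$(mktemp -d) && cd "$work"

# Stand-in "arrow" repo with one commit of its own.
git init -q arrow
git -C arrow -c user.name=dev -c user.email=dev@example.com \
    commit -q --allow-empty -m "ARROW-1: arrow core"

# Stand-in "parquet-cpp" repo whose history we want to preserve.
git init -q parquet-cpp
git -C parquet-cpp -c user.name=dev -c user.email=dev@example.com \
    commit -q --allow-empty -m "PARQUET-1: parquet reader"

# Import parquet-cpp into arrow with full history preserved.
cd arrow
git fetch -q ../parquet-cpp HEAD
git -c user.name=dev -c user.email=dev@example.com \
    merge -q --allow-unrelated-histories -m "Import parquet-cpp" FETCH_HEAD

# Both projects' commits are now reachable from arrow's history.
git log --oneline
```

Because the imported commits stay distinct, forking later would again be possible by filtering that subdirectory's history back out.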
The desired commands to > > build and install the Parquet core libraries and their dependencies > > would be: > > > > ninja parquet && ninja install > > > > - Wes > > > > On Wed, Aug 1, 2018 at 2:34 PM, Tim Armstrong > > <tarmstr...@cloudera.com.invalid> wrote: > >> I don't have a direct stake in this beyond wanting to see Parquet be > >> successful, but I thought I'd give my two cents. > >> > >> For me, the thing that makes the biggest difference in contributing to a > >> new codebase is the number of steps in the workflow for writing, > testing, > >> posting and iterating on a commit and also the number of opportunities > for > >> missteps. The size of the repo and build/test times matter but are > >> secondary so long as the workflow is simple and reliable. > >> > >> I don't really know what the current state of things is, but it sounds > like > >> it's not as simple as check out -> build -> test if you're doing a > >> cross-repo change. Circular dependencies are a real headache. > >> > >> On Tue, Jul 31, 2018 at 2:44 PM, Wes McKinney <wesmck...@gmail.com> > wrote: > >> > >>> hi, > >>> > >>> On Tue, Jul 31, 2018 at 4:56 PM, Deepak Majeti < > majeti.dee...@gmail.com> > >>> wrote: > >>> > I think the circular dependency can be broken if we build a new > library > >>> for > >>> > the platform code. This will also make it easy for other projects > such as > >>> > ORC to use it. > >>> > I also remember your proposal a while ago of having a separate > project > >>> for > >>> > the platform code. That project can live in the arrow repo. > However, one > >>> > has to clone the entire apache arrow repo but can just build the > platform > >>> > code. This will be temporary until we can find a new home for it. 
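The "make parquet" idea discussed above can be pictured as a small CMake fragment. The option and variable names here are hypothetical stand-ins, not Arrow's actual build files:

```cmake
# Hypothetical monorepo CMakeLists.txt fragment: a single option gates the
# Parquet targets, so they can be built with only their own dependencies.
option(ARROW_PARQUET "Build the Parquet core libraries" OFF)

if(ARROW_PARQUET)
  add_library(parquet ${PARQUET_SRCS})
  # Linking against the arrow platform target means "ninja parquet"
  # builds only arrow + parquet, nothing else in the tree.
  target_link_libraries(parquet PRIVATE arrow)
  install(TARGETS parquet LIBRARY DESTINATION lib)
endif()
```

With something like this in place, `cmake -DARROW_PARQUET=ON .. && ninja parquet && ninja install` would give exactly the workflow Wes describes.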
> >>> > > >>> > The dependency will look like: > >>> > libarrow(arrow core / bindings) <- libparquet (parquet core) <- > >>> > libplatform(platform api) > >>> > > >>> > CI workflow will clone the arrow project twice, once for the platform > >>> > library and once for the arrow-core/bindings library. > >>> > >>> This seems like an interesting proposal; the best place to work toward > >>> this goal (if it is even possible; the build system interactions and > >>> ASF release management are the hard problems) is to have all of the > >>> code in a single repository. ORC could already be using Arrow if it > >>> wanted, but the ORC contributors aren't active in Arrow. > >>> > >>> > > >>> > There is no doubt that the collaborations between the Arrow and > Parquet > >>> > communities so far have been very successful. > >>> > The reason to maintain this relationship moving forward is to > continue to > >>> > reap the mutual benefits. > >>> > We should continue to take advantage of sharing code as well. > However, I > >>> > don't see any code sharing opportunities between arrow-core and the > >>> > parquet-core. Both have different functions. > >>> > >>> I think you mean the Arrow columnar format. The Arrow columnar format > >>> is only one part of a project that has become quite large already > >>> ( > https://www.slideshare.net/wesm/apache-arrow-crosslanguage-development- > >>> platform-for-inmemory-data-105427919). > >>> > >>> > > >>> > We are at a point where the parquet-cpp public API is pretty stable. > We > >>> > already passed that difficult stage. My take at arrow and parquet is > to > >>> > keep them nimble since we can. > >>> > >>> I believe that parquet-core has progress to make yet ahead of it. We > >>> have done little work in asynchronous IO and concurrency which would > >>> yield both improved read and write throughput. This aligns well with > >>> other concurrency and async-IO work planned in the Arrow platform. 
I > >>> believe that more development will happen on parquet-core once the > >>> development process issues are resolved by having a single codebase, > >>> single build system, and a single CI framework. > >>> > >>> I have some gripes about design decisions made early in parquet-cpp, > >>> like the use of C++ exceptions. So while "stability" is a reasonable > >>> goal I think we should still be open to making significant changes in > >>> the interest of long term progress. > >>> > >>> Having now worked on these projects for more than 2 and a half years > >>> and the most frequent contributor to both codebases, I'm sadly far > >>> past the "breaking point" and not willing to continue contributing in > >>> a significant way to parquet-cpp if the projects remained structured > >>> as they are now. It's hampering progress and not serving the > >>> community. > >>> > >>> - Wes > >>> > >>> > > >>> > > >>> > > >>> > > >>> > On Tue, Jul 31, 2018 at 3:17 PM Wes McKinney <wesmck...@gmail.com> > >>> wrote: > >>> > > >>> >> > The current Arrow adaptor code for parquet should live in the > arrow > >>> >> repo. That will remove a majority of the dependency issues. Joshua's > >>> work > >>> >> would not have been blocked in parquet-cpp if that adapter was in > the > >>> arrow > >>> >> repo. This will be similar to the ORC adaptor. > >>> >> > >>> >> This has been suggested before, but I don't see how it would > alleviate > >>> >> any issues because of the significant dependencies on other parts of > >>> >> the Arrow codebase. What you are proposing is: > >>> >> > >>> >> - (Arrow) arrow platform > >>> >> - (Parquet) parquet core > >>> >> - (Arrow) arrow columnar-parquet adapter interface > >>> >> - (Arrow) Python bindings > >>> >> > >>> >> To make this work, somehow Arrow core / libarrow would have to be > >>> >> built before invoking the Parquet core part of the build system. 
You > >>> >> would need to pass dependent targets across different CMake build > >>> >> systems; I don't know if it's possible (I spent some time looking > into > >>> >> it earlier this year). This is what I meant by the lack of a > "concrete > >>> >> and actionable plan". The only thing that would really work would be > >>> >> for the Parquet core to be "included" in the Arrow build system > >>> >> somehow rather than using ExternalProject. Currently Parquet builds > >>> >> Arrow using ExternalProject, and Parquet is unknown to the Arrow > build > >>> >> system because it's only depended upon by the Python bindings. > >>> >> > >>> >> And even if a solution could be devised, it would not wholly resolve > >>> >> the CI workflow issues. > >>> >> > >>> >> You could make Parquet completely independent of the Arrow codebase, > >>> >> but at that point there is little reason to maintain a relationship > >>> >> between the projects or their communities. We have spent a great > deal > >>> >> of effort refactoring the two projects to enable as much code > sharing > >>> >> as there is now. > >>> >> > >>> >> - Wes > >>> >> > >>> >> On Tue, Jul 31, 2018 at 2:29 PM, Wes McKinney <wesmck...@gmail.com> > >>> wrote: > >>> >> >> If you still strongly feel that the only way forward is to clone > the > >>> >> parquet-cpp repo and part ways, I will withdraw my concern. Having > two > >>> >> parquet-cpp repos is no way a better approach. > >>> >> > > >>> >> > Yes, indeed. In my view, the next best option after a monorepo is > to > >>> >> > fork. That would obviously be a bad outcome for the community. > >>> >> > > >>> >> > It doesn't look like I will be able to convince you that a > monorepo is > >>> >> > a good idea; what I would ask instead is that you be willing to > give > >>> >> > it a shot, and if it turns out in the way you're describing > (which I > >>> >> > don't think it will) then I suggest that we fork at that point. 
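The target-passing problem described earlier in the thread (Parquet building Arrow via ExternalProject, with no visibility into its targets) can be sketched as follows; paths and names are illustrative rather than the projects' real build files:

```cmake
# Status quo (sketch): parquet-cpp drives an Arrow build via ExternalProject.
# The sub-build is opaque, so parquet can only depend on the whole step.
include(ExternalProject)
ExternalProject_Add(arrow_ep
  GIT_REPOSITORY https://github.com/apache/arrow.git
  INSTALL_COMMAND "")
add_library(parquet ${PARQUET_SRCS})
add_dependencies(parquet arrow_ep)  # coarse-grained; no real target deps

# Monorepo (sketch): one CMake build, so parquet depends on the actual
# arrow target and CMake resolves dependencies at target granularity.
#   add_subdirectory(src/arrow)
#   add_library(parquet ${PARQUET_SRCS})
#   target_link_libraries(parquet PRIVATE arrow)
```

This is why "including" Parquet core in the Arrow build system, rather than stitching two CMake builds together, is the only variant that gives real target-level dependencies.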
> >>> >> > - Wes > >>> >> > > On Tue, Jul 31, 2018 at 2:14 PM, Deepak Majeti < > >>> majeti.dee...@gmail.com> wrote: > >>> >> >> Wes, > >>> >> >> > >>> >> >> Unfortunately, I cannot show you any practical fact-based problems of a non-existent Arrow-Parquet mono-repo. > >>> >> >> Bringing in related Apache community experiences is more meaningful than pointing to how mono-repos work at Google and other big organizations. > >>> >> >> We solely depend on volunteers and cannot hire full-time developers. > >>> >> >> You are very well aware of how difficult it has been to find more contributors and maintainers for Arrow. parquet-cpp already has a low contribution rate to its core components. > >>> >> >> > >>> >> >> We should ensure that new volunteers who want to contribute bug-fixes/features spend the least amount of time figuring out the project repo. We can never come up with an automated build system that caters to every possible environment. > >>> >> >> My only concern is whether the mono-repo will make it harder for new developers to work on parquet-cpp core just due to the additional code, build, and test dependencies. > >>> >> >> I am not saying that the Arrow community/committers will be less co-operative. > >>> >> >> I just don't think the mono-repo structure model will be sustainable in an open source community unless there are long-term vested interests. We can't predict that. > >>> >> >> > >>> >> >> The current circular dependency problems between Arrow and Parquet are a major issue for the community, and it is important to solve them. > >>> >> >> > >>> >> >> The current Arrow adaptor code for parquet should live in the arrow repo. 
> >>> >> >> That will remove a majority of the dependency issues. > >>> >> >> Joshua's work would not have been blocked in parquet-cpp if that adapter was in the arrow repo. This will be similar to the ORC adaptor. > >>> >> >> > >>> >> >> The platform API code is pretty stable at this point. Minor changes in the future to this code should not be the main reason to combine the arrow and parquet repos. > >>> >> >> > >>> >> >> "*I question whether it's worth the community's time long term to wear ourselves out defining custom "ports" / virtual interfaces in each library to plug components together rather than utilizing common platform APIs.*" > >>> >> >> > >>> >> >> My answer to your question below would be "Yes". Modularity/separation is very important in an open source community where priorities of contributors are often short term. > >>> >> >> The retention is low and therefore the acquisition costs should be low as well. This is the community-over-code approach according to me. Minor code duplication is not a deal breaker. > >>> >> >> ORC, Parquet, Arrow, etc. are all different components in the big data space serving their own functions. > >>> >> >> > >>> >> >> If you still strongly feel that the only way forward is to clone the parquet-cpp repo and part ways, I will withdraw my concern. Having two parquet-cpp repos is in no way a better approach. 
> >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> > >>> >> >> On Tue, Jul 31, 2018 at 10:28 AM Wes McKinney < > wesmck...@gmail.com> > >>> >> wrote: > >>> >> >> > >>> >> >>> @Antoine > >>> >> >>> > >>> >> >>> > By the way, one concern with the monorepo approach: it would > >>> slightly > >>> >> >>> increase Arrow CI times (which are already too large). > >>> >> >>> > >>> >> >>> A typical CI run in Arrow is taking about 45 minutes: > >>> >> >>> https://travis-ci.org/apache/arrow/builds/410119750 > >>> >> >>> > >>> >> >>> Parquet run takes about 28 > >>> >> >>> https://travis-ci.org/apache/parquet-cpp/builds/410147208 > >>> >> >>> > >>> >> >>> Inevitably we will need to create some kind of bot to run > certain > >>> >> >>> builds on-demand based on commit / PR metadata or on request. > >>> >> >>> > >>> >> >>> The slowest build in Arrow (the Arrow C++/Python one) build > could be > >>> >> >>> made substantially shorter by moving some of the slower parts > (like > >>> >> >>> the Python ASV benchmarks) from being tested every-commit to > nightly > >>> >> >>> or on demand. Using ASAN instead of valgrind in Travis would > also > >>> >> >>> improve build times (valgrind build could be moved to a nightly > >>> >> >>> exhaustive test run) > >>> >> >>> > >>> >> >>> - Wes > >>> >> >>> > >>> >> >>> On Mon, Jul 30, 2018 at 10:54 PM, Wes McKinney < > wesmck...@gmail.com > >>> > > >>> >> >>> wrote: > >>> >> >>> >> I would like to point out that arrow's use of orc is a great > >>> >> example of > >>> >> >>> how it would be possible to manage parquet-cpp as a separate > >>> codebase. > >>> >> That > >>> >> >>> gives me hope that the projects could be managed separately some > >>> day. > >>> >> >>> > > >>> >> >>> > Well, I don't know that ORC is the best example. 
The ORC C++ codebase > >>> > features several areas of duplicated logic which could be replaced by components from the Arrow platform for better platform-wide interoperability: > >>> > > >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/OrcFile.hh#L37 > >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/Int128.hh > >>> > https://github.com/apache/orc/blob/master/c%2B%2B/include/orc/MemoryPool.hh > >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/InputStream.hh > >>> > https://github.com/apache/orc/blob/master/c%2B%2B/src/io/OutputStream.hh > >>> > > >>> > ORC's use of symbols from Protocol Buffers was actually a cause of bugs that we had to fix in Arrow's build system to prevent them from leaking to third-party linkers when statically linked (ORC is only available for static linking at the moment AFAIK). > >>> > > >>> > I question whether it's worth the community's time long term to wear ourselves out defining custom "ports" / virtual interfaces in each library to plug components together rather than utilizing common platform APIs. > >>> > > >>> > - Wes > >>> > > >>> > On Mon, Jul 30, 2018 at 10:45 PM, Joshua Storck < joshuasto...@gmail.com> wrote: > >>> >> Your point about the constraints of the ASF release process is well taken, and as a developer who's trying to work in the current environment I would be much happier if the codebases were merged. 
The main > >>> issues > >>> >> I > >>> >> >>> worry > >>> >> >>> >> about when you put codebases like these together are: > >>> >> >>> >> > >>> >> >>> >> 1. The delineation of API's become blurred and the code > becomes > >>> too > >>> >> >>> coupled > >>> >> >>> >> 2. Release of artifacts that are lower in the dependency > tree are > >>> >> >>> delayed > >>> >> >>> >> by artifacts higher in the dependency tree > >>> >> >>> >> > >>> >> >>> >> If the project/release management is structured well and > someone > >>> >> keeps > >>> >> >>> an > >>> >> >>> >> eye on the coupling, then I don't have any concerns. > >>> >> >>> >> > >>> >> >>> >> I would like to point out that arrow's use of orc is a great > >>> >> example of > >>> >> >>> how > >>> >> >>> >> it would be possible to manage parquet-cpp as a separate > >>> codebase. > >>> >> That > >>> >> >>> >> gives me hope that the projects could be managed separately > some > >>> >> day. > >>> >> >>> >> > >>> >> >>> >> On Mon, Jul 30, 2018 at 10:23 PM Wes McKinney < > >>> wesmck...@gmail.com> > >>> >> >>> wrote: > >>> >> >>> >> > >>> >> >>> >>> hi Josh, > >>> >> >>> >>> > >>> >> >>> >>> > I can imagine use cases for parquet that don't involve > arrow > >>> and > >>> >> >>> tying > >>> >> >>> >>> them together seems like the wrong choice. > >>> >> >>> >>> > >>> >> >>> >>> Apache is "Community over Code"; right now it's the same > people > >>> >> >>> >>> building these projects -- my argument (which I think you > agree > >>> >> with?) > >>> >> >>> >>> is that we should work more closely together until the > community > >>> >> grows > >>> >> >>> >>> large enough to support larger-scope process than we have > now. > >>> As > >>> >> >>> >>> you've seen, our process isn't serving developers of these > >>> >> projects. > >>> >> >>> >>> > >>> >> >>> >>> > I also think build tooling should be pulled into its own > >>> >> codebase. 
> >>> >> >>> >>> > >>> >> >>> >>> I don't see how this can possibly be practical taking into > >>> >> >>> >>> consideration the constraints imposed by the combination of > the > >>> >> GitHub > >>> >> >>> >>> platform and the ASF release process. I'm all for being > >>> idealistic, > >>> >> >>> >>> but right now we need to be practical. Unless we can devise > a > >>> >> >>> >>> practical procedure that can accommodate at least 1 patch > per > >>> day > >>> >> >>> >>> which may touch both code and build system simultaneously > >>> without > >>> >> >>> >>> being a hindrance to contributor or maintainer, I don't see > how > >>> we > >>> >> can > >>> >> >>> >>> move forward. > >>> >> >>> >>> > >>> >> >>> >>> > That being said, I think it makes sense to merge the > codebases > >>> >> in the > >>> >> >>> >>> short term with the express purpose of separating them in > the > >>> near > >>> >> >>> term. > >>> >> >>> >>> > >>> >> >>> >>> I would agree but only if separation can be demonstrated to > be > >>> >> >>> >>> practical and result in net improvements in productivity and > >>> >> community > >>> >> >>> >>> growth. I think experience has clearly demonstrated that the > >>> >> current > >>> >> >>> >>> separation is impractical, and is causing problems. > >>> >> >>> >>> > >>> >> >>> >>> Per Julian's and Ted's comments, I think we need to consider > >>> >> >>> >>> development process and ASF releases separately. 
My > argument is > >>> as > >>> >> >>> >>> follows: > >>> >> >>> >>> > >>> >> >>> >>> * Monorepo for development (for practicality) > >>> >> >>> >>> * Releases structured according to the desires of the PMCs > >>> >> >>> >>> > >>> >> >>> >>> - Wes > >>> >> >>> >>> > >>> >> >>> >>> On Mon, Jul 30, 2018 at 9:31 PM, Joshua Storck < > >>> >> joshuasto...@gmail.com > >>> >> >>> > > >>> >> >>> >>> wrote: > >>> >> >>> >>> > I recently worked on an issue that had to be implemented > in > >>> >> >>> parquet-cpp > >>> >> >>> >>> > (ARROW-1644, ARROW-1599) but required changes in arrow > >>> >> (ARROW-2585, > >>> >> >>> >>> > ARROW-2586). I found the circular dependencies confusing > and > >>> >> hard to > >>> >> >>> work > >>> >> >>> >>> > with. For example, I still have a PR open in parquet-cpp > >>> >> (created on > >>> >> >>> May > >>> >> >>> >>> > 10) because of a PR that it depended on in arrow that was > >>> >> recently > >>> >> >>> >>> merged. > >>> >> >>> >>> > I couldn't even address any CI issues in the PR because > the > >>> >> change in > >>> >> >>> >>> arrow > >>> >> >>> >>> > was not yet in master. In a separate PR, I changed the > >>> >> >>> >>> run_clang_format.py > >>> >> >>> >>> > script in the arrow project only to find out later that > there > >>> >> was an > >>> >> >>> >>> exact > >>> >> >>> >>> > copy of it in parquet-cpp. > >>> >> >>> >>> > > >>> >> >>> >>> > However, I don't think merging the codebases makes sense > in > >>> the > >>> >> long > >>> >> >>> >>> term. > >>> >> >>> >>> > I can imagine use cases for parquet that don't involve > arrow > >>> and > >>> >> >>> tying > >>> >> >>> >>> them > >>> >> >>> >>> > together seems like the wrong choice. There will be other > >>> formats > >>> >> >>> that > >>> >> >>> >>> > arrow needs to support that will be kept separate (e.g. - > >>> Orc), > >>> >> so I > >>> >> >>> >>> don't > >>> >> >>> >>> > see why parquet should be special. 
I also think build > tooling > >>> >> should > >>> >> >>> be > >>> >> >>> >>> > pulled into its own codebase. GNU has had a long history > of > >>> >> >>> developing > >>> >> >>> >>> open > >>> >> >>> >>> > source C/C++ projects that way and made projects like > >>> >> >>> >>> > autoconf/automake/make to support them. I don't think CI > is a > >>> >> good > >>> >> >>> >>> > counter-example since there have been lots of successful > open > >>> >> source > >>> >> >>> >>> > projects that have used nightly build systems that pinned > >>> >> versions of > >>> >> >>> >>> > dependent software. > >>> >> >>> >>> > > >>> >> >>> >>> > That being said, I think it makes sense to merge the > codebases > >>> >> in the > >>> >> >>> >>> short > >>> >> >>> >>> > term with the express purpose of separating them in the > near > >>> >> term. > >>> >> >>> My > >>> >> >>> >>> > reasoning is as follows. By putting the codebases > together, > >>> you > >>> >> can > >>> >> >>> more > >>> >> >>> >>> > easily delineate the boundaries between the API's with a > >>> single > >>> >> PR. > >>> >> >>> >>> Second, > >>> >> >>> >>> > it will force the build tooling to converge instead of > >>> diverge, > >>> >> >>> which has > >>> >> >>> >>> > already happened. Once the boundaries and tooling have > been > >>> >> sorted > >>> >> >>> out, > >>> >> >>> >>> it > >>> >> >>> >>> > should be easy to separate them back into their own > codebases. > >>> >> >>> >>> > > >>> >> >>> >>> > If the codebases are merged, I would ask that the C++ > >>> codebases > >>> >> for > >>> >> >>> arrow > >>> >> >>> >>> > be separated from other languages. Looking at it from the > >>> >> >>> perspective of > >>> >> >>> >>> a > >>> >> >>> >>> > parquet-cpp library user, having a dependency on Java is a > >>> large > >>> >> tax > >>> >> >>> to > >>> >> >>> >>> pay > >>> >> >>> >>> > if you don't need it. 
For example, there were 25 JIRA's > in the > >>> >> 0.10.0 > >>> >> >>> >>> > release of arrow, many of which were holding up the > release. I > >>> >> hope > >>> >> >>> that > >>> >> >>> >>> > seems like a reasonable compromise, and I think it will > help > >>> >> reduce > >>> >> >>> the > >>> >> >>> >>> > complexity of the build/release tooling. > >>> >> >>> >>> > > >>> >> >>> >>> > > >>> >> >>> >>> > On Mon, Jul 30, 2018 at 8:50 PM Ted Dunning < > >>> >> ted.dunn...@gmail.com> > >>> >> >>> >>> wrote: > >>> >> >>> >>> > > >>> >> >>> >>> >> On Mon, Jul 30, 2018 at 5:39 PM Wes McKinney < > >>> >> wesmck...@gmail.com> > >>> >> >>> >>> wrote: > >>> >> >>> >>> >> > >>> >> >>> >>> >> > > >>> >> >>> >>> >> > > The community will be less willing to accept large > >>> >> >>> >>> >> > > changes that require multiple rounds of patches for > >>> >> stability > >>> >> >>> and > >>> >> >>> >>> API > >>> >> >>> >>> >> > > convergence. Our contributions to Libhdfs++ in the > HDFS > >>> >> >>> community > >>> >> >>> >>> took > >>> >> >>> >>> >> a > >>> >> >>> >>> >> > > significantly long time for the very same reason. > >>> >> >>> >>> >> > > >>> >> >>> >>> >> > Please don't use bad experiences from another open > source > >>> >> >>> community as > >>> >> >>> >>> >> > leverage in this discussion. I'm sorry that things > didn't > >>> go > >>> >> the > >>> >> >>> way > >>> >> >>> >>> >> > you wanted in Apache Hadoop but this is a distinct > >>> community > >>> >> which > >>> >> >>> >>> >> > happens to operate under a similar open governance > model. > >>> >> >>> >>> >> > >>> >> >>> >>> >> > >>> >> >>> >>> >> There are some more radical and community building > options as > >>> >> well. > >>> >> >>> Take > >>> >> >>> >>> >> the subversion project as a precedent. With subversion, > any > >>> >> Apache > >>> >> >>> >>> >> committer can request and receive a commit bit on some > large > >>> >> >>> fraction of > >>> >> >>> >>> >> subversion. 
> >>> >> >>> >>> >> > >>> >> >>> >>> >> So why not take this a bit further and give every parquet > >>> >> committer > >>> >> >>> a > >>> >> >>> >>> >> commit bit in Arrow? Or even make them be first class > >>> >> committers in > >>> >> >>> >>> Arrow? > >>> >> >>> >>> >> Possibly even make it policy that every Parquet > committer who > >>> >> asks > >>> >> >>> will > >>> >> >>> >>> be > >>> >> >>> >>> >> given committer status in Arrow. > >>> >> >>> >>> >> > >>> >> >>> >>> >> That relieves a lot of the social anxiety here. Parquet > >>> >> committers > >>> >> >>> >>> can't be > >>> >> >>> >>> >> worried at that point whether their patches will get > merged; > >>> >> they > >>> >> >>> can > >>> >> >>> >>> just > >>> >> >>> >>> >> merge them. Arrow shouldn't worry much about inviting > in the > >>> >> >>> Parquet > >>> >> >>> >>> >> committers. After all, Arrow already depends a lot on > >>> parquet so > >>> >> >>> why not > >>> >> >>> >>> >> invite them in? > >>> >> >>> >>> >> > >>> >> >>> >>> > >>> >> >>> > >>> >> >> > >>> >> >> > >>> >> >> -- > >>> >> >> regards, > >>> >> >> Deepak Majeti > >>> >> > >>> > > >>> > > >>> > -- > >>> > regards, > >>> > Deepak Majeti > >>> > -- regards, Deepak Majeti