Not the nesting, but pulling a lot of unused files.

On Wed, Oct 17, 2018 at 12:39 PM Wes McKinney <wesmck...@gmail.com> wrote:
> Why would one level of directory nesting cause awkwardness (curious)?
>
> On Wed, Oct 17, 2018, 12:28 PM Francois Saint-Jacques <fsaintjacq...@networkdump.com> wrote:
>
>> One point toward separate repositories: vendoring Arrow for a C++ project
>> with git submodules becomes awkward if it's a multi-lang monorepo.
>>
>> On Tue, Oct 16, 2018 at 9:22 PM Wes McKinney <wesmck...@gmail.com> wrote:
>>
>> > I would also add -- Krisztian's recent work Dockerizing the project is
>> > setting us up to be able to decouple ourselves from Travis CI. We need
>> > build hosts where we can use Docker to be able to do this, though.
>> > Preferably the build hosts would have NVIDIA GPUs so we can use
>> > nvidia-docker to test our GPU functionality
>> > On Tue, Oct 16, 2018 at 9:09 PM Wes McKinney <wesmck...@gmail.com> wrote:
>> > >
>> > > hi Antoine,
>> > >
>> > > Some small critiques to the listing of implementations:
>> > >
>> > > * The Java library predates the C++ library (it originated in Apache Drill)
>> > > * Python and C++ both interact with the Java library in different
>> > > ways. There's JNI for Gandiva and Plasma, and Python uses Java via
>> > > JPype in unit tests
>> > >
>> > > There's some critical questions to answer here:
>> > >
>> > > 1. Is there such a thing as an "independent implementation"?
>> > > 2. What's the best way to manage changesets / patches?
>> > > 3. What is the best way to manage the burgeoning complexity of testing
>> > > and verification of the entire project?
>> > > 4. How much longer will public CI services be adequate for our needs?
>> > >
>> > > This may be a bit long-winded so bear with me
>> > >
>> > > 1. Is there such a thing as an "independent implementation"?
>> > >
>> > > My answer to this is actually "not really". The reasons are as follows:
>> > >
>> > > * The integration tests are one of the most important parts of the
>> > > project.
>> > > While C++, Java, and JavaScript are the only participants, we
>> > > eventually need Rust, Go, and C# to be in the matrix. This will
>> > > include integration testing for RPC / Flight in addition to the
>> > > current IPC tests.
>> > > * By the nature of Arrow, any implementation may build in-memory or
>> > > RPC-based bindings to computational libraries that are in C++ or use
>> > > LLVM, such as Gandiva and Plasma. This is already the case in Java,
>> > > and may expand beyond Java. I could see Go or Rust or C# using Gandiva
>> > > or Plasma. The scope of what kinds of shared infrastructure might be
>> > > used in multiple languages will only expand over time
>> > >
>> > > 2. What's the best way to manage changesets / patches?
>> > >
>> > > * Because no two implementations can be guaranteed to be independent,
>> > > in a non-monorepo setup, changes may require multiple patches.
>> > > Verifying "joint patches" is likely to require manual / human
>> > > intervention in ways that are a non-issue for a monorepo
>> > > * Splitting development up into multiple repositories will decrease
>> > > visibility into the patch queues in the less active subprojects. I'm
>> > > strongly in support not only of a single codebase but a single patch
>> > > queue. I admit that seeing ~70 open pull requests on Arrow stresses me
>> > > out a bit, but having 70 patches spread across 5 repos would be more
>> > > stressful for me at least
>> > > * Broken builds in any part of the project should be a concern to the
>> > > entire community -- we should not have broken builds. I'd be concerned
>> > > about having any part of the project becoming a "ghetto" if the
>> > > plurality of developers are working elsewhere with an "out of sight,
>> > > out of mind" mindset
>> > >
>> > > To play devil's advocate, some web applications could be developed to
>> > > create the appearance of a unified patch queue across many repos.
>> > >
>> > > That being said, our patch queue pales in comparison to some larger /
>> > > more mature ASF projects:
>> > >
>> > > * Spark has 523 open PRs: https://github.com/apache/spark/pulls
>> > > * Airflow has 218 open PRs: https://github.com/apache/incubator-airflow/pulls
>> > > * Hadoop has 195 open PRs: https://github.com/apache/hadoop/pulls
>> > >
>> > > 3. What is the best way to manage the burgeoning complexity of testing
>> > > and verification of the entire project?
>> > > 4. How much longer will public CI services be adequate for our needs?
>> > >
>> > > I think we are already reaching the limits of what we can reasonably
>> > > accomplish with public CI services. Apache Arrow is a project with
>> > > sophistication and scope that is destined to outgrow what Travis CI
>> > > can provide within the scope of a single implementation, i.e.
>> > > C++/Python. For example, we're going to be past the 50 minute time
>> > > limit before too long. I think that continuing to constrain ourselves
>> > > by the 50 minute time limit will also limit the scope of what kinds of
>> > > automated testing we can employ, to our long term detriment. We also
>> > > have things (like GPU support) that we cannot test there.
>> > >
>> > > Considering more mature data projects in the ASF that I'm familiar
>> > > with: Kudu, Impala, Spark: none of these projects use Travis CI. Their
>> > > testing uses Jenkins build slaves and runs much longer than our CI
>> > > jobs. If we used beefier build slaves, our builds would also run much
>> > > faster.
>> > >
>> > > So, what should we do? Well, part of why I have recently created an
>> > > organization (https://ursalabs.org/) dedicated to Arrow development is
>> > > to have the financial means and the engineering resources to actually
>> > > do something about problems like these.
>> > > I would propose to make an
>> > > investment of hardware and engineering time to augment our ability to
>> > > test the repository to make sure we can manage 5-10x the current test
>> > > runtime that we have now. If I have to personally halt feature
>> > > development and focus on build and development tooling for a while, so
>> > > be it. We've already spent many months this year on packaging
>> > > automation but we are still coming up short in development tooling. If
>> > > anyone reading has funds to invest in hardware resources, please let
>> > > me know.
>> > >
>> > > As Clint Eastwood's character said in "The Good, The Bad, and The
>> > > Ugly", "$200,000 is a lot of money. We're gonna have to earn it."
>> > >
>> > > FWIW: I am not sure Parquet is a good example of a better way to be.
>> > > Parquet lacks automated integration tests (terrifying to me) and
>> > > failed to grow a community outside of the Java world until 2016 when a
>> > > few of us started building out the C++ library.
>> > >
>> > > - Wes
>> > > On Tue, Oct 16, 2018 at 1:02 PM Antoine Pitrou <anto...@python.org> wrote:
>> > > >
>> > > >
>> > > > Hello,
>> > > >
>> > > > We are quickly growing the number of Arrow implementations. Soon we'll
>> > > > have:
>> > > > - C++: the most mature, reference, and historical implementation
>> > > > - Python: linked with Arrow C++
>> > > > - C/GLib: linked with Arrow C++
>> > > > - Ruby: linked with Arrow C++ (indirectly through C/GLib)
>> > > > - R: linked with Arrow C++
>> > > > - Matlab: linked with Arrow C++
>> > > > - Java: independent implementation
>> > > > - Rust: independent implementation
>> > > > - Go: independent implementation
>> > > > - Javascript: independent implementation
>> > > > - .Net (C#): independent implementation
>> > > >
>> > > > This creates various kinds of issues. Technical issues such as CI
>> > > > matrices being more and more large and complex.
>> > > > Social issues such as
>> > > > different implementations having different development speeds and
>> > > > maturity, and the fact that development teams are effectively disjoint
>> > > > (for example, whoever develops on the C++ codebase usually doesn't
>> > > > develop on the Rust codebase, and vice-versa).
>> > > >
>> > > > I'm not proposing anything concrete here, but would like to ask what
>> > > > people think of moving independent implementations (those that don't
>> > > > depend on Arrow C++) into independent repositories. This would let them
>> > > > define their own workflow, permissions, teams, CI configurations and
>> > > > whatnot. This would also allow growing the CI matrix for the main repo
>> > > > without reaching humongous sizes. The implementations would still be
>> > > > under the umbrella of the Apache Arrow project; but they would exist as
>> > > > independent GitHub projects (this is a bit how Parquet implementations
>> > > > are handled, AFAIK).
>> > > >
>> > > > To start with, Wes expressed opposition to the idea:
>> > > > """
>> > > > I am against breaking up the monorepo -- I think that we should scale
>> > > > our process using tools that we develop rather than conforming to the
>> > > > objectively crude affordances of Travis CI and Appveyor. Implementations
>> > > > that are independent now may not be so in the future by the nature of
>> > > > the project -- any implementation could integrate with Gandiva, for
>> > > > example, and that would become much more difficult to develop if the
>> > > > code is fragmented in multiple repositories.
>> > > > """
>> > > >
>> > > > (https://github.com/apache/arrow/pull/2765#issuecomment-430224701)
>> > > >
>> > > > Regards
>> > > >
>> > > > Antoine.
>> >
>>
>> --
>> Sent from my jetpack.
>>
>

--
Sent from my jetpack.