I think these are both interesting areas to explore further. I'd like to focus on a couple of immediate items that I think we should address:
* Should optional components be "opt in", "opt out", or a mix? Currently it's a mix, and that's confusing for people. I think we should make them all "opt in".
* Do we want to bring the out-of-the-box core build down to zero dependencies, including not depending on boost::filesystem and possibly checking in the compiled Flatbuffers files? While it may be slightly more maintenance work, I think the optics of a "dependency-free" core build would be beneficial and help the project marketing-wise.

Both of these issues must be addressed whether we undertake a Bazel implementation or some other refactor of the C++ build system.

On Wed, Sep 18, 2019 at 2:48 AM Uwe L. Korn <uw...@xhochy.com> wrote:
>
> Hello Micah,
>
> I don't think we have explored using bazel yet. I would see it as a possible
> modular alternative but as you mention it will be a lot of work and we would
> probably need a mentor who is familiar with bazel, otherwise we probably end
> up spending too much time on this and get a non-typical bazel setup.
>
> Uwe
>
> On Wed, Sep 18, 2019, at 8:44 AM, Micah Kornfield wrote:
> > It has come up in the past, but I wonder if exploring Bazel as a build
> > system with its very explicit dependency graph might help (I'm not sure
> > if something similar is available in CMake).
> >
> > This is also a lot of work, but could also potentially benefit the
> > developer experience because we can make unit tests depend on individual
> > compilable units instead of all of libarrow. There are trade-offs here as
> > well in terms of public API coverage.
> >
> > On Tue, Sep 17, 2019 at 11:14 PM Uwe L. Korn <uw...@xhochy.com> wrote:
> >
> > > Hello,
> > >
> > > I can think of two other alternatives that make it more visible what Arrow
> > > core is and what are the optional components:
> > >
> > > * Error out when no component is selected instead of building just the
> > > core Arrow.
> > > Here we could add an explanatory message that lists all
> > > components and for each component 2-3 words what it does and what it
> > > requires. This would make the first-time experience much better.
> > > * Split the CMake project into several subprojects. By correctly
> > > structuring the CMakefiles, we should be able to separate out the Arrow
> > > components into separate CMake projects that can be built independently if
> > > needed while all using the same third-party toolchain. We would still have
> > > a top-level CMakeLists.txt that is invoked just like the current one but
> > > through having subprojects, you would not anymore be bound to use the
> > > single top-level one. This would also have some benefit for packagers that
> > > could separate out the build of individual Arrow modules. Furthermore, it
> > > would also make it easier for PoC/academic projects to just take the Arrow
> > > Core sources and drop it in as a CMake subproject; while this is not a good
> > > solution for production-grade software, it is quite common practice to do
> > > this in research.
> > > I really like this approach and I think this is something we should have
> > > as a long-term target, I'm also happy to implement given the time but I
> > > think one CMake refactor per year is the maximum I can do and that was
> > > already eaten up by the dependency detection. Also, I'm unsure about how
> > > much this would block us at the moment vs the marketing benefit of having a
> > > more modular Arrow; currently I'm leaning on the side that the
> > > marketing/adoption benefit would be much larger but we lack someone
> > > frustration-tolerant to do the refactoring.
> > >
> > > Uwe
> > >
> > > On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> > > > hi folks,
> > > >
> > > > Lately there seem to be more and more people suggesting that the
> > > > optional components in the Arrow C++ project are getting in the way of
> > > > using the "core" which implements the columnar format and IPC
> > > > protocol. I am not sure I agree with this argument, but in general I
> > > > think it would be a good idea to make all optional components in the
> > > > project "opt in" rather than "opt out"
> > > >
> > > > To demonstrate where things currently stand, I created a Dockerfile to
> > > > try to make the smallest possible and most dependency-free build
> > > >
> > > > https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
> > > >
> > > > Here is the output of this build
> > > >
> > > > https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
> > > >
> > > > First, let's look at the CMake invocation
> > > >
> > > > cmake .. -DBOOST_SOURCE=BUNDLED \
> > > >     -DARROW_BOOST_USE_SHARED=OFF \
> > > >     -DARROW_COMPUTE=OFF \
> > > >     -DARROW_DATASET=OFF \
> > > >     -DARROW_JEMALLOC=OFF \
> > > >     -DARROW_JSON=ON \
> > > >     -DARROW_USE_GLOG=OFF \
> > > >     -DARROW_WITH_BZ2=OFF \
> > > >     -DARROW_WITH_ZLIB=OFF \
> > > >     -DARROW_WITH_ZSTD=OFF \
> > > >     -DARROW_WITH_LZ4=OFF \
> > > >     -DARROW_WITH_SNAPPY=OFF \
> > > >     -DARROW_WITH_BROTLI=OFF \
> > > >     -DARROW_BUILD_UTILITIES=OFF
> > > >
> > > > Aside from the issue of how to obtain and link Boost, here's a couple of
> > > > things:
> > > >
> > > > * COMPUTE and DATASET IMHO should be off by default
> > > > * All compression libraries should be turned off
> > > > * GLOG should be off by default
> > > > * Utilities should be off (they are used for integration testing)
> > > > * Jemalloc should probably be off, but we should make it clear that
> > > >   opting in will yield better performance
> > > >
> > > > I found that it wasn't possible to set ARROW_JSON=OFF without breaking
> > > > the build. I opened ARROW-6590 to fix this
> > > >
> > > > Aside from potentially changing these defaults, there's some things in
> > > > the build that we might want to turn into optional pieces:
> > > >
> > > > * We should see if we can make boost::filesystem not mandatory in the
> > > >   barebones build, if only to satisfy the peanut gallery
> > > > * double-conversion is used in the CSV module. I think that
> > > >   double-conversion_ep and the CSV module should both be made opt-in
> > > > * rapidjson_ep should be made optional. JSON support is only needed
> > > >   for integration testing
> > > >
> > > > We could also discuss vendoring flatbuffers.h so that flatbuffers_ep
> > > > is not mandatory.
> > > >
> > > > In general, enabling optional components is primarily relevant for
> > > > packagers. If we implement these changes, a number of package build
> > > > scripts will have to change.
> > > >
> > > > Thanks,
> > > > Wes
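
For reference, here is a minimal CMake sketch of what "opt in" defaults could look like. It is an illustration only and not the actual Arrow CMakeLists.txt: the project name and the help strings are made up, and only the option names (ARROW_COMPUTE, ARROW_DATASET, ARROW_JSON, ARROW_WITH_ZSTD) are taken from the invocation quoted in this thread.

```cmake
# Hypothetical sketch, NOT the real Arrow build files.
cmake_minimum_required(VERSION 3.10)
project(arrow_optin_sketch LANGUAGES NONE)

# Every optional component defaults to OFF, so a plain `cmake ..`
# would configure only the core. Help strings are illustrative.
option(ARROW_COMPUTE "Build the compute component" OFF)
option(ARROW_DATASET "Build the dataset component" OFF)
option(ARROW_JSON "Build JSON support" OFF)
option(ARROW_WITH_ZSTD "Build with zstd compression support" OFF)

# Report which optional components the user explicitly opted into.
foreach(component ARROW_COMPUTE ARROW_DATASET ARROW_JSON ARROW_WITH_ZSTD)
  if(${component})
    message(STATUS "Optional component enabled: ${component}")
  endif()
endforeach()
```

With defaults like these, a packager would opt in explicitly, for example with `cmake .. -DARROW_COMPUTE=ON -DARROW_WITH_ZSTD=ON`, instead of switching off everything they do not need.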