It has come up in the past, but I wonder if exploring Bazel as a build
system with its very explicit dependency graph might help (I'm not sure
if something similar is available in CMake).

This would also be a lot of work, but it could improve the developer
experience because we could make unit tests depend on individual
compilable units instead of all of libarrow.  There are trade-offs here as
well in terms of public API coverage.
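
To make the idea concrete, here is a rough sketch in plain Python (not
Bazel, and not anything taken from the Arrow build) of how per-unit test
dependencies shrink the set of tests touched by a change. The unit names
and dependency edges are made up for illustration.

```python
# Hypothetical units and edges; none of these names come from the Arrow build.
# Each unit lists the units it depends on directly.
UNIT_DEPS = {
    "buffer": [],
    "array": ["buffer"],
    "ipc": ["array"],
    "csv": ["array"],
}

# Coarse layout: every test links the whole library (all units).
COARSE_TESTS = {name + "_test": list(UNIT_DEPS) for name in UNIT_DEPS}

# Fine-grained layout: each test depends only on the unit it exercises.
FINE_TESTS = {name + "_test": [name] for name in UNIT_DEPS}


def transitive_deps(unit, graph):
    """Return the unit plus everything it depends on, directly or indirectly."""
    seen = set()
    stack = [unit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen


def affected_tests(changed_unit, tests, graph):
    """Return the sorted names of tests that must be rebuilt after a change."""
    return sorted(
        test
        for test, direct in tests.items()
        if any(changed_unit in transitive_deps(d, graph) for d in direct)
    )


if __name__ == "__main__":
    print(affected_tests("csv", COARSE_TESTS, UNIT_DEPS))
    print(affected_tests("csv", FINE_TESTS, UNIT_DEPS))
```

With these made-up edges, a change to csv touches all four tests in the
coarse layout but only csv_test in the fine-grained one. A change to
buffer still touches everything, since every other unit depends on it.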

On Tue, Sep 17, 2019 at 11:14 PM Uwe L. Korn <uw...@xhochy.com> wrote:

> Hello,
>
> I can think of two other alternatives that make it more visible what Arrow
> core is and what are the optional components:
>
> * Error out when no component is selected instead of building just the
> core Arrow. Here we could add an explanatory message that lists all
> components and, for each component, 2-3 words on what it does and what it
> requires. This would make the first-time experience much better.
> * Split the CMake project into several subprojects. By correctly
> structuring the CMakefiles, we should be able to separate out the Arrow
> components into separate CMake projects that can be built independently if
> needed while all using the same third-party toolchain. We would still have
> a top-level CMakeLists.txt that is invoked just like the current one, but
> with subprojects you would no longer be bound to use the
> single top-level one. This would also have some benefit for packagers that
> could separate out the build of individual Arrow modules. Furthermore, it
> would also make it easier for PoC/academic projects to just take the Arrow
> Core sources and drop them in as a CMake subproject; while this is not a good
> solution for production-grade software, it is quite common practice to do
> this in research.
> I really like this approach and I think this is something we should have
> as a long-term target. I'm also happy to implement it given the time, but I
> think one CMake refactor per year is the maximum I can do and that was
> already eaten up by the dependency detection. Also, I'm unsure about how
> much this would block us at the moment vs. the marketing benefit of having a
> more modular Arrow; currently I'm leaning toward the view that the
> marketing/adoption benefit would be much larger, but we lack someone
> frustration-tolerant to do the refactoring.
>
> Uwe
>
> On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> > hi folks,
> >
> > Lately there seem to be more and more people suggesting that the
> > optional components in the Arrow C++ project are getting in the way of
> > using the "core" which implements the columnar format and IPC
> > protocol. I am not sure I agree with this argument, but in general I
> > think it would be a good idea to make all optional components in the
> > project "opt in" rather than "opt out".
> >
> > To demonstrate where things currently stand, I created a Dockerfile to
> > try to make the smallest possible and most dependency-free build
> >
> > https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
> >
> > Here is the output of this build
> >
> > https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
> >
> > First, let's look at the CMake invocation
> >
> > cmake .. -DBOOST_SOURCE=BUNDLED \
> > -DARROW_BOOST_USE_SHARED=OFF \
> > -DARROW_COMPUTE=OFF \
> > -DARROW_DATASET=OFF \
> > -DARROW_JEMALLOC=OFF \
> > -DARROW_JSON=ON \
> > -DARROW_USE_GLOG=OFF \
> > -DARROW_WITH_BZ2=OFF \
> > -DARROW_WITH_ZLIB=OFF \
> > -DARROW_WITH_ZSTD=OFF \
> > -DARROW_WITH_LZ4=OFF \
> > -DARROW_WITH_SNAPPY=OFF \
> > -DARROW_WITH_BROTLI=OFF \
> > -DARROW_BUILD_UTILITIES=OFF
> >
> > Aside from the issue of how to obtain and link Boost, here's a couple of
> > things:
> >
> > * COMPUTE and DATASET IMHO should be off by default
> > * All compression libraries should be turned off
> > * GLOG should be off by default
> > * Utilities should be off (they are used for integration testing)
> > * Jemalloc should probably be off, but we should make it clear that
> > opting in will yield better performance
> >
> > I found that it wasn't possible to set ARROW_JSON=OFF without breaking
> > the build. I opened ARROW-6590 to fix this.
> >
> > Aside from potentially changing these defaults, there's some things in
> > the build that we might want to turn into optional pieces:
> >
> > * We should see if we can make boost::filesystem not mandatory in the
> > barebones build, if only to satisfy the peanut gallery
> > * double-conversion is used in the CSV module. I think that
> > double-conversion_ep and the CSV module should both be made opt-in
> > * rapidjson_ep should be made optional. JSON support is only needed
> > for integration testing
> >
> > We could also discuss vendoring flatbuffers.h so that flatbuffers_ep
> > is not mandatory.
> >
> > In general, enabling optional components is primarily relevant for
> > packagers. If we implement these changes, a number of package build
> > scripts will have to change.
> >
> > Thanks,
> > Wes
> >
>
