I think these are both interesting areas to explore further. I'd like
to focus on a couple of immediate items I think we should address:

* Should optional components be "opt in", "opt out", or a mix?
Currently it's a mix, and that's confusing for people. I think we
should make them all "opt in".
* Do we want to bring the out-of-the-box core build down to zero
dependencies, including not depending on boost::filesystem and
possibly checking in the compiled Flatbuffers files? While it may be
slightly more maintenance work, I think the optics of a
"dependency-free" core build would be beneficial and help the project
marketing-wise.

Both of these issues must be addressed whether we undertake a Bazel
implementation or some other refactor of the C++ build system.
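
To make the "opt in" idea concrete, here is a rough sketch (in Python, purely
for illustration) of what a build command could look like if every optional
component defaulted to OFF. The component names are taken from the -DARROW_*
flags in Wes's CMake invocation quoted below; the helper function itself is
hypothetical and not part of the Arrow build system.

```python
# Optional components, named after the -DARROW_* flags in the quoted CMake
# invocation. This helper is a hypothetical illustration only; it is not part
# of Arrow's build system.
OPTIONAL_COMPONENTS = (
    "ARROW_COMPUTE",
    "ARROW_DATASET",
    "ARROW_JEMALLOC",
    "ARROW_USE_GLOG",
    "ARROW_WITH_BZ2",
    "ARROW_WITH_ZLIB",
    "ARROW_WITH_ZSTD",
    "ARROW_WITH_LZ4",
    "ARROW_WITH_SNAPPY",
    "ARROW_WITH_BROTLI",
    "ARROW_BUILD_UTILITIES",
)


def opt_in_cmake_command(selected=()):
    """Build a cmake command line, assuming all optional components default to OFF."""
    unknown = sorted(set(selected) - set(OPTIONAL_COMPONENTS))
    if unknown:
        # Fail loudly on typos instead of silently building only the core.
        raise ValueError("unknown component(s): " + ", ".join(unknown))
    # Only the components the user asked for appear on the command line.
    flags = [f"-D{name}=ON" for name in OPTIONAL_COMPONENTS if name in selected]
    return " ".join(["cmake", ".."] + flags)


if __name__ == "__main__":
    print(opt_in_cmake_command())
    print(opt_in_cmake_command(["ARROW_WITH_ZLIB", "ARROW_COMPUTE"]))
```

With opt-in defaults, the core-only build needs no flags at all, and each
extra component costs exactly one flag, rather than a long list of =OFF
settings.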

On Wed, Sep 18, 2019 at 2:48 AM Uwe L. Korn <uw...@xhochy.com> wrote:
>
> Hello Micah,
>
> I don't think we have explored using Bazel yet. I would see it as a possible
> modular alternative, but as you mention it will be a lot of work and we would
> probably need a mentor who is familiar with Bazel; otherwise we will probably
> end up spending too much time on this and get a non-typical Bazel setup.
>
> Uwe
>
> On Wed, Sep 18, 2019, at 8:44 AM, Micah Kornfield wrote:
> > It has come up in the past, but I wonder if exploring Bazel as a build
> > system with its very explicit dependency graph might help (I'm not sure
> > if something similar is available in CMake).
> >
> > This is also a lot of work, but could also potentially benefit the
> > developer experience because we can make unit tests depend on individual
> > compilable units instead of all of libarrow.  There are trade-offs here as
> > well in terms of public API coverage.
> >
> > On Tue, Sep 17, 2019 at 11:14 PM Uwe L. Korn <uw...@xhochy.com> wrote:
> >
> > > Hello,
> > >
> > > I can think of two other alternatives that make it more visible what Arrow
> > > core is and what are the optional components:
> > >
> > > * Error out when no component is selected instead of building just the
> > > core Arrow. Here we could add an explanatory message that lists all
> > > components and, for each component, says in 2-3 words what it does and
> > > what it requires. This would make the first-time experience much better.
> > > * Split the CMake project into several subprojects. By correctly
> > > structuring the CMakefiles, we should be able to separate out the Arrow
> > > components into separate CMake projects that can be built independently if
> > > needed while all using the same third-party toolchain. We would still have
> > > a top-level CMakeLists.txt that is invoked just like the current one but
> > > by having subprojects, you would no longer be bound to use the
> > > single top-level one. This would also have some benefit for packagers that
> > > could separate out the build of individual Arrow modules. Furthermore, it
> > > would also make it easier for PoC/academic projects to just take the Arrow
> > > Core sources and drop it in as a CMake subproject; while this is not a good
> > > solution for production-grade software, it is quite common practice to do
> > > this in research.
> > > I really like this approach and I think this is something we should have
> > > as a long-term target. I'm also happy to implement it given the time, but I
> > > think one CMake refactor per year is the maximum I can do and that was
> > > already eaten up by the dependency detection. Also, I'm unsure about how
> > > much this would block us at the moment vs the marketing benefit of having a
> > > more modular Arrow; currently I'm leaning towards the view that the
> > > marketing/adoption benefit would be much larger, but we lack someone
> > > frustration-tolerant to do the refactoring.
> > >
> > > Uwe
> > >
> > > On Wed, Sep 18, 2019, at 12:18 AM, Wes McKinney wrote:
> > > > hi folks,
> > > >
> > > > Lately there seem to be more and more people suggesting that the
> > > > optional components in the Arrow C++ project are getting in the way of
> > > > using the "core" which implements the columnar format and IPC
> > > > protocol. I am not sure I agree with this argument, but in general I
> > > > think it would be a good idea to make all optional components in the
> > > > project "opt in" rather than "opt out".
> > > >
> > > > To demonstrate where things currently stand, I created a Dockerfile to
> > > > try to make the smallest possible and most dependency-free build
> > > >
> > > > https://github.com/wesm/arrow/tree/cpp-minimal-dockerfile/dev/cpp_minimal
> > > >
> > > > Here is the output of this build
> > > >
> > > > https://gist.github.com/wesm/02328fbb463033ed486721b8265f755f
> > > >
> > > > First, let's look at the CMake invocation
> > > >
> > > > cmake .. -DBOOST_SOURCE=BUNDLED \
> > > > -DARROW_BOOST_USE_SHARED=OFF \
> > > > -DARROW_COMPUTE=OFF \
> > > > -DARROW_DATASET=OFF \
> > > > -DARROW_JEMALLOC=OFF \
> > > > -DARROW_JSON=ON \
> > > > -DARROW_USE_GLOG=OFF \
> > > > -DARROW_WITH_BZ2=OFF \
> > > > -DARROW_WITH_ZLIB=OFF \
> > > > -DARROW_WITH_ZSTD=OFF \
> > > > -DARROW_WITH_LZ4=OFF \
> > > > -DARROW_WITH_SNAPPY=OFF \
> > > > -DARROW_WITH_BROTLI=OFF \
> > > > -DARROW_BUILD_UTILITIES=OFF
> > > >
> > > > Aside from the issue of how to obtain and link Boost, here's a couple of
> > > > things:
> > > >
> > > > * COMPUTE and DATASET IMHO should be off by default
> > > > * All compression libraries should be turned off
> > > > * GLOG should be off by default
> > > > * Utilities should be off (they are used for integration testing)
> > > > * Jemalloc should probably be off, but we should make it clear that
> > > > opting in will yield better performance
> > > >
> > > > I found that it wasn't possible to set ARROW_JSON=OFF without breaking
> > > > the build. I opened ARROW-6590 to fix this.
> > > >
> > > > Aside from potentially changing these defaults, there are some things in
> > > > the build that we might want to turn into optional pieces:
> > > >
> > > > * We should see if we can make boost::filesystem not mandatory in the
> > > > barebones build, if only to satisfy the peanut gallery
> > > > * double-conversion is used in the CSV module. I think that
> > > > double-conversion_ep and the CSV module should both be made opt-in
> > > > * rapidjson_ep should be made optional. JSON support is only needed
> > > > for integration testing
> > > >
> > > > We could also discuss vendoring flatbuffers.h so that flatbuffers_ep
> > > > is not mandatory.
> > > >
> > > > In general, enabling optional components is primarily relevant for
> > > > packagers. If we implement these changes, a number of package build
> > > > scripts will have to change.
> > > >
> > > > Thanks,
> > > > Wes
> > > >
> > >
> >
