Hi all,

I wanted to bring up a topic that has no perfect solution, but that has been 
bothering the team at Berkeley for a while: consolidating Spark's build system. 
Right now we have two build systems, Maven and SBT, that need to 
be maintained together on each change. We added Maven a while back to try it as 
an alternative to SBT and to get some better publishing options, like Debian 
packages and classifiers, but we've found that 1) SBT has actually been fairly 
stable since then (unlike the rapid release cycle before) and 2) classifiers 
don't actually seem to work for publishing versions of Spark with different 
dependencies (you need to give them different artifact names). More importantly 
though, because maintaining two systems is confusing, it would be good to 
converge to just one soon, or to find a better way of maintaining the builds.

In terms of which system to go for, neither is perfect, but I think many of us 
are leaning toward SBT, because it's noticeably faster and it has less code to 
maintain. If we do this, however, I'd really like to understand the use cases 
for Maven, and make sure that either we can support them in SBT or we can do 
them externally. Can people say a bit about that? The ones I've thought of are 
the following:

- Debian packaging -- this is certainly nice, but there are some plugins for 
SBT too, so it may be possible to migrate.
- BigTop integration; I'm not sure how much this relies on Maven but Cos has 
been using it.
- Classifiers for hadoop1 and hadoop2 -- as far as I can tell, these don't 
really work if you want to publish to Maven Central; you still need two 
artifact names because the artifacts have different dependencies. However, more 
importantly, we'd like to make Spark work with all Hadoop versions by using 
hadoop-client and a bit of reflection, similar to how projects like Parquet 
handle this.
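
To make the reflection idea in the last bullet a bit more concrete, here is a 
minimal Java sketch. It is not Spark or Parquet code: the VersionShim class and 
the com.example.hadoop2.OnlyInNewVersion name are made-up stand-ins, and 
java.util.ArrayList plays the role of the class that exists in every version. 
The point is only that the code looks up classes by name at runtime instead of 
linking against one specific version at compile time.

```java
import java.lang.reflect.Constructor;

public class VersionShim {
    /** Returns the first candidate class that is present on the classpath. */
    public static Class<?> firstAvailable(String... candidates)
            throws ClassNotFoundException {
        for (String name : candidates) {
            try {
                return Class.forName(name);
            } catch (ClassNotFoundException e) {
                // Not present in the library version we are running against;
                // fall through and try the next candidate.
            }
        }
        throw new ClassNotFoundException("none of the candidate classes found");
    }

    /** Instantiates a class through its no-argument constructor. */
    public static Object newInstance(Class<?> cls)
            throws ReflectiveOperationException {
        Constructor<?> ctor = cls.getDeclaredConstructor();
        return ctor.newInstance();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical names: the first stands in for a class that only
        // exists in a newer library version, the second for the fallback.
        Class<?> cls = firstAvailable(
                "com.example.hadoop2.OnlyInNewVersion",
                "java.util.ArrayList");
        Object instance = newInstance(cls);
        System.out.println(cls.getName()); // prints "java.util.ArrayList"
    }
}
```

With something along these lines, a single artifact could be compiled once and 
pick the right code path for whichever Hadoop version is on the classpath, at 
the cost of losing compile-time checks on the reflected calls.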

Are there other use cases I'm missing here, or other ways to handle this 
problem? For example, one possibility would be to keep the Maven build 
scripts in a separate repo managed by the people who want to use them, or to 
have some dedicated maintainers for them. But because keeping the two builds in 
sync is a recurring issue, I do think it would be simpler for the project to 
have one build system in the long term. In either case though, we will keep the 
project structure compatible 
with Maven, so people who want to use it internally should be fine; I think 
that we've done this well and, if anything, we've simplified the Maven build 
process lately by removing Twirl.

Anyway, as I said, I don't think any solution is perfect here, but I'm curious 
to hear your input.

Matei
