Hi all,

I wanted to bring up a topic that there isn't a 100% perfect solution for, but that's been bothering the team at Berkeley for a while: consolidating Spark's build system. Right now we have two build systems, Maven and SBT, that need to be maintained together on each change. We added Maven a while back to try it as an alternative to SBT and to get some better publishing options, like Debian packages and classifiers, but we've found that 1) SBT has actually been fairly stable since then (unlike the rapid release cycle before) and 2) classifiers don't actually seem to work for publishing versions of Spark with different dependencies (you need to give them different artifact names). More importantly though, because maintaining two systems is confusing, it would be good to converge to just one soon, or to find a better way of maintaining the builds.

In terms of which system to go for, neither is perfect, but I think many of us are leaning toward SBT, because it's noticeably faster and it has less code to maintain. If we do this, however, I'd really like to understand the use cases for Maven, and make sure that either we can support them in SBT or we can do them externally. Can people say a bit about that? The ones I've thought of are the following:

- Debian packaging -- this is certainly nice, but there are some plugins for SBT too, so it may be possible to migrate.
- BigTop integration -- I'm not sure how much this relies on Maven, but Cos has been using it.
- Classifiers for hadoop1 and hadoop2 -- as far as I can tell, these don't really work if you want to publish to Maven Central; you still need two artifact names because the artifacts have different dependencies. More importantly, we'd like to make Spark work with all Hadoop versions by using hadoop-client and a bit of reflection, similar to how projects like Parquet handle this.

Are there other use cases I'm missing, or other ways to handle this problem? For example, one possibility would be to keep the Maven build scripts in a separate repo managed by the people who want to use them, or to have some dedicated maintainers for them. But because this is a recurring issue, I do think it would be simpler for the project to have one build system in the long term. In either case, we will keep the project structure compatible with Maven, so people who want to use it internally should be fine; I think that we've done this well and, if anything, we've simplified the Maven build process lately by removing Twirl.

Anyway, as I said, I don't think any solution is perfect here, but I'm curious to hear your input.

Matei
