Currently we create a SparkMahoutContext and use “mahout -spark classpath” to create the SparkContext. The SparkConf is also accessed directly. If we move to using spark-submit for launching the Mahout Shell and other drivers, we would need to refactor some of this and change the mahout script. It seems desirable to have the driver code create the Spark context and rely on spark-submit for any config overrides and params. This implies the possible removal (not sure about this) of SparkMahoutContext. In general it would be nice if this were done outside of Mahout, or limited to the drivers and shell. Mahout has become a library that is designed to be backend independent. This code was designed before that became a goal, and I do not understand it well enough to say how much work would be involved or what would replace it.
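To make the idea of a driver that owns its context concrete, here is a minimal sketch using only plain Spark classes. It is not Mahout code: the object name ExampleDriver and the app name are hypothetical, and the Kryo setting is just an illustrative default. The point it shows is that a SparkConf created with no arguments picks up the spark.* properties that spark-submit provides, so the driver only fills in defaults and leaves overrides to the launcher.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver; the object name and app name are illustrative only.
object ExampleDriver {
  def main(args: Array[String]): Unit = {
    // new SparkConf() loads spark.* system properties, which is how values
    // passed to spark-submit (--master, --conf, spark-defaults.conf) arrive.
    val conf = new SparkConf()
      // setIfMissing only supplies a default; anything already set by
      // spark-submit is left untouched.
      .setIfMissing("spark.app.name", "example-mahout-driver")
      .setIfMissing("spark.serializer",
        "org.apache.spark.serializer.KryoSerializer")

    // No master is set here on purpose: it is expected to come from
    // spark-submit, and SparkContext fails fast if it is missing.
    val sc = new SparkContext(conf)
    try {
      // Placeholder for the real driver work.
      val n = sc.parallelize(1 to 10).count()
      println(s"counted $n elements")
    } finally {
      sc.stop()
    }
  }
}
```

Launched with something like "spark-submit --class ExampleDriver --master local[2] --conf spark.executor.memory=2g example.jar" (the jar name is hypothetical), all configuration changes are made on the command line and none through environment variables.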
The code refactoring needed is not well understood, by me at least. But intuition says that with a growing number of backends it might be good to clean up the Spark dependencies for context management. This has also been a bit of a problem in creating apps that use Mahout, since typical spark-submit use cannot be relied on to make config changes; they must be made in environment variables only. This arguably non-standard manipulation of the context puts limitations and hidden assumptions into using Mahout as a library. Doing all of this implies a fairly large bit of work, I think. The benefits are that it would be clearer how to use Mahout as a library, and that some unneeded code would be cleaned up. I’m not sure I have enough time to do all of this myself. This isn’t so much a proposal as a call for discussion.
