Currently we create a SparkMahoutContext, and use “mahout -spark classpath” to 
create the SparkContext. The SparkConf is also accessed directly. If we move to 
using spark-submit for launching the Mahout shell and other drivers, we would 
need to refactor some of this and change the mahout script. It seems desirable 
to have the driver code create the Spark context and rely on spark-submit for 
any config overrides and params. This implies the possible removal (not sure 
about this) of SparkMahoutContext. In general it would be nice if this were 
done outside of Mahout, or limited to the drivers and shell. Mahout has become 
a library that is designed to be backend independent. This code was designed 
before that became a goal, and I do not fully grasp how much work would be 
involved or what would replace it.
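
To make the idea concrete, here is a minimal sketch of a driver that creates
its own context and leaves configuration to spark-submit. This is only an
illustration, not existing Mahout code: the object name ExampleDriver and the
toy job are made up, and it assumes a plain Spark dependency on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical driver, not part of Mahout.
object ExampleDriver {
  def main(args: Array[String]): Unit = {
    // new SparkConf() loads any spark.* system properties, which is how
    // settings given to spark-submit (--master, --conf, ...) reach the driver.
    val conf = new SparkConf()

    // Only supply a default; a name given to spark-submit still wins.
    conf.setIfMissing("spark.app.name", "ExampleDriver")

    val sc = new SparkContext(conf)
    try {
      // Toy job standing in for real driver logic.
      val total = sc.parallelize(1 to 100).sum()
      println(s"total = $total")
    } finally {
      sc.stop()
    }
  }
}
```

With something like this, a launch such as "spark-submit --class ExampleDriver
--master local[2] --conf spark.executor.memory=2g example-driver.jar" (the jar
name is made up) would carry all config changes, with no environment variables
involved.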

The code refactoring needed is not well understood, by me at least. But 
intuition says that with a growing number of backends it might be good to clean 
up the Spark dependencies for context management. This has also been a bit of a 
problem in creating apps that use Mahout, since typical spark-submit use cannot 
be relied on to make config changes; they must be made in environment variables 
only. This arguably non-standard manipulation of the context puts limitations 
and hidden assumptions into using Mahout as a library.

Doing all of this implies a fairly large bit of work, I think. The benefits are 
that it would be clearer how to use Mahout as a library and that some unneeded 
code would be cleaned up. I’m not sure I have enough time to do all of this myself. 

This isn’t so much a proposal as a call for discussion.

