On Sat, Nov 28, 2015 at 10:55 AM, Pat Ferrel <[email protected]> wrote:
> I use spark-submit also to launch apps that use Mahout so not sure what
> assumptions you are talking about.

OK, so if it works, what's the problem? I am lost. I am talking about
assumptions that anything dealing with context needs to be changed or even
removed.

> The first thing is to use spark-submit in our own launch script.

What script would that be?

> The current code calls the CLI mahout script to get classpath info, this
> should be passed in to the

Which code? Mahout context creation? As I said, you can customize that
behavior. You can tell it not to look for standard jars and get your own
jars into the classpath. It should be flexible enough to handle any startup
situation.

> spark-submit so if we launch with spark-submit I think the call of the
> mahout script would be unnecessary. This makes it more straightforward to
> use with Yarn cluster mode where the client/driver is launched on some
> cluster machine where there would be no script to call.

Again, see comment above. Yes, I did submits to YARN and standalone, you
name it. It is all good.

> If the SparkMahoutContext is a hard requirement that’s fine.

Every single operation uses a context (which essentially wraps the backend
context). It is not passed in; it is implied by a dataset parameter. No
physical operator can work without it. For the most part, the context is
required because the backend engines require a session equivalent of it
(SparkContext in Spark's case). This is more of a hard requirement on the
backend side.

> As I said, I don’t understand all of those ramifications.
>
> On Nov 27, 2015, at 8:25 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> I do submits all the time, don't see any problem. It is part of my standard
> stress test harness.
>
> Mahout context is conceptual and cannot be removed, nor is it required to
> be removed in order to run submitted jobs. Submission and contexts are two
> completely separate concepts.
> One can submit a job that, for example, doesn't set up a Spark job at all
> and runs, for example, an MR job, or just manipulates some HDFS
> directories, or sets up multiple jobs or combinations of all of the above.
> All submission means is sending an uber jar to an application server and
> launching a main class there, instead of doing the same locally. Not sure
> where all these assumptions are coming from.
>
> On Nov 27, 2015 11:33 AM, "Pat Ferrel" <[email protected]> wrote:
>
> > Currently we create a SparkMahoutContext, and use “mahout -spark
> > classpath” to create the SparkContext. The SparkConf is also directly
> > accessed. If we move to using spark-submit for launching the Mahout Shell
> > and other drivers we would need to refactor some of this and change the
> > mahout script. It seems desirable to have any driver code create the
> > Spark context and rely on spark-submit for any config overrides and
> > params. This implies the possible removal (not sure about this) of
> > SparkMahoutContext. In general it would be nice if this were done outside
> > of Mahout, or limited to the drivers and shell. Mahout has become a
> > library that is designed to be backend independent. This code was
> > designed before this became a goal, and it is beyond my understanding how
> > much work would be involved and what would replace it.
> >
> > The code refactoring needed is not well understood, by me at least. But
> > intuition says that with a growing number of backends it might be good to
> > clean up the Spark dependencies for context management. This has also
> > been a bit of a problem in creating apps that use Mahout since typical
> > spark-submit use cannot be relied on to make config changes; they must be
> > made in environment variables only. This arguably non-standard
> > manipulation of the context puts limitations and hidden assumptions into
> > using Mahout as a library.
> >
> > Doing all of this implies a fairly large bit of work, I think. The
> > benefit is that it will be clearer how to use Mahout as a library, and
> > some unneeded code gets cleaned up. I’m not sure I have enough time to do
> > all of this myself.
> >
> > This isn’t so much a proposal as a call for discussion.
