The submit code is the only place that documents which options are needed by 
clients, AFAICT. It is pretty complicated and heavily laden with checks for 
which cluster manager is being used. I’d feel a lot better if we were using it. 
There is no way any of us are going to be able to test on all those configurations.

spark-env.sh is mostly for launching the cluster, not the client, but there seem 
to be exceptions like executor memory.
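
For the executor-memory case, for example, something this small on the client 
side would pick it up if the variable is in the environment (just a sketch):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
    // honor SPARK_EXECUTOR_MEMORY, a spark-env.sh style setting the client needs
    sys.env.get("SPARK_EXECUTOR_MEMORY")
      .foreach(conf.setIfMissing("spark.executor.memory", _))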


On Nov 11, 2014, at 2:18 PM, Dmitriy Lyubimov <[email protected]> wrote:

These files, if I read it correctly, are for spawning yet another process. I
don't see how that would work for the shell.

I am also not convinced that spark-env is important for the client.


On Tue, Nov 11, 2014 at 2:09 PM, Pat Ferrel <[email protected]> wrote:

> I was thinking -Dx=y too, seems like a good idea.
> 
> But we should also support setting them the way Spark documents in
> spark-env.sh, and the two links Andrew found may solve that in a
> maintainable way. Maybe we get the SparkConf from a new mahoutSparkConf
> function, which handles all env-supplied setup. For the drivers it can be
> done in the base class, allowing any CLI overrides later. Then the SparkConf
> is finally passed into mahoutSparkContext, where as little as possible is
> changed in the conf.
> 
> I’ll look at this for the drivers. Should be easy to add to the shell.
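> 
> Roughly what I have in mind for mahoutSparkConf (just a sketch; the name and
> what it reads from the environment are illustrative):
> 
>     import org.apache.spark.SparkConf
> 
>     // Build the conf once from the environment; drivers and the shell then
>     // layer CLI overrides on top before handing it to mahoutSparkContext.
>     def mahoutSparkConf(): SparkConf = {
>       // new SparkConf() already folds in any -Dspark.x=y system properties
>       val conf = new SparkConf()
>       // ...plus the handful of spark-env.sh variables that matter on the
>       // client, e.g. executor memory...
>       conf
>     }
> 
>     // later, in a driver base class (illustrative):
>     //   val conf = mahoutSparkConf()
>     //   cliOverrides.foreach { case (k, v) => conf.set(k, v) }
>     //   // and finally pass conf to mahoutSparkContext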
> 
> On Nov 11, 2014, at 12:36 PM, Dmitriy Lyubimov <[email protected]> wrote:
> 
> IMO you just need to modify `mahout spark-shell` to propagate -Dx=y
> parameters to the java startup call and all should be fine.
> 
> On Tue, Nov 11, 2014 at 12:23 PM, Andrew Palumbo <[email protected]>
> wrote:
> 
>> I've run into this problem starting $ mahout spark-shell, i.e. needing
>> to set spark.kryoserializer.buffer.mb and spark.akka.frameSize. I've
>> been temporarily hard-coding them for now while developing.
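>>
>> Roughly what I mean (values here are just placeholders; using setIfMissing
>> at least lets a -Dspark.x=y passed at startup win over the hard-coded
>> default):
>>
>>     import org.apache.spark.SparkConf
>>
>>     val conf = new SparkConf()  // picks up any -Dspark.* system properties
>>     conf.setIfMissing("spark.kryoserializer.buffer.mb", "200")  // placeholder
>>     conf.setIfMissing("spark.akka.frameSize", "128")            // placeholder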
>> 
>> I'm just getting familiar with what you've done with the CLI drivers. For
>> #2, could we borrow option-parsing code/methods from Spark [1] [2] at each
>> Spark release and somehow add this to MahoutOptionParser.parseSparkOptions?
>>
>> I'll hopefully be doing some CLI work soon and have a better understanding.
>> 
>> [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitDriverBootstrapper.scala
>> [2] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala
>> 
>>> From: [email protected]
>>> Subject: Spark options
>>> Date: Wed, 5 Nov 2014 09:48:59 -0800
>>> To: [email protected]
>>> 
>>> Spark has a launch script as Hadoop does. We use the Hadoop launcher
>>> script but not the Spark one. When starting up your Spark cluster there is
>>> a spark-env.sh script that can set a bunch of environment variables. In our
>>> own mahoutSparkContext function, which takes the place of the Spark submit
>>> script and launcher, we don’t account for most of the environment variables.
>>> 
>>> Unless I missed something, this means most of the documented options will
>>> be ignored unless a user of Mahout parses and sets them in their own
>>> SparkConf. The Mahout CLI drivers don’t do this for all possible options,
>>> only supporting a few like job name and spark.executor.memory.
>>> 
>>> The question is how to best handle these Spark options. There seem to be
>>> two options:
>>> 1) use Spark’s launch mechanism for drivers but allow some to be
>>> overridden in the CLI
>>> 2) add parsing of the env for options and set up the SparkConf defaults in
>>> mahoutSparkContext with those variables.
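>>>
>>> For #2, a minimal sketch of the kind of mapping I mean (the variable list
>>> is illustrative, not exhaustive):
>>>
>>>     import org.apache.spark.SparkConf
>>>
>>>     // spark-env.sh style variables we might honor on the client side
>>>     val envToConfKey = Map(
>>>       "SPARK_EXECUTOR_MEMORY" -> "spark.executor.memory",
>>>       "SPARK_DRIVER_MEMORY"   -> "spark.driver.memory")
>>>
>>>     def applyEnvDefaults(conf: SparkConf): SparkConf = {
>>>       for ((envVar, key) <- envToConfKey; value <- sys.env.get(envVar))
>>>         conf.setIfMissing(key, value)
>>>       conf
>>>     }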
>>> 
>>> The downside of #2 is that as variables change we’ll have to reflect
>>> those in our code. I forget why #1 is not an option, but Dmitriy has been
>>> consistently against this; in any case it would mean a fair bit of
>>> refactoring, I believe.
>>> 
>>> Any opinions or corrections?
