Re: spark-defaults.conf optimal configuration

2015-12-09 Thread cjrumble
Hello Neelesh,

Thank you for the checklist for determining the correct configuration of
Spark. I will go through these and let you know if I have further questions. 

Regards,

Chris 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-defaults-conf-optimal-configuration-tp25641p25649.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: spark-defaults.conf optimal configuration

2015-12-08 Thread nsalian
Hi Chris,

Thank you for posting the question.
Tuning Spark configurations is a tricky task since there are a lot of factors
to consider.
The configurations that you listed cover most of them.

To understand your situation and guide the tuning decisions, a few questions:
1) What kind of spark applications are you intending to run?
2) What cluster manager have you decided to go with? 
3) How frequent are these applications going to run? (For the sake of
scheduling)
4) Is this used by multiple users? 
5) What else do you have in the cluster that will interact with Spark? (For
the sake of resolving dependencies)
Personally, I would suggest answering these questions prior to jumping into
tuning.
A cluster manager like YARN would help you understand the settings for cores
and memory, since the applications have to be considered for scheduling.

Hope that helps to start off in the right direction.
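For illustration only, a minimal spark-defaults.conf sketch for a YARN-managed
cluster with dynamic allocation; the numbers are placeholders to be derived
from your own node sizes and workloads, not recommendations:

  spark.master                        yarn-client
  spark.executor.memory               4g
  spark.executor.cores                2
  spark.yarn.executor.memoryOverhead  512
  spark.dynamicAllocation.enabled     true
  spark.shuffle.service.enabled       true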





-
Neelesh S. Salian
Cloudera
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-defaults-conf-optimal-configuration-tp25641p25642.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




spark-defaults.conf optimal configuration

2015-12-08 Thread cjrumble
I am seeking help with a Spark configuration for running queries against a
cluster of 6 machines. Each machine has Spark 1.5.1, with slaves started on all
6 and 1 also acting as master/thriftserver. From Beeline I query 2 tables that
have 300M and 31M rows respectively. The same queries return up to 500M rows
when run against Oracle, but Spark errors out at anything more than 5.5M rows.

I believe there is an optimal memory configuration that must be set for each
of the workers in our cluster, but I have not been able to determine that
setting. Is there something better than trial and error? Are there settings
to avoid, such as making sure not to set spark.driver.maxResultSize >
spark.driver.memory?

Is there a formula or guideline for calculating the correct Spark
configuration values given a machine's available cores and memory
resources?

This is my current configuration:
BDA v3 server : SUN SERVER X4-2L
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
CPU cores : 32
GB of memory (>=63): 63
number of disks : 12

spark-defaults.conf:

spark.driver.memory                     20g
spark.executor.memory                   40g
spark.executor.extraJavaOptions         -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.rpc.askTimeout                    6000s
spark.rpc.lookupTimeout                 3000s
spark.driver.maxResultSize              20g
spark.rdd.compress                      true
spark.storage.memoryFraction            1
spark.core.connection.ack.wait.timeout  600
spark.akka.frameSize                    500
spark.shuffle.compress                  true
spark.shuffle.file.buffer               128k
spark.shuffle.memoryFraction            0
spark.shuffle.spill.compress            true
spark.shuffle.spill                     true
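As a rough, unvalidated illustration of the sizing arithmetic often used
(assuming one executor per node and that the driver/thriftserver is not
competing for the same memory):

  # 63 GB physical RAM per node
  #   - a few GB for the OS and the Spark worker daemon  -> ~58 GB usable
  #   - ~10% headroom for JVM/off-heap overhead          -> ~52 GB for spark.executor.memory
  # 32 cores per node
  #   - leave 1-2 cores for the OS                       -> ~30 cores for executors

Also note that spark.driver.maxResultSize and spark.driver.memory bound how many
rows can be pulled back through the driver/thriftserver, so collecting hundreds
of millions of rows through Beeline may fail regardless of executor settings;
writing such results out (e.g. to HDFS) usually scales better.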

Thank you,

Chris



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-defaults-conf-optimal-configuration-tp25641.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: spark-submit not using conf/spark-defaults.conf

2015-09-03 Thread Davies Liu
I think it's a missing feature.

On Wed, Sep 2, 2015 at 10:58 PM, Axel Dahl <a...@whisperstream.com> wrote:
> So a bit more investigation, shows that:
>
> if I have configured spark-defaults.conf with:
>
> "spark.files  library.py"
>
> then if I call
>
> "spark-submit.py -v test.py"
>
> I see that my "spark.files" default option has been replaced with
> "spark.files  test.py",  basically spark-submit is overwriting
> spark.files with the name of the script.
>
> Is this a bug or is there another way to add default libraries without
> having to specify them on the command line?
>
> Thanks,
>
> -Axel
>
>
>
> On Wed, Sep 2, 2015 at 10:34 PM, Davies Liu <dav...@databricks.com> wrote:
>>
>> This should be a bug, could you create a JIRA for it?
>>
>> On Wed, Sep 2, 2015 at 4:38 PM, Axel Dahl <a...@whisperstream.com> wrote:
>> > in my spark-defaults.conf I have:
>> > spark.files   file1.zip, file2.py
>> > spark.master   spark://master.domain.com:7077
>> >
>> > If I execute:
>> > bin/pyspark
>> >
>> > I can see it adding the files correctly.
>> >
>> > However if I execute
>> >
>> > bin/spark-submit test.py
>> >
>> > where test.py relies on file1.zip, I get an error.
>> >
>> > If I instead execute
>> >
>> > bin/spark-submit --py-files file1.zip test.py
>> >
>> > It works as expected.
>> >
>> > How do I get spark-submit to import the spark-defaults.conf file or what
>> > should I start checking to figure out why one works and the other
>> > doesn't?
>> >
>> > Thanks,
>> >
>> > -Axel
>
>




Re: spark-submit not using conf/spark-defaults.conf

2015-09-03 Thread Axel Dahl
logged it here:

https://issues.apache.org/jira/browse/SPARK-10436

On Thu, Sep 3, 2015 at 10:32 AM, Davies Liu <dav...@databricks.com> wrote:

> I think it's a missing feature.
>
> On Wed, Sep 2, 2015 at 10:58 PM, Axel Dahl <a...@whisperstream.com> wrote:
> > So a bit more investigation, shows that:
> >
> > if I have configured spark-defaults.conf with:
> >
> > "spark.files  library.py"
> >
> > then if I call
> >
> > "spark-submit.py -v test.py"
> >
> > I see that my "spark.files" default option has been replaced with
> > "spark.files  test.py",  basically spark-submit is overwriting
> > spark.files with the name of the script.
> >
> > Is this a bug or is there another way to add default libraries without
> > having to specify them on the command line?
> >
> > Thanks,
> >
> > -Axel
> >
> >
> >
> > On Wed, Sep 2, 2015 at 10:34 PM, Davies Liu <dav...@databricks.com>
> wrote:
> >>
> >> This should be a bug, could you create a JIRA for it?
> >>
> >> On Wed, Sep 2, 2015 at 4:38 PM, Axel Dahl <a...@whisperstream.com>
> wrote:
> >> > in my spark-defaults.conf I have:
> >> > spark.files   file1.zip, file2.py
> >> > spark.master   spark://master.domain.com:7077
> >> >
> >> > If I execute:
> >> > bin/pyspark
> >> >
> >> > I can see it adding the files correctly.
> >> >
> >> > However if I execute
> >> >
> >> > bin/spark-submit test.py
> >> >
> >> > where test.py relies on file1.zip, I get an error.
> >> >
> >> > If I instead execute
> >> >
> >> > bin/spark-submit --py-files file1.zip test.py
> >> >
> >> > It works as expected.
> >> >
> >> > How do I get spark-submit to import the spark-defaults.conf file or
> what
> >> > should I start checking to figure out why one works and the other
> >> > doesn't?
> >> >
> >> > Thanks,
> >> >
> >> > -Axel
> >
> >
>


spark-submit not using conf/spark-defaults.conf

2015-09-02 Thread Axel Dahl
in my spark-defaults.conf I have:
spark.files   file1.zip, file2.py
spark.master   spark://master.domain.com:7077

If I execute:
bin/pyspark

I can see it adding the files correctly.

However if I execute

bin/spark-submit test.py

where test.py relies on file1.zip, I get an error.

If I instead execute

bin/spark-submit --py-files file1.zip test.py

It works as expected.

How do I get spark-submit to import the spark-defaults.conf file or what
should I start checking to figure out why one works and the other doesn't?

Thanks,

-Axel
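A possible workaround sketch while this behaviour stands, with placeholder
paths; spark.submit.pyFiles is the conf-file counterpart of --py-files in
recent Spark versions, so this may depend on the version in use:

  # in spark-defaults.conf: ship Python dependencies with every job
  spark.submit.pyFiles   /path/to/file1.zip,/path/to/file2.py

  # or keep passing them explicitly at submit time
  bin/spark-submit --py-files file1.zip test.py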


Re: spark-submit not using conf/spark-defaults.conf

2015-09-02 Thread Davies Liu
This should be a bug, could you create a JIRA for it?

On Wed, Sep 2, 2015 at 4:38 PM, Axel Dahl <a...@whisperstream.com> wrote:
> in my spark-defaults.conf I have:
> spark.files   file1.zip, file2.py
> spark.master   spark://master.domain.com:7077
>
> If I execute:
> bin/pyspark
>
> I can see it adding the files correctly.
>
> However if I execute
>
> bin/spark-submit test.py
>
> where test.py relies on file1.zip, I get an error.
>
> If I instead execute
>
> bin/spark-submit --py-files file1.zip test.py
>
> It works as expected.
>
> How do I get spark-submit to import the spark-defaults.conf file or what
> should I start checking to figure out why one works and the other doesn't?
>
> Thanks,
>
> -Axel




Re: spark-submit not using conf/spark-defaults.conf

2015-09-02 Thread Axel Dahl
So a bit more investigation shows that:

if I have configured spark-defaults.conf with:

"spark.files  library.py"

then if I call

"spark-submit.py -v test.py"

I see that my "spark.files" default option has been replaced with
"spark.files  test.py",  basically spark-submit is overwriting
spark.files with the name of the script.

Is this a bug or is there another way to add default libraries without
having to specify them on the command line?

Thanks,

-Axel



On Wed, Sep 2, 2015 at 10:34 PM, Davies Liu <dav...@databricks.com> wrote:

> This should be a bug, could you create a JIRA for it?
>
> On Wed, Sep 2, 2015 at 4:38 PM, Axel Dahl <a...@whisperstream.com> wrote:
> > in my spark-defaults.conf I have:
> > spark.files   file1.zip, file2.py
> > spark.master   spark://master.domain.com:7077
> >
> > If I execute:
> > bin/pyspark
> >
> > I can see it adding the files correctly.
> >
> > However if I execute
> >
> > bin/spark-submit test.py
> >
> > where test.py relies on file1.zip, I get an error.
> >
> > If I instead execute
> >
> > bin/spark-submit --py-files file1.zip test.py
> >
> > It works as expected.
> >
> > How do I get spark-submit to import the spark-defaults.conf file or what
> > should I start checking to figure out why one works and the other
> doesn't?
> >
> > Thanks,
> >
> > -Axel
>


RE: How to register array class with Kryo in spark-defaults.conf

2015-07-31 Thread Wang, Ningjun (LNG-NPV)
Does anybody have any idea how to solve this problem?

Ningjun

From: Wang, Ningjun (LNG-NPV)
Sent: Thursday, July 30, 2015 11:06 AM
To: user@spark.apache.org
Subject: How to register array class with Kryo in spark-defaults.conf

I register my class with Kryo in spark-defaults.conf as follows:

spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired  true
spark.kryo.classesToRegister     ltn.analytics.es.EsDoc

But I got the following exception

java.lang.IllegalArgumentException: Class is not registered: 
ltn.analytics.es.EsDoc[]
Note: To register this class use: kryo.register(ltn.analytics.es.EsDoc[].class);
at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
at 
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:162)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


The error message seems to suggest that I should also register the array class
EsDoc[]. So I added it to spark-defaults.conf as follows:

spark.kryo.classesToRegister  ltn.analytics.es.EsDoc,ltn.analytics.es.EsDoc[]

Then I got the following error

org.apache.spark.SparkException: Failed to register classes with Kryo
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:101)
at 
org.apache.spark.serializer.KryoSerializerInstance.init(KryoSerializer.scala:153)
at 
org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:115)
at 
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:200)
at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:101)
at 
org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:84)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051)
at ltn.analytics.index.Index.addDocuments(Index.scala:82)

Please advise.

Thanks.
Ningjun
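One untested way around the conf-file syntax is to register the array class
from a custom KryoRegistrator instead of spark.kryo.classesToRegister; the
registrator class name and its package below are placeholders:

  import com.esotericsoftware.kryo.Kryo
  import org.apache.spark.serializer.KryoRegistrator
  import ltn.analytics.es.EsDoc

  class EsDocRegistrator extends KryoRegistrator {
    override def registerClasses(kryo: Kryo): Unit = {
      kryo.register(classOf[EsDoc])          // the case class itself
      kryo.register(classOf[Array[EsDoc]])   // the EsDoc[] array class from the first exception
    }
  }

and then in spark-defaults.conf:

  spark.kryo.registrator   ltn.analytics.es.EsDocRegistrator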


RE: How to register array class with Kryo in spark-defaults.conf

2015-07-31 Thread Wang, Ningjun (LNG-NPV)
Here is the definition of EsDoc

case class EsDoc(id: Long, isExample: Boolean, docSetIds: Array[String], 
randomId: Double, vector: String) extends Serializable

Note that it is not EsDoc that has a problem with registration. It is EsDoc[]
(the array class of EsDoc) that has the problem.

I have tried replacing the EsDoc class with the Map class, and I got the
following error asking me to register the Map[] (array of Map) class:

java.lang.IllegalArgumentException: Class is not registered: 
scala.collection.immutable.Map[]
Note: To register this class use: 
kryo.register(scala.collection.immutable.Map[].class);

So the question is: how do I register an array class? Adding the following to
spark-defaults.conf does not work:

spark.kryo.classesToRegister  scala.collection.immutable.Map,scala.collection.immutable.Map[]

Ningjun

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, July 31, 2015 11:49 AM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to register array class with Kryo in spark-defaults.conf

For the second exception, was there anything following the SparkException that
would give us more of a clue?

Can you tell us how EsDoc is structured?

Thanks

On Fri, Jul 31, 2015 at 8:42 AM, Wang, Ningjun (LNG-NPV) 
ningjun.w...@lexisnexis.com wrote:
Does anybody have any idea how to solve this problem?

Ningjun

From: Wang, Ningjun (LNG-NPV)
Sent: Thursday, July 30, 2015 11:06 AM
To: user@spark.apache.org
Subject: How to register array class with Kryo in spark-defaults.conf

I register my class with Kryo in spark-defaults.conf as follows:

spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired  true
spark.kryo.classesToRegister     ltn.analytics.es.EsDoc

But I got the following exception

java.lang.IllegalArgumentException: Class is not registered: 
ltn.analytics.es.EsDoc[]
Note: To register this class use: kryo.register(ltn.analytics.es.EsDoc[].class);
at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
at 
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:162)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


The error message seems to suggest that I should also register the array class
EsDoc[]. So I added it to spark-defaults.conf as follows:

spark.kryo.classesToRegister  ltn.analytics.es.EsDoc,ltn.analytics.es.EsDoc[]

Then I got the following error

org.apache.spark.SparkException: Failed to register classes with Kryo
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:101)
at 
org.apache.spark.serializer.KryoSerializerInstance.init(KryoSerializer.scala:153)
at 
org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:115)
at 
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:200)
at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:101)
at 
org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:84)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051)
at ltn.analytics.index.Index.addDocuments(Index.scala:82)

Please advise.

Thanks.
Ningjun



How to register array class with Kryo in spark-defaults.conf

2015-07-30 Thread Wang, Ningjun (LNG-NPV)
I register my class with Kryo in spark-defaults.conf as follows:

spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired  true
spark.kryo.classesToRegister     ltn.analytics.es.EsDoc

But I got the following exception

java.lang.IllegalArgumentException: Class is not registered: 
ltn.analytics.es.EsDoc[]
Note: To register this class use: kryo.register(ltn.analytics.es.EsDoc[].class);
at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
at 
com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
at 
org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:162)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

The error message seems to suggest that I should also register the array class
EsDoc[]. So I added it to spark-defaults.conf as follows:

spark.kryo.classesToRegister  ltn.analytics.es.EsDoc,ltn.analytics.es.EsDoc[]

Then I got the following error

org.apache.spark.SparkException: Failed to register classes with Kryo
at 
org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:101)
at 
org.apache.spark.serializer.KryoSerializerInstance.init(KryoSerializer.scala:153)
at 
org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:115)
at 
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:200)
at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:101)
at 
org.apache.spark.broadcast.TorrentBroadcast.init(TorrentBroadcast.scala:84)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051)
at ltn.analytics.index.Index.addDocuments(Index.scala:82)
Please advise.

Thanks.
Ningjun


Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Kelly, Jonathan
Would there be any problem in having spark.executor.instances (or 
--num-executors) be completely ignored (i.e., even for non-zero values) if 
spark.dynamicAllocation.enabled is true (i.e., rather than throwing an 
exception)?

I can see how the exception would be helpful if, say, you tried to pass both 
-c spark.executor.instances (or --num-executors) *and* -c 
spark.dynamicAllocation.enabled=true to spark-submit on the command line (as 
opposed to having one of them in spark-defaults.conf and one of them in the 
spark-submit args), but currently there doesn't seem to be any way to 
distinguish between arguments that were actually passed to spark-submit and 
settings that simply came from spark-defaults.conf.

If there were a way to distinguish them, I think the ideal situation would be 
for the validation exception to be thrown only if spark.executor.instances and 
spark.dynamicAllocation.enabled=true were both passed via spark-submit args or 
were both present in spark-defaults.conf, but passing 
spark.dynamicAllocation.enabled=true to spark-submit would take precedence over 
spark.executor.instances configured in spark-defaults.conf, and vice versa.

Jonathan Kelly
Elastic MapReduce - SDE
Blackfoot (SEA33) 06.850.F0

From: Jonathan Kelly jonat...@amazon.com
Date: Tuesday, July 14, 2015 at 4:23 PM
To: user@spark.apache.org
Subject: Unable to use dynamicAllocation if spark.executor.instances is set in 
spark-defaults.conf

I've set up my cluster with a pre-calculated value for spark.executor.instances
in spark-defaults.conf such that I can run a job and have it maximize the 
utilization of the cluster resources by default. However, if I want to run a 
job with dynamicAllocation (by passing -c spark.dynamicAllocation.enabled=true 
to spark-submit), I get this exception:

Exception in thread main java.lang.IllegalArgumentException: Explicitly 
setting the number of executors is not compatible with 
spark.dynamicAllocation.enabled!
at 
org.apache.spark.deploy.yarn.ClientArguments.parseArgs(ClientArguments.scala:192)
at org.apache.spark.deploy.yarn.ClientArguments.init(ClientArguments.scala:59)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:54)
...

The exception makes sense, of course, but ideally I would like it to ignore 
what I've put in spark-defaults.conf for spark.executor.instances if I've 
enabled dynamicAllocation. The most annoying thing about this is that if I have 
spark.executor.instances present in spark-defaults.conf, I cannot figure out 
any way to spark-submit a job with spark.dynamicAllocation.enabled=true without 
getting this error. That is, even if I pass -c spark.executor.instances=0 -c 
spark.dynamicAllocation.enabled=true, I still get this error because the 
validation in ClientArguments.parseArgs() that's checking for this condition 
simply checks for the presence of spark.executor.instances rather than whether 
or not its value is > 0.

Should the check be changed to allow spark.executor.instances to be set to 0 if 
spark.dynamicAllocation.enabled is true? That would be an OK compromise, but 
I'd really prefer to be able to enable dynamicAllocation simply by setting 
spark.dynamicAllocation.enabled=true rather than by also having to set 
spark.executor.instances to 0.

Thanks,
Jonathan
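In the meantime, one untested workaround sketch is to keep a second properties
file that is identical to spark-defaults.conf minus spark.executor.instances,
and point spark-submit at it only for dynamic-allocation jobs (file and jar
names are placeholders):

  bin/spark-submit \
    --properties-file conf/spark-dynalloc.conf \
    --conf spark.dynamicAllocation.enabled=true \
    --conf spark.shuffle.service.enabled=true \
    my-app.jar

Since --properties-file is read in place of conf/spark-defaults.conf, the
conflicting spark.executor.instances setting should never reach the validation.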


Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Kelly, Jonathan
bump

From: Jonathan Kelly jonat...@amazon.com
Date: Tuesday, July 14, 2015 at 4:23 PM
To: user@spark.apache.org
Subject: Unable to use dynamicAllocation if spark.executor.instances is set in 
spark-defaults.conf

I've set up my cluster with a pre-calculated value for spark.executor.instances
in spark-defaults.conf such that I can run a job and have it maximize the 
utilization of the cluster resources by default. However, if I want to run a 
job with dynamicAllocation (by passing -c spark.dynamicAllocation.enabled=true 
to spark-submit), I get this exception:

Exception in thread main java.lang.IllegalArgumentException: Explicitly 
setting the number of executors is not compatible with 
spark.dynamicAllocation.enabled!
at 
org.apache.spark.deploy.yarn.ClientArguments.parseArgs(ClientArguments.scala:192)
at org.apache.spark.deploy.yarn.ClientArguments.init(ClientArguments.scala:59)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:54)
...

The exception makes sense, of course, but ideally I would like it to ignore 
what I've put in spark-defaults.conf for spark.executor.instances if I've 
enabled dynamicAllocation. The most annoying thing about this is that if I have 
spark.executor.instances present in spark-defaults.conf, I cannot figure out 
any way to spark-submit a job with spark.dynamicAllocation.enabled=true without 
getting this error. That is, even if I pass -c spark.executor.instances=0 -c 
spark.dynamicAllocation.enabled=true, I still get this error because the 
validation in ClientArguments.parseArgs() that's checking for this condition 
simply checks for the presence of spark.executor.instances rather than whether 
or not its value is > 0.

Should the check be changed to allow spark.executor.instances to be set to 0 if 
spark.dynamicAllocation.enabled is true? That would be an OK compromise, but 
I'd really prefer to be able to enable dynamicAllocation simply by setting 
spark.dynamicAllocation.enabled=true rather than by also having to set 
spark.executor.instances to 0.

Thanks,
Jonathan


Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Sandy Ryza
Hi Jonathan,

This is a problem that has come up for us as well, because we'd like
dynamic allocation to be turned on by default in some setups, but not break
existing users with these properties.  I'm hoping to figure out a way to
reconcile these by Spark 1.5.

-Sandy

On Wed, Jul 15, 2015 at 3:18 PM, Kelly, Jonathan jonat...@amazon.com
wrote:

   Would there be any problem in having spark.executor.instances (or
 --num-executors) be completely ignored (i.e., even for non-zero values) if
 spark.dynamicAllocation.enabled is true (i.e., rather than throwing an
 exception)?

  I can see how the exception would be helpful if, say, you tried to pass
 both -c spark.executor.instances (or --num-executors) *and* -c
 spark.dynamicAllocation.enabled=true to spark-submit on the command line
 (as opposed to having one of them in spark-defaults.conf and one of them in
 the spark-submit args), but currently there doesn't seem to be any way to
 distinguish between arguments that were actually passed to spark-submit and
 settings that simply came from spark-defaults.conf.

  If there were a way to distinguish them, I think the ideal situation
 would be for the validation exception to be thrown only if
 spark.executor.instances and spark.dynamicAllocation.enabled=true were both
 passed via spark-submit args or were both present in spark-defaults.conf,
 but passing spark.dynamicAllocation.enabled=true to spark-submit would take
 precedence over spark.executor.instances configured in spark-defaults.conf,
 and vice versa.


  Jonathan Kelly

 Elastic MapReduce - SDE

 Blackfoot (SEA33) 06.850.F0

   From: Jonathan Kelly jonat...@amazon.com
 Date: Tuesday, July 14, 2015 at 4:23 PM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Unable to use dynamicAllocation if spark.executor.instances is
 set in spark-defaults.conf

  I've set up my cluster with a pre-calculated value for
 spark.executor.instances in spark-defaults.conf such that I can run a job
 and have it maximize the utilization of the cluster resources by default.
 However, if I want to run a job with dynamicAllocation (by passing -c
 spark.dynamicAllocation.enabled=true to spark-submit), I get this exception:

  Exception in thread main java.lang.IllegalArgumentException:
 Explicitly setting the number of executors is not compatible with
 spark.dynamicAllocation.enabled!
 at
 org.apache.spark.deploy.yarn.ClientArguments.parseArgs(ClientArguments.scala:192)
 at
 org.apache.spark.deploy.yarn.ClientArguments.init(ClientArguments.scala:59)
 at
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:54)
  …

  The exception makes sense, of course, but ideally I would like it to
 ignore what I've put in spark-defaults.conf for spark.executor.instances if
 I've enabled dynamicAllocation. The most annoying thing about this is that
 if I have spark.executor.instances present in spark-defaults.conf, I cannot
 figure out any way to spark-submit a job with
 spark.dynamicAllocation.enabled=true without getting this error. That is,
 even if I pass -c spark.executor.instances=0 -c
 spark.dynamicAllocation.enabled=true, I still get this error because the
 validation in ClientArguments.parseArgs() that's checking for this
 condition simply checks for the presence of spark.executor.instances rather
 than whether or not its value is > 0.

  Should the check be changed to allow spark.executor.instances to be set
 to 0 if spark.dynamicAllocation.enabled is true? That would be an OK
 compromise, but I'd really prefer to be able to enable dynamicAllocation
 simply by setting spark.dynamicAllocation.enabled=true rather than by also
 having to set spark.executor.instances to 0.


  Thanks,

 Jonathan



Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-15 Thread Andrew Or
Yeah, we could make it log a warning instead.

2015-07-15 14:29 GMT-07:00 Kelly, Jonathan jonat...@amazon.com:

  Thanks! Is there an existing JIRA I should watch?


  ~ Jonathan

   From: Sandy Ryza sandy.r...@cloudera.com
 Date: Wednesday, July 15, 2015 at 2:27 PM
 To: Jonathan Kelly jonat...@amazon.com
 Cc: user@spark.apache.org user@spark.apache.org
 Subject: Re: Unable to use dynamicAllocation if spark.executor.instances
 is set in spark-defaults.conf

   Hi Jonathan,

  This is a problem that has come up for us as well, because we'd like
 dynamic allocation to be turned on by default in some setups, but not break
 existing users with these properties.  I'm hoping to figure out a way to
 reconcile these by Spark 1.5.

  -Sandy

 On Wed, Jul 15, 2015 at 3:18 PM, Kelly, Jonathan jonat...@amazon.com
 wrote:

   Would there be any problem in having spark.executor.instances (or
 --num-executors) be completely ignored (i.e., even for non-zero values) if
 spark.dynamicAllocation.enabled is true (i.e., rather than throwing an
 exception)?

  I can see how the exception would be helpful if, say, you tried to pass
 both -c spark.executor.instances (or --num-executors) *and* -c
 spark.dynamicAllocation.enabled=true to spark-submit on the command line
 (as opposed to having one of them in spark-defaults.conf and one of them in
 the spark-submit args), but currently there doesn't seem to be any way to
 distinguish between arguments that were actually passed to spark-submit and
 settings that simply came from spark-defaults.conf.

  If there were a way to distinguish them, I think the ideal situation
 would be for the validation exception to be thrown only if
 spark.executor.instances and spark.dynamicAllocation.enabled=true were both
 passed via spark-submit args or were both present in spark-defaults.conf,
 but passing spark.dynamicAllocation.enabled=true to spark-submit would take
 precedence over spark.executor.instances configured in spark-defaults.conf,
 and vice versa.


  Jonathan Kelly

 Elastic MapReduce - SDE

 Blackfoot (SEA33) 06.850.F0

   From: Jonathan Kelly jonat...@amazon.com
 Date: Tuesday, July 14, 2015 at 4:23 PM
 To: user@spark.apache.org user@spark.apache.org
 Subject: Unable to use dynamicAllocation if spark.executor.instances is
 set in spark-defaults.conf

I've set up my cluster with a pre-calculated value for
 spark.executor.instances in spark-defaults.conf such that I can run a job
 and have it maximize the utilization of the cluster resources by default.
 However, if I want to run a job with dynamicAllocation (by passing -c
 spark.dynamicAllocation.enabled=true to spark-submit), I get this exception:

  Exception in thread main java.lang.IllegalArgumentException:
 Explicitly setting the number of executors is not compatible with
 spark.dynamicAllocation.enabled!
 at
 org.apache.spark.deploy.yarn.ClientArguments.parseArgs(ClientArguments.scala:192)
 at
 org.apache.spark.deploy.yarn.ClientArguments.init(ClientArguments.scala:59)
 at
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:54)
  …

  The exception makes sense, of course, but ideally I would like it to
 ignore what I've put in spark-defaults.conf for spark.executor.instances if
 I've enabled dynamicAllocation. The most annoying thing about this is that
 if I have spark.executor.instances present in spark-defaults.conf, I cannot
 figure out any way to spark-submit a job with
 spark.dynamicAllocation.enabled=true without getting this error. That is,
 even if I pass -c spark.executor.instances=0 -c
 spark.dynamicAllocation.enabled=true, I still get this error because the
 validation in ClientArguments.parseArgs() that's checking for this
 condition simply checks for the presence of spark.executor.instances rather
 than whether or not its value is > 0.

  Should the check be changed to allow spark.executor.instances to be set
 to 0 if spark.dynamicAllocation.enabled is true? That would be an OK
 compromise, but I'd really prefer to be able to enable dynamicAllocation
 simply by setting spark.dynamicAllocation.enabled=true rather than by also
 having to set spark.executor.instances to 0.


  Thanks,

 Jonathan





Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf

2015-07-14 Thread Kelly, Jonathan
I've set up my cluster with a pre-calculated value for spark.executor.instances
in spark-defaults.conf such that I can run a job and have it maximize the 
utilization of the cluster resources by default. However, if I want to run a 
job with dynamicAllocation (by passing -c spark.dynamicAllocation.enabled=true 
to spark-submit), I get this exception:

Exception in thread main java.lang.IllegalArgumentException: Explicitly 
setting the number of executors is not compatible with 
spark.dynamicAllocation.enabled!
at 
org.apache.spark.deploy.yarn.ClientArguments.parseArgs(ClientArguments.scala:192)
at org.apache.spark.deploy.yarn.ClientArguments.init(ClientArguments.scala:59)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:54)
...

The exception makes sense, of course, but ideally I would like it to ignore 
what I've put in spark-defaults.conf for spark.executor.instances if I've 
enabled dynamicAllocation. The most annoying thing about this is that if I have 
spark.executor.instances present in spark-defaults.conf, I cannot figure out 
any way to spark-submit a job with spark.dynamicAllocation.enabled=true without 
getting this error. That is, even if I pass -c spark.executor.instances=0 -c 
spark.dynamicAllocation.enabled=true, I still get this error because the 
validation in ClientArguments.parseArgs() that's checking for this condition 
simply checks for the presence of spark.executor.instances rather than whether 
or not its value is > 0.

Should the check be changed to allow spark.executor.instances to be set to 0 if 
spark.dynamicAllocation.enabled is true? That would be an OK compromise, but 
I'd really prefer to be able to enable dynamicAllocation simply by setting 
spark.dynamicAllocation.enabled=true rather than by also having to set 
spark.executor.instances to 0.

Thanks,
Jonathan


Re: Difference between spark-defaults.conf and SparkConf.set

2015-07-01 Thread yana
Thanks. Without spark-submit, it seems the more straightforward solution is to
just pass it on the driver's classpath. I was more surprised that the same conf
parameter had different behavior depending on where it's specified: program vs.
spark-defaults. I'm all set now -- thanks for replying.

 Original message 
From: Akhil Das ak...@sigmoidanalytics.com
Date: 07/01/2015 2:27 AM (GMT-05:00)
To: Yana Kadiyska yana.kadiy...@gmail.com
Cc: user@spark.apache.org
Subject: Re: Difference between spark-defaults.conf and SparkConf.set

.addJar works for me when I run it as a stand-alone application (without
using spark-submit)

Thanks
Best Regards

On Tue, Jun 30, 2015 at 7:47 PM, Yana Kadiyska yana.kadiy...@gmail.com wrote:
Hi folks, running into a pretty strange issue:

I'm setting
spark.executor.extraClassPath 
spark.driver.extraClassPath

to point to some external JARs. If I set them in spark-defaults.conf everything 
works perfectly.
However, if I remove spark-defaults.conf and just create a SparkConf and call 
.set("spark.executor.extraClassPath", ...)
.set("spark.driver.extraClassPath", ...)

I get ClassNotFound exceptions from Hadoop Conf:

Caused by: java.lang.ClassNotFoundException: Class 
org.apache.hadoop.fs.ceph.CephFileSystem not found
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1585)

This seems like a bug to me -- or does spark-defaults.conf somehow get 
processed differently?

I have dumped out sparkConf.toDebugString and in both cases 
(spark-defaults.conf/in code sets) it seems to have the same values in it...



Re: Difference between spark-defaults.conf and SparkConf.set

2015-07-01 Thread Akhil Das
.addJar works for me when I run it as a stand-alone application (without
using spark-submit)

Thanks
Best Regards

On Tue, Jun 30, 2015 at 7:47 PM, Yana Kadiyska yana.kadiy...@gmail.com
wrote:

 Hi folks, running into a pretty strange issue:

 I'm setting
 spark.executor.extraClassPath
 spark.driver.extraClassPath

 to point to some external JARs. If I set them in spark-defaults.conf
 everything works perfectly.
 However, if I remove spark-defaults.conf and just create a SparkConf and
 call
 .set("spark.executor.extraClassPath", ...)
 .set("spark.driver.extraClassPath", ...)

 I get ClassNotFound exceptions from Hadoop Conf:

 Caused by: java.lang.ClassNotFoundException: Class 
 org.apache.hadoop.fs.ceph.CephFileSystem not found
 at 
 org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
 at 
 org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1585)

 ​

 This seems like a bug to me -- or does spark-defaults.conf somehow get
 processed differently?

 I have dumped out sparkConf.toDebugString and in both cases
 (spark-defaults.conf/in code sets) it seems to have the same values in it...



Difference between spark-defaults.conf and SparkConf.set

2015-06-30 Thread Yana Kadiyska
Hi folks, running into a pretty strange issue:

I'm setting
spark.executor.extraClassPath
spark.driver.extraClassPath

to point to some external JARs. If I set them in spark-defaults.conf
everything works perfectly.
However, if I remove spark-defaults.conf and just create a SparkConf and
call
.set("spark.executor.extraClassPath", ...)
.set("spark.driver.extraClassPath", ...)

I get ClassNotFound exceptions from Hadoop Conf:

Caused by: java.lang.ClassNotFoundException: Class
org.apache.hadoop.fs.ceph.CephFileSystem not found
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1585)

​

This seems like a bug to me -- or does spark-defaults.conf somehow get
processed differently?

I have dumped out sparkConf.toDebugString and in both cases
(spark-defaults.conf/in code sets) it seems to have the same values in it...
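A minimal workaround sketch, assuming the application is launched through
spark-submit and the JAR lives at a hypothetical /opt/ext/ path: the driver's
classpath generally has to be known before the driver JVM starts, so passing
it at launch time (or via spark-defaults.conf) tends to work where a late
SparkConf.set does not:

  bin/spark-submit \
    --driver-class-path /opt/ext/hadoop-ceph.jar \
    --conf spark.executor.extraClassPath=/opt/ext/hadoop-ceph.jar \
    my-app.jar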


Re: spark-defaults.conf

2015-04-28 Thread James King
So no takers regarding why spark-defaults.conf is not being picked up.

Here is another one:

If Zookeeper is configured in Spark why do we need to start a slave like
this:

spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh 1 spark://somemaster:7077

i.e. why do we need to specify the master URL explicitly?

Shouldn't Spark just consult ZK and use the active master?

Or is ZK only used during failure?


On Mon, Apr 27, 2015 at 1:53 PM, James King jakwebin...@gmail.com wrote:

 Thanks.

 I've set SPARK_HOME and SPARK_CONF_DIR appropriately in .bash_profile

 But when I start worker like this

 spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh

 I still get

 failed to launch org.apache.spark.deploy.worker.Worker:
  Default is conf/spark-defaults.conf.
   15/04/27 11:51:33 DEBUG Utils: Shutdown hook called





 On Mon, Apr 27, 2015 at 1:15 PM, Zoltán Zvara zoltan.zv...@gmail.com
 wrote:

 You should distribute your configuration file to workers and set the
 appropriate environment variables, like HADOOP_HOME, SPARK_HOME,
 HADOOP_CONF_DIR, SPARK_CONF_DIR.

 On Mon, Apr 27, 2015 at 12:56 PM James King jakwebin...@gmail.com
 wrote:

 I renamed spark-defaults.conf.template to spark-defaults.conf
 and invoked

 spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh

 But I still get

 failed to launch org.apache.spark.deploy.worker.Worker:
 --properties-file FILE   Path to a custom Spark properties file.
  Default is conf/spark-defaults.conf.

 But I'm thinking it should pick up the default spark-defaults.conf from
 conf dir

 Am I expecting or doing something wrong?

 Regards
 jk






spark-defaults.conf

2015-04-27 Thread James King
I renamed spark-defaults.conf.template to spark-defaults.conf
and invoked

spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh

But I still get

failed to launch org.apache.spark.deploy.worker.Worker:
--properties-file FILE   Path to a custom Spark properties file.
 Default is conf/spark-defaults.conf.

But I'm thinking it should pick up the default spark-defaults.conf from
conf dir

Am I expecting or doing something wrong?

Regards
jk
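For reference, a launch sketch based on the invocation shown elsewhere in this
thread; in this Spark version start-slave.sh appears to expect a worker number
and the master URL as arguments, and the usage text above is what the Worker
class prints when launched without them (path and hostname are placeholders):

  export SPARK_CONF_DIR=/path/to/spark-1.3.0-bin-hadoop2.4/conf
  spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh 1 spark://somemaster:7077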


Re: spark-defaults.conf

2015-04-27 Thread Zoltán Zvara
You should distribute your configuration file to workers and set the
appropriate environment variables, like HADOOP_HOME, SPARK_HOME,
HADOOP_CONF_DIR, SPARK_CONF_DIR.

On Mon, Apr 27, 2015 at 12:56 PM James King jakwebin...@gmail.com wrote:

 I renamed spark-defaults.conf.template to spark-defaults.conf
 and invoked

 spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh

 But I still get

 failed to launch org.apache.spark.deploy.worker.Worker:
 --properties-file FILE   Path to a custom Spark properties file.
  Default is conf/spark-defaults.conf.

 But I'm thinking it should pick up the default spark-defaults.conf from
 conf dir

 Am I expecting or doing something wrong?

 Regards
 jk





Re: spark-defaults.conf

2015-04-27 Thread James King
Thanks.

I've set SPARK_HOME and SPARK_CONF_DIR appropriately in .bash_profile

But when I start worker like this

spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh

I still get

failed to launch org.apache.spark.deploy.worker.Worker:
 Default is conf/spark-defaults.conf.
  15/04/27 11:51:33 DEBUG Utils: Shutdown hook called





On Mon, Apr 27, 2015 at 1:15 PM, Zoltán Zvara zoltan.zv...@gmail.com
wrote:

 You should distribute your configuration file to workers and set the
 appropriate environment variables, like HADOOP_HOME, SPARK_HOME,
 HADOOP_CONF_DIR, SPARK_CONF_DIR.

 On Mon, Apr 27, 2015 at 12:56 PM James King jakwebin...@gmail.com wrote:

 I renamed spark-defaults.conf.template to spark-defaults.conf
 and invoked

 spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh

 But I still get

 failed to launch org.apache.spark.deploy.worker.Worker:
 --properties-file FILE   Path to a custom Spark properties file.
  Default is conf/spark-defaults.conf.

 But I'm thinking it should pick up the default spark-defaults.conf from
 conf dir

 Am I expecting or doing something wrong?

 Regards
 jk





Spark 1.2, trying to run spark-history as a service, spark-defaults.conf are ignored

2015-04-14 Thread Serega Sheypak
Here is related problem:
http://apache-spark-user-list.1001560.n3.nabble.com/Launching-history-server-problem-td12574.html

but no answer.
What I'm trying to do: wrap spark-history with an /etc/init.d script.
The problem I have: I can't make it read spark-defaults.conf.
I've put this file in:
/etc/spark/conf
/usr/lib/spark/conf (where /usr/lib/spark is the location for Spark)
No luck.

spark-history tries to use the default value for the application log location;
it doesn't read the specified value from spark-defaults.conf.
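An untested init-script sketch: SPARK_CONF_DIR points the launcher scripts at
the conf directory, and SPARK_HISTORY_OPTS can pass the log directory to the
history server JVM directly if the conf file still is not picked up (paths are
placeholders):

  export SPARK_CONF_DIR=/etc/spark/conf
  export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs:///user/spark/applicationHistory"
  /usr/lib/spark/sbin/start-history-server.sh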


Can value in spark-defaults.conf support system variables?

2014-09-01 Thread Zhanfeng Huo
Hi,all:

Can values in spark-defaults.conf use system variables?

For example, mess = ${user.home}/${user.name}.

Best Regards



Zhanfeng Huo


Re: Can value in spark-defaults.conf support system variables?

2014-09-01 Thread Andrew Or
No, not currently.
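A common workaround sketch: let the shell expand the variable at submit time
instead of relying on the conf file (spark.eventLog.dir and my-app.jar are only
examples, and --conf needs a reasonably recent spark-submit):

  spark-submit --conf "spark.eventLog.dir=hdfs:///user/${USER}/spark/logs" my-app.jar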


2014-09-01 2:53 GMT-07:00 Zhanfeng Huo huozhanf...@gmail.com:

 Hi,all:

 Can value in spark-defaults.conf support system variables?

 Such as mess = ${user.home}/${user.name}.

 Best Regards

 --
 Zhanfeng Huo



Issues on spark-shell and spark-submit behave differently on spark-defaults.conf parameter spark.eventLog.dir

2014-07-28 Thread Andrew Lee
Hi All,
Not sure if anyone has run into this problem, but this exists in Spark 1.0.0 
when you specify the location in conf/spark-defaults.conf for
spark.eventLog.dir hdfs:///user/$USER/spark/logs
to use the $USER env variable. 
For example, I'm running the command with user 'test'.
In spark-submit, the folder will be created on-the-fly and you will see the 
event logs created on HDFS /user/test/spark/logs/spark-pi-1405097484152
but in spark-shell, the user 'test' folder is not created, and you will see 
this /user/$USER/spark/logs on HDFS. It will try to create 
/user/$USER/spark/logs instead of /user/test/spark/logs.
It looks like spark-shell couldn't pick up the env variable $USER to apply for 
the eventLog directory for the running user 'test'.
Is this considered a bug or bad practice to use spark-shell with Spark's 
HistoryServer?








  

Re: Issues on spark-shell and spark-submit behave differently on spark-defaults.conf parameter spark.eventLog.dir

2014-07-28 Thread Andrew Or
Hi Andrew,

It's definitely not bad practice to use spark-shell with HistoryServer. The
issue here is not with spark-shell, but the way we pass Spark configs to
the application. spark-defaults.conf does not currently support embedding
environment variables, but instead interprets everything as a string
literal. You will have to manually specify "test" instead of $USER in the
path you provide to spark.eventLog.dir.

-Andrew
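A possible per-user workaround sketch (untested): since the shell expands
environment variables before Spark ever sees them, passing the setting on the
command line, or keeping a per-user properties file and pointing at it with
--properties-file, sidesteps the literal-string behaviour of spark-defaults.conf
(paths are placeholders):

  spark-shell --conf spark.eventLog.dir=hdfs:///user/$USER/spark/logs
  # or, on versions without --conf:
  spark-shell --properties-file /home/$USER/spark-defaults.conf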


2014-07-28 12:40 GMT-07:00 Andrew Lee alee...@hotmail.com:

 Hi All,

 Not sure if anyone has ran into this problem, but this exist in spark
 1.0.0 when you specify the location in *conf/spark-defaults.conf* for

 spark.eventLog.dir hdfs:///user/$USER/spark/logs

 to use the *$USER* env variable.

 For example, I'm running the command with user 'test'.

 In *spark-submit*, the folder will be created on-the-fly and you will see
 the event logs created on HDFS
 */user/test/spark/logs/spark-pi-1405097484152*

 but in *spark-shell*, the user 'test' folder is not created, and you will
 see this */user/$USER/spark/logs* on HDFS. It will try to create
 */user/$USER/spark/logs* instead of */user/test/spark/logs*.

 It looks like spark-shell couldn't pick up the env variable $USER to apply
 for the eventLog directory for the running user 'test'.

 Is this considered a bug or bad practice to use spark-shell with Spark's
 HistoryServer?




RE: Issues on spark-shell and spark-submit behave differently on spark-defaults.conf parameter spark.eventLog.dir

2014-07-28 Thread Andrew Lee
Hi Andrew,
Thanks for re-confirming the problem. I thought it only happened with my own build. :)
By the way, we have multiple users using spark-shell to explore their
datasets, and we are continuously looking into ways to isolate their job
history. In the current situation, we can't really ask them to create their own
spark-defaults.conf since it is set to read-only. A workaround is to set the
log directory to a shared folder, e.g. /user/spark/logs with permission 1777.
This isn't really ideal since other people can see what other jobs are running
on the shared cluster.
It would be nice to have better security here so people aren't exposing their
algorithms (which are usually embedded in their job names) to other users.
Is there (or will there be) a JIRA ticket to keep track of this? Any plans to
enhance this part of spark-shell?


Date: Mon, 28 Jul 2014 13:54:56 -0700
Subject: Re: Issues on spark-shell and spark-submit behave differently on 
spark-defaults.conf parameter spark.eventLog.dir
From: and...@databricks.com
To: user@spark.apache.org

Hi Andrew,
It's definitely not bad practice to use spark-shell with HistoryServer. The 
issue here is not with spark-shell, but the way we pass Spark configs to the 
application. spark-defaults.conf does not currently support embedding 
environment variables, but instead interprets everything as a string literal. 
You will have to manually specify test instead of $USER in the path you 
provide to spark.eventLog.dir.

-Andrew

2014-07-28 12:40 GMT-07:00 Andrew Lee alee...@hotmail.com:




Hi All,
Not sure if anyone has run into this problem, but this exists in Spark 1.0.0 
when you specify the location in conf/spark-defaults.conf for

spark.eventLog.dir hdfs:///user/$USER/spark/logs
to use the $USER env variable. 

For example, I'm running the command with user 'test'.
In spark-submit, the folder will be created on-the-fly and you will see the 
event logs created on HDFS /user/test/spark/logs/spark-pi-1405097484152

but in spark-shell, the user 'test' folder is not created, and you will see 
this /user/$USER/spark/logs on HDFS. It will try to create 
/user/$USER/spark/logs instead of /user/test/spark/logs.

It looks like spark-shell couldn't pick up the env variable $USER to apply for 
the eventLog directory for the running user 'test'.

Is this considered a bug or bad practice to use spark-shell with Spark's 
HistoryServer?