Re: spark-defaults.conf optimal configuration
Hello Neelesh,

Thank you for the checklist for determining the correct configuration of Spark. I will go through these and let you know if I have further questions.

Regards,
Chris
Re: spark-defaults.conf optimal configuration
Hi Chris,

Thank you for posting the question. Tuning Spark configurations is a tricky task, since there are a lot of factors to consider. The configurations you listed cover most of them. A few questions whose answers should guide the tuning decisions:

1) What kind of Spark applications do you intend to run?
2) Which cluster manager have you decided to go with?
3) How frequently will these applications run? (For the sake of scheduling.)
4) Will multiple users share this?
5) What else do you have in the cluster that will interact with Spark? (For the sake of resolving dependencies.)

Personally, I would suggest answering these questions before jumping into tuning. A cluster manager like YARN helps frame the settings for cores and memory, since the applications have to be considered for scheduling; see the sketch below.

Hope that helps to start off in the right direction.

- Neelesh S. Salian
Cloudera
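For illustration only, a minimal YARN-oriented sketch of such settings. The values are placeholders, not recommendations; they assume roughly 8-core, 64 GB NodeManagers and should be derived from yarn.nodemanager.resource.memory-mb and the vcore settings on your own nodes:

    # spark-defaults.conf (hypothetical starting point for YARN)
    spark.master                        yarn-client
    spark.executor.cores                4
    spark.executor.memory               16g
    # Off-heap overhead YARN adds to each container request, in MB
    spark.yarn.executor.memoryOverhead  2048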
spark-defaults.conf optimal configuration
I am seeking help with a Spark configuration running queries against a cluster of 6 machines. Each machine runs Spark 1.5.1, with slaves started on all 6 and 1 acting as master/thriftserver. From Beeline I query 2 tables that have 300M and 31M rows respectively. My queries return up to 500M rows when run against Oracle, but Spark errors at anything more than 5.5M rows.

I believe there is an optimal memory configuration that must be set for each of the workers in our cluster, but I have not been able to determine that setting. Is there something better than trial and error? Are there settings to avoid, such as making sure not to set spark.driver.maxResultSize > spark.driver.memory? Is there a formula or guidelines by which to calculate the correct Spark configuration values given a machine's available cores and memory resources?

This is my current configuration:

BDA v3 server: SUN SERVER X4-2L
Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
CPU cores: 32
GB of memory (>=63): 63
Number of disks: 12

spark-defaults.conf:

spark.driver.memory 20g
spark.executor.memory 40g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.rpc.askTimeout 6000s
spark.rpc.lookupTimeout 3000s
spark.driver.maxResultSize 20g
spark.rdd.compress true
spark.storage.memoryFraction 1
spark.core.connection.ack.wait.timeout 600
spark.akka.frameSize 500
spark.shuffle.compress true
spark.shuffle.file.buffer 128k
spark.shuffle.memoryFraction 0
spark.shuffle.spill.compress true
spark.shuffle.spill true

Thank you,
Chris
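A few of the values above are worth flagging. In Spark 1.5, spark.storage.memoryFraction defaults to 0.6 and spark.shuffle.memoryFraction to 0.2; setting them to 1 and 0 leaves no memory for shuffle aggregation and forces constant spilling, and a spark.driver.maxResultSize equal to spark.driver.memory means a maximal result can exhaust the driver heap on its own. A hedged sketch of a more conventional starting point, assuming the 63 GB / 32-core nodes described above (untested, for illustration only):

    spark.driver.memory           20g
    # Keep well below spark.driver.memory to leave driver headroom
    spark.driver.maxResultSize    10g
    spark.executor.memory         40g
    # Spark 1.5 defaults; the extreme values 1 and 0 starve execution
    spark.storage.memoryFraction  0.6
    spark.shuffle.memoryFraction  0.2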
Re: spark-submit not using conf/spark-defaults.conf
I think it's a missing feature.

On Wed, Sep 2, 2015 at 10:58 PM, Axel Dahl <a...@whisperstream.com> wrote:
> So a bit more investigation shows that if I have configured spark-defaults.conf with
> "spark.files library.py" and then call "spark-submit.py -v test.py", my "spark.files"
> default option has been replaced with "spark.files test.py" -- basically spark-submit
> is overwriting spark.files with the name of the script.
>
> Is this a bug, or is there another way to add default libraries without having to
> specify them on the command line?
>
> -Axel
Re: spark-submit not using conf/spark-defaults.conf
Logged it here: https://issues.apache.org/jira/browse/SPARK-10436

On Thu, Sep 3, 2015 at 10:32 AM, Davies Liu <dav...@databricks.com> wrote:
> I think it's a missing feature.
spark-submit not using conf/spark-defaults.conf
In my spark-defaults.conf I have:

spark.files file1.zip, file2.py
spark.master spark://master.domain.com:7077

If I execute:

bin/pyspark

I can see it adding the files correctly. However, if I execute:

bin/spark-submit test.py

where test.py relies on file1.zip, I get an error. If instead I execute:

bin/spark-submit --py-files file1.zip test.py

it works as expected.

How do I get spark-submit to import the spark-defaults.conf file, or what should I start checking to figure out why one works and the other doesn't?

Thanks,

-Axel
Re: spark-submit not using conf/spark-defaults.conf
This should be a bug, could you create a JIRA for it?

On Wed, Sep 2, 2015 at 4:38 PM, Axel Dahl <a...@whisperstream.com> wrote:
> In my spark-defaults.conf I have spark.files set to "file1.zip, file2.py". bin/pyspark
> picks the files up correctly, but "bin/spark-submit test.py" fails where test.py relies
> on file1.zip, while "bin/spark-submit --py-files file1.zip test.py" works as expected.
Re: spark-submit not using conf/spark-defaults.conf
So a bit more investigation shows that if I have configured spark-defaults.conf with:

"spark.files library.py"

and then call:

"spark-submit.py -v test.py"

I see that my "spark.files" default option has been replaced with "spark.files test.py" -- basically spark-submit is overwriting spark.files with the name of the script.

Is this a bug, or is there another way to add default libraries without having to specify them on the command line?

Thanks,

-Axel

On Wed, Sep 2, 2015 at 10:34 PM, Davies Liu <dav...@databricks.com> wrote:
> This should be a bug, could you create a JIRA for it?
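Until this is fixed, a sketch of the workaround implied by the thread: list Python dependencies explicitly per submit rather than in spark-defaults.conf, optionally via a small wrapper script. The file names and the DEFAULT_PYFILES variable are placeholders, not Spark settings:

    # Works today: pass the deps explicitly on each submit
    bin/spark-submit --py-files file1.zip test.py

    # Hypothetical wrapper restoring "default libraries" behaviour
    DEFAULT_PYFILES="file1.zip,file2.py"
    bin/spark-submit --py-files "$DEFAULT_PYFILES" "$@"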
RE: How to register array class with Kryo in spark-defaults.conf
Does anybody have any idea how to solve this problem?

Ningjun

From: Wang, Ningjun (LNG-NPV)
Sent: Thursday, July 30, 2015 11:06 AM
To: user@spark.apache.org
Subject: How to register array class with Kryo in spark-defaults.conf

I register my class with Kryo in spark-defaults.conf as follows:

spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired true
spark.kryo.classesToRegister ltn.analytics.es.EsDoc

But I got the following exception:

java.lang.IllegalArgumentException: Class is not registered: ltn.analytics.es.EsDoc[]
Note: To register this class use: kryo.register(ltn.analytics.es.EsDoc[].class);
RE: How to register array class with Kryo in spark-defaults.conf
Here is the definition of EsDoc:

case class EsDoc(id: Long, isExample: Boolean, docSetIds: Array[String], randomId: Double, vector: String) extends Serializable

Note that it is not EsDoc that has a problem with registration; it is EsDoc[] (the array class of EsDoc). When I replace EsDoc with the Map class, I get the same kind of error asking me to register Map[] (the array of Map):

java.lang.IllegalArgumentException: Class is not registered: scala.collection.immutable.Map[]
Note: To register this class use: kryo.register(scala.collection.immutable.Map[].class);

So the question is: how do I register an array class? Adding the following in spark-defaults.conf does not work:

spark.kryo.classesToRegister scala.collection.immutable.Map,scala.collection.immutable.Map[]

Ningjun

From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Friday, July 31, 2015 11:49 AM
To: Wang, Ningjun (LNG-NPV)
Cc: user@spark.apache.org
Subject: Re: How to register array class with Kryo in spark-defaults.conf

For the second exception, was there anything following SparkException which would give us more of a clue? Can you tell us how EsDoc is structured?

Thanks
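For what it's worth, one approach that should work on Spark 1.x is to register the classes programmatically: SparkConf.registerKryoClasses takes Class objects, and classOf[Array[EsDoc]] hands Kryo the array class directly. A minimal sketch, assuming the EsDoc definition above:

    import org.apache.spark.{SparkConf, SparkContext}
    import ltn.analytics.es.EsDoc

    val conf = new SparkConf()
      .setAppName("kryo-array-registration")
      .set("spark.kryo.registrationRequired", "true")
      // registerKryoClasses switches spark.serializer to KryoSerializer
      // and appends the class names to spark.kryo.classesToRegister
      .registerKryoClasses(Array(
        classOf[EsDoc],
        classOf[Array[EsDoc]]  // the EsDoc[] array class
      ))
    val sc = new SparkContext(conf)

If the registration must stay in spark-defaults.conf, the JVM binary name for the array class ([Lltn.analytics.es.EsDoc;) may work where the EsDoc[] notation printed in the error message does not, since Spark resolves these names with Class.forName. This is untested, so treat it as a guess:

    spark.kryo.classesToRegister ltn.analytics.es.EsDoc,[Lltn.analytics.es.EsDoc;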
How to register array class with Kryo in spark-defaults.conf
I register my class with Kryo in spark-defaults.conf as follows:

spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryo.registrationRequired true
spark.kryo.classesToRegister ltn.analytics.es.EsDoc

But I got the following exception:

java.lang.IllegalArgumentException: Class is not registered: ltn.analytics.es.EsDoc[]
Note: To register this class use: kryo.register(ltn.analytics.es.EsDoc[].class);
        at com.esotericsoftware.kryo.Kryo.getRegistration(Kryo.java:442)
        at com.esotericsoftware.kryo.util.DefaultClassResolver.writeClass(DefaultClassResolver.java:79)
        at com.esotericsoftware.kryo.Kryo.writeClass(Kryo.java:472)
        at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:565)
        at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:162)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

The error message seems to suggest that I should also register the array class EsDoc[], so I added it to spark-defaults.conf as follows:

spark.kryo.classesToRegister ltn.analytics.es.EsDoc,ltn.analytics.es.EsDoc[]

Then I got the following error:

org.apache.spark.SparkException: Failed to register classes with Kryo
        at org.apache.spark.serializer.KryoSerializer.newKryo(KryoSerializer.scala:101)
        at org.apache.spark.serializer.KryoSerializerInstance.<init>(KryoSerializer.scala:153)
        at org.apache.spark.serializer.KryoSerializer.newInstance(KryoSerializer.scala:115)
        at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:200)
        at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:101)
        at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:84)
        at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
        at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:29)
        at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
        at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1051)
        at ltn.analytics.index.Index.addDocuments(Index.scala:82)

Please advise. Thanks.

Ningjun
Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf
Would there be any problem in having spark.executor.instances (or --num-executors) be completely ignored (i.e., even for non-zero values) if spark.dynamicAllocation.enabled is true, rather than throwing an exception?

I can see how the exception would be helpful if, say, you tried to pass both -c spark.executor.instances (or --num-executors) *and* -c spark.dynamicAllocation.enabled=true to spark-submit on the command line (as opposed to having one of them in spark-defaults.conf and one in the spark-submit args), but currently there doesn't seem to be any way to distinguish between arguments that were actually passed to spark-submit and settings that simply came from spark-defaults.conf. If there were a way to distinguish them, I think the ideal situation would be for the validation exception to be thrown only if spark.executor.instances and spark.dynamicAllocation.enabled=true were both passed via spark-submit args or were both present in spark-defaults.conf; passing spark.dynamicAllocation.enabled=true to spark-submit would take precedence over spark.executor.instances configured in spark-defaults.conf, and vice versa.

Jonathan Kelly
Elastic MapReduce - SDE
Blackfoot (SEA33) 06.850.F0

From: Jonathan Kelly <jonat...@amazon.com>
Date: Tuesday, July 14, 2015 at 4:23 PM
To: user@spark.apache.org
Subject: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf
Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf
bump

From: Jonathan Kelly <jonat...@amazon.com>
Date: Tuesday, July 14, 2015 at 4:23 PM
To: user@spark.apache.org
Subject: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf
Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf
Hi Jonathan,

This is a problem that has come up for us as well, because we'd like dynamic allocation to be turned on by default in some setups, but not break existing users with these properties. I'm hoping to figure out a way to reconcile these by Spark 1.5.

-Sandy

On Wed, Jul 15, 2015 at 3:18 PM, Kelly, Jonathan <jonat...@amazon.com> wrote:
> Would there be any problem in having spark.executor.instances (or --num-executors) be
> completely ignored (i.e., even for non-zero values) if spark.dynamicAllocation.enabled
> is true, rather than throwing an exception?
Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf
Yeah, we could make it log a warning instead.

2015-07-15 14:29 GMT-07:00 Kelly, Jonathan <jonat...@amazon.com>:
> Thanks! Is there an existing JIRA I should watch?
>
> ~ Jonathan
>
> From: Sandy Ryza <sandy.r...@cloudera.com>
> Date: Wednesday, July 15, 2015 at 2:27 PM
> Subject: Re: Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf
>
> Hi Jonathan,
>
> This is a problem that has come up for us as well, because we'd like dynamic allocation
> to be turned on by default in some setups, but not break existing users with these
> properties. I'm hoping to figure out a way to reconcile these by Spark 1.5.
>
> -Sandy
Unable to use dynamicAllocation if spark.executor.instances is set in spark-defaults.conf
I've set up my cluster with a pre-calculated value for spark.executor.instances in spark-defaults.conf such that I can run a job and have it maximize the utilization of the cluster resources by default. However, if I want to run a job with dynamicAllocation (by passing -c spark.dynamicAllocation.enabled=true to spark-submit), I get this exception:

Exception in thread main java.lang.IllegalArgumentException: Explicitly setting the number of executors is not compatible with spark.dynamicAllocation.enabled!
        at org.apache.spark.deploy.yarn.ClientArguments.parseArgs(ClientArguments.scala:192)
        at org.apache.spark.deploy.yarn.ClientArguments.<init>(ClientArguments.scala:59)
        at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:54)
        ...

The exception makes sense, of course, but ideally I would like it to ignore what I've put in spark-defaults.conf for spark.executor.instances if I've enabled dynamicAllocation. The most annoying thing is that if spark.executor.instances is present in spark-defaults.conf, I cannot figure out any way to spark-submit a job with spark.dynamicAllocation.enabled=true without getting this error. That is, even if I pass -c spark.executor.instances=0 -c spark.dynamicAllocation.enabled=true, I still get this error, because the validation in ClientArguments.parseArgs() simply checks for the presence of spark.executor.instances rather than whether or not its value is 0.

Should the check be changed to allow spark.executor.instances to be set to 0 if spark.dynamicAllocation.enabled is true? That would be an OK compromise, but I'd really prefer to be able to enable dynamicAllocation simply by setting spark.dynamicAllocation.enabled=true rather than by also having to set spark.executor.instances to 0.

Thanks,
Jonathan
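Until the validation changes, one hedged workaround follows from how spark-submit loads properties: conf/spark-defaults.conf is only read when no --properties-file is given, so a second properties file without spark.executor.instances can be used for dynamic jobs. The file names below are placeholders:

    # conf/spark-dynamic.conf: a copy of spark-defaults.conf with the
    # spark.executor.instances line removed
    bin/spark-submit \
      --properties-file conf/spark-dynamic.conf \
      --conf spark.dynamicAllocation.enabled=true \
      my_job.jar

Note that on YARN, dynamic allocation also requires the external shuffle service (spark.shuffle.service.enabled=true).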
Re: Difference between spark-defaults.conf and SparkConf.set
Thanks. Without spark-submit, it seems the more straightforward solution is to just pass it on the driver's classpath. I was more surprised that the same conf parameter had different behavior depending on where it's specified: program vs. spark-defaults. I'm all set now. Thanks for replying.

-------- Original message --------
From: Akhil Das <ak...@sigmoidanalytics.com>
Date: 07/01/2015 2:27 AM (GMT-05:00)
To: Yana Kadiyska <yana.kadiy...@gmail.com>
Cc: user@spark.apache.org
Subject: Re: Difference between spark-defaults.conf and SparkConf.set

.addJar works for me when I run it as a stand-alone application (without using spark-submit).

Thanks
Best Regards
Re: Difference between spark-defaults.conf and SparkConf.set
.addJar works for me when I run it as a stand-alone application (without using spark-submit).

Thanks
Best Regards

On Tue, Jun 30, 2015 at 7:47 PM, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:
> Hi folks, running into a pretty strange issue: I'm setting spark.executor.extraClassPath
> and spark.driver.extraClassPath to point to some external JARs. If I set them in
> spark-defaults.conf everything works perfectly, but if I set the same values on a
> SparkConf in code I get ClassNotFound exceptions from the Hadoop Conf.
Difference between spark-defaults.conf and SparkConf.set
Hi folks, running into a pretty strange issue: I'm setting spark.executor.extraClassPath and spark.driver.extraClassPath to point to some external JARs. If I set them in spark-defaults.conf, everything works perfectly. However, if I remove spark-defaults.conf and just create a SparkConf and call:

.set("spark.executor.extraClassPath", ...)
.set("spark.driver.extraClassPath", ...)

I get ClassNotFound exceptions from the Hadoop Conf:

Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.ceph.CephFileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:1493)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:1585)

This seems like a bug to me -- or does spark-defaults.conf somehow get processed differently? I have dumped out sparkConf.toDebugString, and in both cases (spark-defaults.conf / in-code sets) it seems to have the same values in it...
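A likely explanation, offered as a hedged guess rather than a confirmed diagnosis: spark.driver.extraClassPath must be known before the driver JVM starts, so spark-defaults.conf and spark-submit flags can honour it, but a SparkConf built inside an already-running driver cannot retroactively change that JVM's classpath. The command-line equivalent would look like this (the jar path is a placeholder):

    bin/spark-submit \
      --driver-class-path /opt/libs/ceph-hadoop.jar \
      --conf spark.executor.extraClassPath=/opt/libs/ceph-hadoop.jar \
      my_app.jar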
Re: spark-defaults.conf
So no takers regarding why spark-defaults.conf is not being picked up. Here is another one: if ZooKeeper is configured in Spark, why do we need to start a slave like this:

spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh 1 spark://somemaster:7077

i.e., why do we need to specify the master URL explicitly? Shouldn't Spark just consult ZK and use the active master? Or is ZK only used during failure?

On Mon, Apr 27, 2015 at 1:53 PM, James King <jakwebin...@gmail.com> wrote:
> Thanks. I've set SPARK_HOME and SPARK_CONF_DIR appropriately in .bash_profile, but when
> I start the worker with start-slave.sh I still get "failed to launch
> org.apache.spark.deploy.worker.Worker".
spark-defaults.conf
I renamed spark-defaults.conf.template to spark-defaults.conf and invoked:

spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh

But I still get "failed to launch org.apache.spark.deploy.worker.Worker":

--properties-file FILE   Path to a custom Spark properties file.
                         Default is conf/spark-defaults.conf.

But I'm thinking it should pick up the default spark-defaults.conf from the conf dir. Am I expecting or doing something wrong?

Regards
jk
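A plausible reading of that output, hedged since the full usage text isn't shown: the "--properties-file" line is part of the Worker's usage/help dump, which it prints when required arguments are missing, before spark-defaults.conf is ever consulted. In Spark 1.3.x, start-slave.sh expects a worker number and the master URL, as in the invocation quoted earlier in this thread:

    # usage: start-slave.sh <worker#> <spark-master-URL>
    spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh 1 spark://somemaster:7077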
Re: spark-defaults.conf
You should distribute your configuration file to workers and set the appropriate environment variables, like HADOOP_HOME, SPARK_HOME, HADOOP_CONF_DIR, SPARK_CONF_DIR.

On Mon, Apr 27, 2015 at 12:56 PM, James King <jakwebin...@gmail.com> wrote:
> I renamed spark-defaults.conf.template to spark-defaults.conf and invoked start-slave.sh,
> but I still get "failed to launch org.apache.spark.deploy.worker.Worker".
Re: spark-defaults.conf
Thanks. I've set SPARK_HOME and SPARK_CONF_DIR appropriately in .bash_profile, but when I start the worker like this:

spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh

I still get "failed to launch org.apache.spark.deploy.worker.Worker":

Default is conf/spark-defaults.conf.
15/04/27 11:51:33 DEBUG Utils: Shutdown hook called

On Mon, Apr 27, 2015 at 1:15 PM, Zoltán Zvara <zoltan.zv...@gmail.com> wrote:
> You should distribute your configuration file to workers and set the appropriate
> environment variables, like HADOOP_HOME, SPARK_HOME, HADOOP_CONF_DIR, SPARK_CONF_DIR.
Spark 1.2, trying to run spark-history as a service, spark-defaults.conf is ignored
Here is a related problem: http://apache-spark-user-list.1001560.n3.nabble.com/Launching-history-server-problem-td12574.html, but no answer.

What I'm trying to do: wrap spark-history with an /etc/init.d script.

Problem I have: I can't make it read spark-defaults.conf. I've put the file here:

/etc/spark/conf
/usr/lib/spark/conf

where /usr/lib/spark is the location for Spark. No luck: spark-history uses the default value for the application log location; it doesn't read the value specified in spark-defaults.conf.
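A hedged sketch of an init.d wrapper that sidesteps the problem. SPARK_CONF_DIR tells the launch scripts where the conf directory lives, and SPARK_HISTORY_OPTS is the documented way to pass settings straight to the history server daemon; all paths here are assumptions:

    #!/bin/sh
    # /etc/init.d/spark-history (fragment)
    export SPARK_CONF_DIR=/etc/spark/conf
    # Belt and braces: set the log directory explicitly as well
    export SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=hdfs:///var/log/spark/apps"
    /usr/lib/spark/sbin/start-history-server.sh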
Can value in spark-defaults.conf support system variables?
Hi all,

Can a value in spark-defaults.conf use system variables, such as mess = ${user.home}/${user.name}?

Best Regards
Zhanfeng Huo
Re: Can value in spark-defaults.conf support system variables?
No, not currently.

2014-09-01 2:53 GMT-07:00 Zhanfeng Huo <huozhanf...@gmail.com>:
> Hi all, can a value in spark-defaults.conf use system variables, such as
> mess = ${user.home}/${user.name}?
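Since the file is read as literal Java properties, a hedged workaround is to let the shell do the substitution at submit time instead of expecting Spark to expand it. The property and path below are illustrative only:

    # The shell expands the variables before spark-submit sees them
    bin/spark-submit \
      --conf spark.eventLog.dir=file://${HOME}/spark/logs \
      my_app.jar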
Issues on spark-shell and spark-submit behave differently on spark-defaults.conf parameter spark.eventLog.dir
Hi All,

Not sure if anyone has run into this problem, but it exists in Spark 1.0.0 when you specify the location in conf/spark-defaults.conf for spark.eventLog.dir:

spark.eventLog.dir hdfs:///user/$USER/spark/logs

intending to use the $USER env variable. For example, I'm running the command as user 'test'. In spark-submit, the folder is created on the fly and you see the event logs created on HDFS under /user/test/spark/logs/spark-pi-1405097484152. But in spark-shell, the user 'test' folder is not created; instead you see /user/$USER/spark/logs on HDFS, i.e. it tries to create /user/$USER/spark/logs rather than /user/test/spark/logs.

It looks like spark-shell couldn't pick up the env variable $USER to apply to the eventLog directory for the running user 'test'. Is this considered a bug, or bad practice to use spark-shell with Spark's HistoryServer?
Re: Issues on spark-shell and spark-submit behave differently on spark-defaults.conf parameter spark.eventLog.dir
Hi Andrew,

It's definitely not bad practice to use spark-shell with the HistoryServer. The issue here is not with spark-shell but with the way we pass Spark configs to the application: spark-defaults.conf does not currently support embedding environment variables, and instead interprets everything as a string literal. You will have to manually specify "test" instead of "$USER" in the path you provide to spark.eventLog.dir.

-Andrew

2014-07-28 12:40 GMT-07:00 Andrew Lee <alee...@hotmail.com>:
> Not sure if anyone has run into this problem, but it exists in Spark 1.0.0 when you
> specify spark.eventLog.dir as hdfs:///user/$USER/spark/logs in conf/spark-defaults.conf,
> intending to use the $USER env variable.
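A hedged sketch of a per-user workaround that keeps a single template: render the template with the real user name at launch time and point spark-shell at the result via --properties-file (spark-shell forwards these options to spark-submit; the template path is an assumption):

    # Substitute the literal $USER token with the invoking user
    sed "s#\$USER#$(whoami)#g" /etc/spark/spark-defaults.conf.tmpl \
        > "$HOME/.spark-defaults.conf"
    bin/spark-shell --properties-file "$HOME/.spark-defaults.conf"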
RE: Issues on spark-shell and spark-submit behave differently on spark-defaults.conf parameter spark.eventLog.dir
Hi Andrew,

Thanks for re-confirming the problem. I thought it only happened to my own build. :)

By the way, we have multiple users using spark-shell to explore their datasets, and we are continuously looking into ways to isolate their job histories. In the current situation, we can't really ask them each to create their own spark-defaults.conf, since it is set to read-only. A workaround is to set the log directory to a shared folder, e.g. /user/spark/logs with permission 1777. This isn't really ideal, since other people can see what other jobs are running on the shared cluster. It would be nice to have better security here, so people aren't exposing their algorithms (which are usually embedded in their jobs' names) to other users.

Will there be, or is there, a JIRA ticket to track this? Any plan to enhance this part of spark-shell?

Date: Mon, 28 Jul 2014 13:54:56 -0700
Subject: Re: Issues on spark-shell and spark-submit behave differently on spark-defaults.conf parameter spark.eventLog.dir
From: and...@databricks.com
To: user@spark.apache.org

> It's definitely not bad practice to use spark-shell with the HistoryServer. The issue
> here is not with spark-shell but with the way we pass Spark configs to the application:
> spark-defaults.conf does not currently support embedding environment variables, and
> instead interprets everything as a string literal.