Re: Can't submit job to standalone cluster
On 12/28/15, 5:16 PM, "Daniel Valdivia" wrote:

> Hi,
>
> I'm trying to submit a job to a small Spark cluster running in standalone
> mode, however it seems like the jar file I'm submitting to the cluster is
> "not found" by the worker nodes.
>
> I might have understood wrong, but I thought the Driver node would send
> this jar file to the worker nodes, or should I manually send this file to
> each worker node before I submit the job?

Yes, you have misunderstood, but so did I. The problem is that --deploy-mode cluster runs the Driver on the cluster as well, and you don't know ahead of time which node it's going to run on, so every node needs access to the JAR. spark-submit does not pass the JAR along to the Driver, but the Driver will pass it to the executors. I ended up putting the JAR in HDFS and passing an hdfs:// path to spark-submit.

This is a subtle difference from Spark on YARN, which does pass the JAR along to the Driver automatically, and IMO it should probably be fixed in spark-submit. It's really confusing for newcomers.

Another problem I ran into, and you might too, is that --packages doesn't work with --deploy-mode cluster. It downloads the packages to a temporary location on the node running spark-submit, then passes those paths along to the node that runs the Driver, but since that isn't the same machine, it can't find anything and fails. The driver process *should* be the one doing the downloading, but it isn't. I ended up having to create a fat JAR with all of the dependencies to get around that one.

Greg
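P.S. For anyone who finds this in the archive later, a rough sketch of the HDFS workaround. The paths, master host, and class name here are made up for illustration; substitute your own:

    # put the application JAR somewhere every node can read it
    hdfs dfs -mkdir -p /apps/myapp
    hdfs dfs -put myapp-assembly.jar /apps/myapp/

    # submit with an hdfs:// path so whichever node runs the Driver can fetch it
    spark-submit \
      --master spark://master:7077 \
      --deploy-mode cluster \
      --class com.example.MyApp \
      hdfs:///apps/myapp/myapp-assembly.jar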
spark-submit problems with --packages and --deploy-mode cluster
I'm using Spark 1.5.0 with the standalone scheduler, and for the life of me I can't figure out why this isn't working. I have an application that works fine with --deploy-mode client that I'm trying to get to run in cluster mode so I can use --supervise. I ran into a few issues with my configuration that I had to sort out (classpath stuff mostly), but now I'm stumped.

We rely on the Databricks spark-csv plugin. We're loading that using --packages "com.databricks:spark-csv_2.11:1.2.0". This works without issue in client mode, but when run in cluster mode, it tries to load the spark-csv jar from /root/.ivy2 and fails because that folder doesn't exist on the slave node that ends up running the driver.

Does --packages not work when the driver is loaded on the cluster? Does it download the jars on the client before loading the driver on the cluster, without passing along the downloaded JARs?

Here's my stderr output: https://gist.github.com/jimbobhickville/1f10b3508ef946eccb92

Thanks in advance for any suggestions.

Greg
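P.S. For reference, a sketch of roughly how I'm invoking it (the master URL, class, and jar path are changed for illustration):

    spark-submit \
      --master spark://master:7077 \
      --deploy-mode cluster \
      --supervise \
      --packages "com.databricks:spark-csv_2.11:1.2.0" \
      --class com.example.OurApp \
      /local/path/to/ourapp.jar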
Re: SPARK_SUBMIT_CLASSPATH question
I guess I was a little light on the details in my haste. I'm using Spark on YARN, and this is in the driver process in yarn-client mode (most notably spark-shell). I've had to manually add a bunch of JARs that I had thought it would just pick up like everything else does:

export SPARK_SUBMIT_LIBRARY_PATH=/usr/lib/hadoop/lib/native:/usr/lib/hadoop/lib/native/Linux-amd64-64:$SPARK_SUBMIT_LIBRARY_PATH
export SPARK_SUBMIT_CLASSPATH=/usr/lib/hadoop/lib/hadoop-openstack-2.4.0.jar:/usr/lib/hadoop/lib/jackson-core-asl-1.8.8.jar:/usr/lib/spark-yarn/lib/datanucleus-api-jdo-3.2.6.jar:/usr/lib/spark-yarn/lib/datanucleus-core-3.2.10.jar:/usr/lib/spark-yarn/lib/datanucleus-rdbms-3.2.9.jar:/usr/lib/hadoop/lib/hadoop-lzo-0.6.0.jar:$SPARK_SUBMIT_CLASSPATH

The lzo jar and the SPARK_SUBMIT_LIBRARY_PATH were required to get anything at all to work. Without them, basic communication failed because it couldn't load the lzo library to compress/decompress the data. The datanucleus stuff was required for Hive on Spark, and the hadoop-openstack and jackson jars are for the swiftfs hdfs plugin to work from within spark-shell.

I tried stuff like:

export SPARK_SUBMIT_CLASSPATH=/usr/lib/hadoop/lib/*

But that didn't work at all. I have to specify every individual jar like that. Is there something I'm missing or some easier way to accomplish this? I'm worried that I'll keep finding more missing dependencies as we explore other features, and the classpath string is going to take up a whole screen.

Greg

From: Greg <greg.h...@rackspace.com>
Date: Tuesday, October 14, 2014 1:57 PM
To: user@spark.apache.org
Subject: SPARK_SUBMIT_CLASSPATH question

It seems to me that SPARK_SUBMIT_CLASSPATH does not support wildcards in the paths you add, the way other JVM tools do. For some reason it doesn't pick up the classpath information from yarn-site.xml either, it seems, when running on YARN. I'm having to manually add every single dependency JAR. There must be a better way, so what am I missing?

Greg
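P.S. The one thing that has made this less painful for me so far is expanding the glob in the shell before exporting, since the variable itself doesn't expand wildcards. A sketch (with the caveat that you pick up every jar in the directory, needed or not):

    # build a colon-separated list from the glob ourselves
    HADOOP_JARS=$(echo /usr/lib/hadoop/lib/*.jar | tr ' ' ':')
    export SPARK_SUBMIT_CLASSPATH=$HADOOP_JARS:$SPARK_SUBMIT_CLASSPATH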
SPARK_SUBMIT_CLASSPATH question
It seems to me that SPARK_SUBMIT_CLASSPATH does not support wildcards in the paths you add, the way other JVM tools do. For some reason it doesn't pick up the classpath information from yarn-site.xml either, it seems, when running on YARN. I'm having to manually add every single dependency JAR. There must be a better way, so what am I missing?

Greg
Re: Spark on YARN driver memory allocation bug?
$MASTER is 'yarn-cluster' in spark-env.sh

spark-submit --driver-memory 12424m --class org.apache.spark.examples.SparkPi /usr/lib/spark-yarn/lib/spark-examples*.jar 1000

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0006fd28, 4342677504, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 4342677504 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/jvm-3525/hs_error.log

From: Andrew Or <and...@databricks.com>
Date: Wednesday, October 8, 2014 3:25 PM
To: Greg <greg.h...@rackspace.com>
Cc: user@spark.apache.org
Subject: Re: Spark on YARN driver memory allocation bug?

Hi Greg,

It does seem like a bug. What is the particular exception message that you see?

Andrew

2014-10-08 12:12 GMT-07:00 Greg Hill <greg.h...@rackspace.com>:

So, I think this is a bug, but I wanted to get some feedback before I reported it as such. On Spark on YARN 1.1.0, if you specify a --driver-memory value higher than the memory available on the client machine, Spark errors out due to failing to allocate enough memory. This happens even in yarn-cluster mode. Shouldn't it only allocate that memory on the YARN node that is going to run the driver process, not on the local client machine?

Greg
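P.S. One thing I'm going to try, based on Andrew's note elsewhere on this list that Spark 1.1+ reads spark.driver.memory directly: drop the --driver-memory flag, which appears to be what sizes the local client JVM, and set the value in conf/spark-defaults.conf instead, so only the YARN container that runs the driver gets sized from it. Untested on my end as of this writing:

    # conf/spark-defaults.conf
    spark.driver.memory    12424m

    spark-submit --class org.apache.spark.examples.SparkPi /usr/lib/spark-yarn/lib/spark-examples*.jar 1000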
Spark on YARN driver memory allocation bug?
So, I think this is a bug, but I wanted to get some feedback before I reported it as such. On Spark on YARN 1.1.0, if you specify a --driver-memory value higher than the memory available on the client machine, Spark errors out due to failing to allocate enough memory. This happens even in yarn-cluster mode. Shouldn't it only allocate that memory on the YARN node that is going to run the driver process, not on the local client machine?

Greg
Re: Spark with YARN
Do you have YARN_CONF_DIR set in your environment to point Spark to where your YARN configs are?

Greg

From: Raghuveer Chanda <raghuveer.cha...@gmail.com>
Date: Wednesday, September 24, 2014 12:25 PM
To: u...@spark.incubator.apache.org
Subject: Spark with YARN

Hi,

I'm new to Spark and facing a problem running a job on a cluster using YARN. Initially I ran jobs using the Spark master as --master spark://dml2:7077 and it ran fine on 3 workers. But now I'm shifting to YARN, so I installed YARN in Cloudera on a 3-node cluster and changed the master to yarn-cluster, but it is not working. I attached screenshots of the UI, which is not progressing and just hanging.

./spark-submit --class class-name --master yarn-cluster --num-executors 3 --executor-cores 3 jar-with-dependencies.jar

Do I need to configure YARN, or why is it not getting all the workers? Please help. This output on the terminal keeps repeating:

14/09/24 22:44:21 INFO yarn.Client: Application report from ASM:
 application identifier: application_1411578463780_0001
 appId: 1
 clientToAMToken: null
 appDiagnostics:
 appMasterHost: dml3
 appQueue: root.chanda
 appMasterRpcPort: 0
 appStartTime: 1411578513545
 yarnAppState: RUNNING
 distributedFinalState: UNDEFINED
 appTrackingUrl: http://dml2:8088/proxy/application_1411578463780_0001/
 appUser: chanda

14/09/24 22:44:22 INFO yarn.Client: Application report from ASM:
 application identifier: application_1411578463780_0001
 appId: 1
 clientToAMToken: null
 appDiagnostics:
 appMasterHost: dml3
 appQueue: root.chanda
 appMasterRpcPort: 0
 appStartTime: 1411578513545
 yarnAppState: RUNNING
 distributedFinalState: UNDEFINED
 appTrackingUrl: http://dml2:8088/proxy/application_1411578463780_0001/

--
Regards,
Raghuveer Chanda
4th year Undergraduate Student
Computer Science and Engineering
IIT Kharagpur
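P.S. To be concrete, that just means something like the following in your shell (or spark-env.sh) before running spark-submit. /etc/hadoop/conf is a typical location, but use wherever your yarn-site.xml actually lives:

    export YARN_CONF_DIR=/etc/hadoop/conf
    ./spark-submit --class class-name --master yarn-cluster --num-executors 3 --executor-cores 3 jar-with-dependencies.jar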
Re: clarification for some spark on yarn configuration options
Thanks for looking into it. I'm trying to avoid making the user pass in any parameters by configuring it to use the right values for the cluster size by default, hence my reliance on the configuration. I'd rather just use spark-defaults.conf than the environment variables, and looking at the code you modified, I don't see any place it's picking up spark.driver.memory either. Is that a separate bug?

Greg

From: Andrew Or <and...@databricks.com>
Date: Monday, September 22, 2014 8:11 PM
To: Nishkam Ravi <nr...@cloudera.com>
Cc: Greg <greg.h...@rackspace.com>, user@spark.apache.org
Subject: Re: clarification for some spark on yarn configuration options

Hi Greg,

From browsing the code quickly I believe SPARK_DRIVER_MEMORY is not actually picked up in cluster mode. This is a bug and I have opened a PR to fix it: https://github.com/apache/spark/pull/2500. For now, please use --driver-memory instead, which should work for both client and cluster mode.

Thanks for pointing this out,
-Andrew

2014-09-22 14:04 GMT-07:00 Nishkam Ravi <nr...@cloudera.com>:

Maybe try --driver-memory if you are using spark-submit?

Thanks,
Nishkam

On Mon, Sep 22, 2014 at 1:41 PM, Greg Hill <greg.h...@rackspace.com> wrote:

Ah, I see. It turns out that my problem is that that comparison is ignoring SPARK_DRIVER_MEMORY and comparing to the default of 512m. Is that a bug that's since been fixed? I'm on 1.0.1 and using 'yarn-cluster' as the master. 'yarn-client' seems to pick up the values and works fine.

Greg

From: Nishkam Ravi <nr...@cloudera.com>
Date: Monday, September 22, 2014 3:30 PM
To: Greg <greg.h...@rackspace.com>
Cc: Andrew Or <and...@databricks.com>, user@spark.apache.org
Subject: Re: clarification for some spark on yarn configuration options

Greg, if you look carefully, the code is enforcing that the memoryOverhead be lower (and not higher) than spark.driver.memory.

Thanks,
Nishkam

On Mon, Sep 22, 2014 at 1:26 PM, Greg Hill <greg.h...@rackspace.com> wrote:

I thought I had this all figured out, but I'm getting some weird errors now that I'm attempting to deploy this on production-size servers. It's complaining that I'm not allocating enough memory to the memoryOverhead values. I tracked it down to this code:

https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70

Unless I'm reading it wrong, those checks are enforcing that you set spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but that makes no sense to me, since that memory is just supposed to be what YARN needs on top of what you're allocating for Spark. My understanding was that the overhead values should be quite a bit lower (and by default they are). Also, why must the executor be allocated less memory than the driver's memory overhead value? What am I misunderstanding here?

Greg

From: Andrew Or <and...@databricks.com>
Date: Tuesday, September 9, 2014 5:49 PM
To: Greg <greg.h...@rackspace.com>
Cc: user@spark.apache.org
Subject: Re: clarification for some spark on yarn configuration options

Hi Greg,

SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The equivalent spark.executor.instances is just another way to set the same thing in your spark-defaults.conf. Maybe this should be documented. :)

spark.yarn.executor.memoryOverhead is just an additional margin added to spark.executor.memory for the container. In addition to the executor's memory, the container in which the executor is launched needs some extra memory for system processes, and this is what this overhead (somewhat of a misnomer) is for. The same goes for the driver equivalent.

spark.driver.memory behaves differently depending on which version of Spark you are using. If you are using Spark 1.1+ (this was released very recently), you can directly set spark.driver.memory and this will take effect. Otherwise, setting this doesn't actually do anything for client deploy mode, and you have two alternatives: (1) set the environment variable equivalent SPARK_DRIVER_MEMORY in spark-env.sh, and (2) if you are using Spark submit (or bin/spark-shell, or bin/pyspark, which go through bin/spark-submit), pass the --driver-memory command line argument.

If you want your PySpark application (driver) to pick up extra class path, you can pass the --driver-class-path to Spark submit. If you are using Spark 1.1+, you may set spark.driver.extraClassPath in your spark-defaults.conf.
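P.S. For the archive, a sketch of the spark-defaults.conf layout this thread is talking about. The numbers are purely illustrative, not recommendations:

    spark.executor.instances              3
    spark.executor.memory                 2g
    spark.yarn.executor.memoryOverhead    384
    spark.driver.memory                   1g
    spark.yarn.driver.memoryOverhead      384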
recommended values for spark driver memory?
I know the recommendation is "it depends", but can people share what sort of memory allocations they're using for their driver processes? I'd like to get an idea of what the range looks like so we can provide sensible defaults without necessarily knowing what the jobs will look like. The customer can then tweak that if they need to for their particular job.

Thanks in advance.

Greg
Re: clarification for some spark on yarn configuration options
I thought I had this all figured out, but I'm getting some weird errors now that I'm attempting to deploy this on production-size servers. It's complaining that I'm not allocating enough memory to the memoryOverhead values. I tracked it down to this code:

https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70

Unless I'm reading it wrong, those checks are enforcing that you set spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but that makes no sense to me, since that memory is just supposed to be what YARN needs on top of what you're allocating for Spark. My understanding was that the overhead values should be quite a bit lower (and by default they are). Also, why must the executor be allocated less memory than the driver's memory overhead value? What am I misunderstanding here?

Greg

From: Andrew Or <and...@databricks.com>
Date: Tuesday, September 9, 2014 5:49 PM
To: Greg <greg.h...@rackspace.com>
Cc: user@spark.apache.org
Subject: Re: clarification for some spark on yarn configuration options

Hi Greg,

SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The equivalent spark.executor.instances is just another way to set the same thing in your spark-defaults.conf. Maybe this should be documented. :)

spark.yarn.executor.memoryOverhead is just an additional margin added to spark.executor.memory for the container. In addition to the executor's memory, the container in which the executor is launched needs some extra memory for system processes, and this is what this overhead (somewhat of a misnomer) is for. The same goes for the driver equivalent.

spark.driver.memory behaves differently depending on which version of Spark you are using. If you are using Spark 1.1+ (this was released very recently), you can directly set spark.driver.memory and this will take effect. Otherwise, setting this doesn't actually do anything for client deploy mode, and you have two alternatives: (1) set the environment variable equivalent SPARK_DRIVER_MEMORY in spark-env.sh, and (2) if you are using Spark submit (or bin/spark-shell, or bin/pyspark, which go through bin/spark-submit), pass the --driver-memory command line argument.

If you want your PySpark application (driver) to pick up extra class path, you can pass the --driver-class-path to Spark submit. If you are using Spark 1.1+, you may set spark.driver.extraClassPath in your spark-defaults.conf. There is also an environment variable you could set (SPARK_CLASSPATH), though this is now deprecated.

Let me know if you have more questions about these options,
-Andrew

2014-09-08 6:59 GMT-07:00 Greg Hill <greg.h...@rackspace.com>:

Is SPARK_EXECUTOR_INSTANCES the total number of workers in the cluster or the number of workers per slave node? Is spark.executor.instances an actual config option? I found that in a commit, but it's not in the docs.

What is the difference between spark.yarn.executor.memoryOverhead and spark.executor.memory? Same question for the 'driver' variant, but I assume it's the same answer.

Is there a spark.driver.memory option that's undocumented, or do you have to use the environment variable SPARK_DRIVER_MEMORY?

What config option or environment variable do I need to set to get pyspark interactive to pick up the YARN class path? The ones that work for spark-shell and spark-submit don't seem to work for pyspark.

Thanks in advance.

Greg
Re: clarification for some spark on yarn configuration options
Gah, ignore me again. I was reading the logic backwards. For some reason it isn't picking up my SPARK_DRIVER_MEMORY environment variable and is using the default of 512m. Probably an environmental issue.

Greg

From: Greg <greg.h...@rackspace.com>
Date: Monday, September 22, 2014 3:26 PM
To: Andrew Or <and...@databricks.com>
Cc: user@spark.apache.org
Subject: Re: clarification for some spark on yarn configuration options

I thought I had this all figured out, but I'm getting some weird errors now that I'm attempting to deploy this on production-size servers. It's complaining that I'm not allocating enough memory to the memoryOverhead values. I tracked it down to this code:

https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70

Unless I'm reading it wrong, those checks are enforcing that you set spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but that makes no sense to me, since that memory is just supposed to be what YARN needs on top of what you're allocating for Spark. My understanding was that the overhead values should be quite a bit lower (and by default they are). Also, why must the executor be allocated less memory than the driver's memory overhead value? What am I misunderstanding here?

Greg

From: Andrew Or <and...@databricks.com>
Date: Tuesday, September 9, 2014 5:49 PM
To: Greg <greg.h...@rackspace.com>
Cc: user@spark.apache.org
Subject: Re: clarification for some spark on yarn configuration options

Hi Greg,

SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The equivalent spark.executor.instances is just another way to set the same thing in your spark-defaults.conf. Maybe this should be documented. :)

spark.yarn.executor.memoryOverhead is just an additional margin added to spark.executor.memory for the container. In addition to the executor's memory, the container in which the executor is launched needs some extra memory for system processes, and this is what this overhead (somewhat of a misnomer) is for. The same goes for the driver equivalent.

spark.driver.memory behaves differently depending on which version of Spark you are using. If you are using Spark 1.1+ (this was released very recently), you can directly set spark.driver.memory and this will take effect. Otherwise, setting this doesn't actually do anything for client deploy mode, and you have two alternatives: (1) set the environment variable equivalent SPARK_DRIVER_MEMORY in spark-env.sh, and (2) if you are using Spark submit (or bin/spark-shell, or bin/pyspark, which go through bin/spark-submit), pass the --driver-memory command line argument.

If you want your PySpark application (driver) to pick up extra class path, you can pass the --driver-class-path to Spark submit. If you are using Spark 1.1+, you may set spark.driver.extraClassPath in your spark-defaults.conf. There is also an environment variable you could set (SPARK_CLASSPATH), though this is now deprecated.

Let me know if you have more questions about these options,
-Andrew

2014-09-08 6:59 GMT-07:00 Greg Hill <greg.h...@rackspace.com>:

Is SPARK_EXECUTOR_INSTANCES the total number of workers in the cluster or the number of workers per slave node? Is spark.executor.instances an actual config option? I found that in a commit, but it's not in the docs.

What is the difference between spark.yarn.executor.memoryOverhead and spark.executor.memory? Same question for the 'driver' variant, but I assume it's the same answer.

Is there a spark.driver.memory option that's undocumented, or do you have to use the environment variable SPARK_DRIVER_MEMORY?

What config option or environment variable do I need to set to get pyspark interactive to pick up the YARN class path? The ones that work for spark-shell and spark-submit don't seem to work for pyspark.

Thanks in advance.

Greg
Re: spark on yarn history server + hdfs permissions issue
To answer my own question, in case someone else runs into this: the spark user needs to be in the same group on the namenode, and HDFS caches that information for what seems like at least an hour. It magically started working on its own.

Greg

From: Greg <greg.h...@rackspace.com>
Date: Tuesday, September 9, 2014 2:30 PM
To: user@spark.apache.org
Subject: spark on yarn history server + hdfs permissions issue

I am running Spark on YARN with the HDP 2.1 technical preview. I'm having issues getting the Spark history server permission to read the Spark event logs from HDFS. Both sides are configured to write/read logs from:

hdfs:///apps/spark/events

The history server is running as user spark; the jobs are running as user lavaqe. Both users are in the hdfs group on all the nodes in the cluster.

That root logs folder is globally writable, but owned by the spark user:

drwxrwxrwx - spark hdfs 0 2014-09-09 18:19 /apps/spark/events

All good so far. Spark jobs create subfolders and put their event logs in there just fine. The problem is that the history server, running as the spark user, cannot read those logs. They're written as the user that initiates the job, but still in the same hdfs group:

drwxrwx--- - lavaqe hdfs 0 2014-09-09 19:24 /apps/spark/events/spark-pi-1410290714996

The files are group readable/writable, but this is the error I get:

Permission denied: user=spark, access=READ_EXECUTE, inode=/apps/spark/events/spark-pi-1410290714996:lavaqe:hdfs:drwxrwx---

So, two questions, I guess:

1. Do group permissions just plain not work in HDFS, or am I missing something?
2. Is there a way to tell Spark to log with more permissive permissions so the history server can read the generated logs?

Greg
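P.S. A quick way to check what groups the namenode actually resolves for each user, since that's what the permission check uses — run this on (or against) the namenode:

    hdfs groups spark
    hdfs groups lavaqe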
spark on yarn history server + hdfs permissions issue
I am running Spark on YARN with the HDP 2.1 technical preview. I'm having issues getting the Spark history server permission to read the Spark event logs from HDFS. Both sides are configured to write/read logs from:

hdfs:///apps/spark/events

The history server is running as user spark; the jobs are running as user lavaqe. Both users are in the hdfs group on all the nodes in the cluster.

That root logs folder is globally writable, but owned by the spark user:

drwxrwxrwx - spark hdfs 0 2014-09-09 18:19 /apps/spark/events

All good so far. Spark jobs create subfolders and put their event logs in there just fine. The problem is that the history server, running as the spark user, cannot read those logs. They're written as the user that initiates the job, but still in the same hdfs group:

drwxrwx--- - lavaqe hdfs 0 2014-09-09 19:24 /apps/spark/events/spark-pi-1410290714996

The files are group readable/writable, but this is the error I get:

Permission denied: user=spark, access=READ_EXECUTE, inode=/apps/spark/events/spark-pi-1410290714996:lavaqe:hdfs:drwxrwx---

So, two questions, I guess:

1. Do group permissions just plain not work in HDFS, or am I missing something?
2. Is there a way to tell Spark to log with more permissive permissions so the history server can read the generated logs?

Greg
Re: pyspark on yarn hdp hortonworks
I'm running into a problem getting this working as well. I have spark-submit and spark-shell working fine, but pyspark in interactive mode can't seem to find the lzo jar:

java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found

This is in /usr/lib/hadoop/lib/hadoop-lzo-0.6.0.jar, which is in my SPARK_CLASSPATH environment variable, but that doesn't seem to be picked up by pyspark. Any ideas? I can't find much in the way of docs on getting the environment right for pyspark.

Greg

From: Andrew Or <and...@databricks.com>
Date: Wednesday, September 3, 2014 4:19 PM
To: Oleg Ruchovets <oruchov...@gmail.com>
Cc: user@spark.apache.org
Subject: Re: pyspark on yarn hdp hortonworks

Hi Oleg,

There isn't much you need to do to set up a YARN cluster to run PySpark. You need to make sure all machines have Python installed, and... that's about it. Your assembly jar will be shipped to all containers along with all the pyspark and py4j files needed. One caveat, however, is that the jar needs to be built with Maven and not on a Red Hat-based OS:

http://spark.apache.org/docs/latest/building-with-maven.html#building-for-pyspark-on-yarn

In addition, it should be built with Java 6 because of a known issue with building jars with Java 7 and including python files in them (https://issues.apache.org/jira/browse/SPARK-1718).

Lastly, if you have trouble getting it to work, you can follow the steps I have listed in a different thread to figure out what's wrong:

http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3ccamjob8mr1+ias-sldz_rfrke_na2uubnmhrac4nukqyqnun...@mail.gmail.com%3e

Let me know if you can get it working,
-Andrew

2014-09-03 5:03 GMT-07:00 Oleg Ruchovets <oruchov...@gmail.com>:

Hi all.

I have been trying to run pyspark on YARN for a couple of days now:

http://hortonworks.com/kb/spark-1-0-1-technical-preview-hdp-2-1-3/

I posted the exception in previous posts. It looks like I didn't do the configuration correctly. I've googled quite a lot and I can't find the steps that should be done to configure PySpark to run on YARN. Can you please share the steps (critical points) that should be configured to use PySpark on YARN (Hortonworks distribution): environment variables, classpath, copying jars to all machines, other configuration.

Thanks,
Oleg
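P.S. One avenue I still want to try, per a suggestion elsewhere on this list: on Spark 1.1+ the class path can reportedly go in spark-defaults.conf instead of SPARK_CLASSPATH. Untested with pyspark on this HDP preview, so treat it as a sketch:

    spark.driver.extraClassPath      /usr/lib/hadoop/lib/hadoop-lzo-0.6.0.jar
    spark.executor.extraClassPath    /usr/lib/hadoop/lib/hadoop-lzo-0.6.0.jar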
spark history server trying to hit port 8021
My Spark history server won't start because it's trying to hit the namenode on port 8021, but the namenode is on 8020 (the default). How can I configure the history server to use the right port? I can't find any relevant setting in the docs:

http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/monitoring.html

Greg
Re: spark history server trying to hit port 8021
Nevermind, PEBKAC. I had put the wrong port in the $LOG_DIR environment variable.

Greg

From: Greg <greg.h...@rackspace.com>
Date: Wednesday, September 3, 2014 1:56 PM
To: user@spark.apache.org
Subject: spark history server trying to hit port 8021

My Spark history server won't start because it's trying to hit the namenode on port 8021, but the namenode is on 8020 (the default). How can I configure the history server to use the right port? I can't find any relevant setting in the docs:

http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/monitoring.html

Greg
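P.S. For the archive, the fix was just correcting the port in that variable — something along these lines (the host and path are from our setup; yours will differ):

    # before (wrong port)
    export LOG_DIR=hdfs://namenode:8021/apps/spark/events
    # after
    export LOG_DIR=hdfs://namenode:8020/apps/spark/events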
Spark on YARN question
I'm working on setting up Spark on YARN using the HDP technical preview:

http://hortonworks.com/kb/spark-1-0-1-technical-preview-hdp-2-1-3/

I have installed the Spark JARs on all the slave nodes and configured YARN to find the JARs. It seems like everything is working. Unless I'm misunderstanding, it seems like there isn't any configuration required on the YARN slave nodes at all, apart from telling YARN where to find the Spark JAR files. Do the YARN processes even pick up local Spark configuration files on the slave nodes, or is that all just pulled in on the client and passed along to YARN?

Greg
Re: Spark on YARN question
Thanks. That sounds like how I was thinking it worked. I did have to install the JARs on the slave nodes for yarn-cluster mode to work, FWIW. It's probably just whichever node ends up spawning the application master that needs them, but they weren't passed along from spark-submit.

Greg

From: Andrew Or <and...@databricks.com>
Date: Tuesday, September 2, 2014 11:05 AM
To: Matt Narrell <matt.narr...@gmail.com>
Cc: Greg <greg.h...@rackspace.com>, user@spark.apache.org
Subject: Re: Spark on YARN question

Hi Greg,

You should not even need to manually install Spark on each of the worker nodes or put it into HDFS yourself. Spark on YARN will ship all necessary jars (i.e. the assembly + additional jars) to each of the containers for you. You can specify additional jars that your application depends on through the --jars argument if you are using spark-submit / spark-shell / pyspark.

As for environment variables, you can set SPARK_YARN_USER_ENV on the driver node (where your application is submitted) to specify environment variables to be observed by your executors. If you are using the spark-submit / spark-shell / pyspark scripts, then you can set Spark properties in the conf/spark-defaults.conf properties file, and these will be propagated to the executors. In other words, configurations on the slave nodes don't do anything. For example:

$ vim conf/spark-defaults.conf  # set a few properties
$ export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2
$ bin/spark-shell --master yarn --jars /local/path/to/my/jar1,/another/jar2

Best,
-Andrew