Re: Can't submit job to stand alone cluster

2015-12-29 Thread Greg Hill


On 12/28/15, 5:16 PM, "Daniel Valdivia"  wrote:

>Hi,
>
>I'm trying to submit a job to a small Spark cluster running in
>standalone mode, but it seems like the jar file I'm submitting to the
>cluster is "not found" by the worker nodes.
>
>I might have misunderstood, but I thought the Driver node would send
>this jar file to the worker nodes, or should I manually send this file to
>each worker node before I submit the job?

Yes, you have misunderstood, but so did I.  The problem is that
--deploy-mode cluster runs the Driver on the cluster as well, and you
don't know which node it's going to run on, so every node needs access to
the JAR.  spark-submit does not pass the JAR along to the Driver, but the
Driver will pass it to the executors.  I ended up putting the JAR in HDFS
and passing an hdfs:// path to spark-submit.  This is a subtle difference
from Spark on YARN, which does pass the JAR along to the Driver
automatically, and IMO it should probably be fixed in spark-submit.  It's
really confusing for newcomers.
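
For example, the workaround boils down to something like this (paths, class
name, and master URL are placeholders for whatever your setup uses):

hdfs dfs -put myapp.jar /apps/myapp/myapp.jar
spark-submit --master spark://master-host:7077 --deploy-mode cluster \
  --class com.example.MyApp hdfs:///apps/myapp/myapp.jar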

Another problem I ran into, which you also might, is that --packages doesn't
work with --deploy-mode cluster.  It downloads the packages to a temporary
location on the node running spark-submit, then passes those paths to the
node that is running the Driver, but since that isn't the same machine, it
can't find anything and fails.  The driver process *should* be the one
doing the downloading, but it isn't.  I ended up having to create a fat JAR
with all of the dependencies to get around that one.
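
A minimal sketch of that workaround, assuming an sbt build with the
sbt-assembly plugin (names and paths below are illustrative, not our actual
build): mark spark-core as "provided" in build.sbt so it isn't bundled, keep
your real dependencies as normal ones, then:

sbt assembly
hdfs dfs -put target/scala-2.11/myapp-assembly-1.0.jar /apps/myapp/
spark-submit --master spark://master-host:7077 --deploy-mode cluster \
  --class com.example.MyApp hdfs:///apps/myapp/myapp-assembly-1.0.jar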

Greg





spark-submit problems with --packages and --deploy-mode cluster

2015-12-11 Thread Greg Hill
I'm using Spark 1.5.0 with the standalone scheduler, and for the life of me I 
can't figure out why this isn't working.  I have an application that works fine 
with --deploy-mode client that I'm trying to get to run in cluster mode so I 
can use --supervise.  I ran into a few issues with my configuration that I had 
to sort out (classpath stuff mostly), but now I'm stumped.  We rely on the 
Databricks spark-csv plugin, which we load using --packages 
"com.databricks:spark-csv_2.11:1.2.0".  This works without issue in client 
mode, but when run in cluster mode, it tries to load the spark-csv jar from 
/root/.ivy2 and fails because that folder doesn't exist on the slave node that 
ends up running the driver.  Does --packages not work when the driver is run 
on the cluster?  Does it download the JARs on the client before launching the 
driver on the cluster, without passing the downloaded JARs along?
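
For reference, the submit command looks roughly like this (master URL, class
name, and jar path are placeholders):

spark-submit --master spark://master-host:7077 --deploy-mode cluster \
  --supervise --packages "com.databricks:spark-csv_2.11:1.2.0" \
  --class com.example.MyApp /path/to/myapp.jar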

Here's my stderr output:

https://gist.github.com/jimbobhickville/1f10b3508ef946eccb92

Thanks in advance for any suggestions.

Greg



Re: SPARK_SUBMIT_CLASSPATH question

2014-10-15 Thread Greg Hill
I guess I was a little light on the details in my haste.  I'm using Spark on 
YARN, and this is in the driver process in yarn-client mode (most notably 
spark-shell).  I've had to manually add a bunch of JARs that I had thought it 
would just pick up like everything else does:

export SPARK_SUBMIT_LIBRARY_PATH=/usr/lib/hadoop/lib/native:/usr/lib/hadoop/lib/native/Linux-amd64-64:$SPARK_SUBMIT_LIBRARY_PATH
export SPARK_SUBMIT_CLASSPATH=/usr/lib/hadoop/lib/hadoop-openstack-2.4.0.jar:/usr/lib/hadoop/lib/jackson-core-asl-1.8.8.jar:/usr/lib/spark-yarn/lib/datanucleus-api-jdo-3.2.6.jar:/usr/lib/spark-yarn/lib/datanucleus-core-3.2.10.jar:/usr/lib/spark-yarn/lib/datanucleus-rdbms-3.2.9.jar:/usr/lib/hadoop/lib/hadoop-lzo-0.6.0.jar:$SPARK_SUBMIT_CLASSPATH

The lzo jar and the SPARK_SUBMIT_LIBRARY_PATH were required to get anything at 
all to work.  Without them, basic communication failed because it couldn't load 
the lzo library to compress/decompress the data.  The datanucleus stuff was 
required for hive on spark, and the hadoop-openstack and jackson jars are for 
the swiftfs hdfs plugin to work from within spark-shell.

I tried stuff like:

export SPARK_SUBMIT_CLASSPATH=/usr/lib/hadoop/lib/*

But that didn't work at all.  I have to specify every individual jar like that.

Is there something I'm missing or some easier way to accomplish this?  I'm 
worried that I'll keep finding more missing dependencies as we explore other 
features and the classpath string is going to take up a whole screen.
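
One workaround I'm considering (untested, just a sketch) is to build the
colon-separated list with the shell instead of hoping the classpath itself
expands wildcards:

# collect every jar matched by the globs into one colon-separated string
CP=""
for jar in /usr/lib/hadoop/lib/*.jar /usr/lib/spark-yarn/lib/datanucleus-*.jar; do
  CP="$CP:$jar"
done
export SPARK_SUBMIT_CLASSPATH="${CP#:}:$SPARK_SUBMIT_CLASSPATH"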

Greg

From: Greg greg.h...@rackspace.com
Date: Tuesday, October 14, 2014 1:57 PM
To: user@spark.apache.org
Subject: SPARK_SUBMIT_CLASSPATH question

It seems to me that SPARK_SUBMIT_CLASSPATH does not support wildcards in the 
paths you add, the way other tools do.  For some reason it doesn't 
pick up the classpath information from yarn-site.xml either, it seems, when 
running on YARN.  I'm having to manually add every single dependency JAR.  
There must be a better way, so what am I missing?

Greg



SPARK_SUBMIT_CLASSPATH question

2014-10-14 Thread Greg Hill
It seems to me that SPARK_SUBMIT_CLASSPATH does not support wildcards in the 
paths you add, the way other tools do.  For some reason it doesn't 
pick up the classpath information from yarn-site.xml either, it seems, when 
running on YARN.  I'm having to manually add every single dependency JAR.  
There must be a better way, so what am I missing?

Greg



Re: Spark on YARN driver memory allocation bug?

2014-10-09 Thread Greg Hill
$MASTER is 'yarn-cluster' in spark-env.sh

spark-submit --driver-memory 12424m --class org.apache.spark.examples.SparkPi /usr/lib/spark-yarn/lib/spark-examples*.jar 1000

OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0006fd28, 4342677504, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 4342677504 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/jvm-3525/hs_error.log


From: Andrew Or and...@databricks.com
Date: Wednesday, October 8, 2014 3:25 PM
To: Greg greg.h...@rackspace.com
Cc: user@spark.apache.org
Subject: Re: Spark on YARN driver memory allocation bug?

Hi Greg,

It does seem like a bug. What is the particular exception message that you see?

Andrew

2014-10-08 12:12 GMT-07:00 Greg Hill greg.h...@rackspace.com:
So, I think this is a bug, but I wanted to get some feedback before I reported 
it as such.  On Spark on YARN, 1.1.0, if you specify the --driver-memory value 
to be higher than the memory available on the client machine, Spark errors out 
due to failing to allocate enough memory.  This happens even in yarn-cluster 
mode.  Shouldn't it only allocate that memory on the YARN node that is going to 
run the driver process, not the local client machine?

Greg




Spark on YARN driver memory allocation bug?

2014-10-08 Thread Greg Hill
So, I think this is a bug, but I wanted to get some feedback before I reported 
it as such.  On Spark on YARN, 1.1.0, if you specify the --driver-memory value 
to be higher than the memory available on the client machine, Spark errors out 
due to failing to allocate enough memory.  This happens even in yarn-cluster 
mode.  Shouldn't it only allocate that memory on the YARN node that is going to 
run the driver process, not the local client machine?

Greg



Re: Spark with YARN

2014-09-24 Thread Greg Hill
Do you have YARN_CONF_DIR set in your environment to point Spark to where your 
yarn configs are?
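
For example (assuming the typical layout; point it at wherever your
yarn-site.xml actually lives):

export YARN_CONF_DIR=/etc/hadoop/conf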

Greg

From: Raghuveer Chanda raghuveer.cha...@gmail.com
Date: Wednesday, September 24, 2014 12:25 PM
To: u...@spark.incubator.apache.org
Subject: Spark with YARN

Hi,

I'm new to Spark and facing a problem running a job on a cluster using YARN.

Initially I ran jobs with the master set to --master spark://dml2:7077 and they 
ran fine on 3 workers.

But now I'm shifting to YARN, so I installed YARN through Cloudera on a 3-node 
cluster and changed the master to yarn-cluster, but it is not working.  I 
attached screenshots of the UI, which is not progressing and just hangs.

Output on terminal:

This error keeps repeating:

./spark-submit --class class-name --master yarn-cluster --num-executors 3 
--executor-cores 3 jar-with-dependencies.jar


Do I need to configure YARN, or why is it not getting all the workers?  Please 
help.


14/09/24 22:44:21 INFO yarn.Client: Application report from ASM:
application identifier: application_1411578463780_0001
appId: 1
clientToAMToken: null
appDiagnostics:
appMasterHost: dml3
appQueue: root.chanda
appMasterRpcPort: 0
appStartTime: 1411578513545
yarnAppState: RUNNING
distributedFinalState: UNDEFINED
appTrackingUrl: http://dml2:8088/proxy/application_1411578463780_0001/
appUser: chanda
14/09/24 22:44:22 INFO yarn.Client: Application report from ASM:
application identifier: application_1411578463780_0001
appId: 1
clientToAMToken: null
appDiagnostics:
appMasterHost: dml3
appQueue: root.chanda
appMasterRpcPort: 0
appStartTime: 1411578513545
yarnAppState: RUNNING
distributedFinalState: UNDEFINED
appTrackingUrl: http://dml2:8088/proxy/application_1411578463780_0001/




--
Regards,
Raghuveer Chanda
4th year Undergraduate Student
Computer Science and Engineering
IIT Kharagpur


Re: clarification for some spark on yarn configuration options

2014-09-23 Thread Greg Hill
Thanks for looking into it.  I'm trying to avoid making the user pass in any 
parameters by configuring it to use the right values for the cluster size by 
default, hence my reliance on the configuration.  I'd rather just use 
spark-defaults.conf than the environment variables, and looking at the code you 
modified, I don't see any place it's picking up spark.driver.memory either.  Is 
that a separate bug?

Greg


From: Andrew Or and...@databricks.com
Date: Monday, September 22, 2014 8:11 PM
To: Nishkam Ravi nr...@cloudera.com
Cc: Greg greg.h...@rackspace.com, user@spark.apache.org
Subject: Re: clarification for some spark on yarn configuration options

Hi Greg,

From browsing the code quickly I believe SPARK_DRIVER_MEMORY is not actually 
picked up in cluster mode. This is a bug and I have opened a PR to fix it: 
https://github.com/apache/spark/pull/2500.
For now, please use --driver-memory instead, which should work for both client 
and cluster mode.

Thanks for pointing this out,
-Andrew

2014-09-22 14:04 GMT-07:00 Nishkam Ravi nr...@cloudera.com:
Maybe try --driver-memory if you are using spark-submit?

Thanks,
Nishkam

On Mon, Sep 22, 2014 at 1:41 PM, Greg Hill greg.h...@rackspace.com wrote:
Ah, I see.  It turns out that my problem is that the comparison is ignoring 
SPARK_DRIVER_MEMORY and comparing against the default of 512m.  Is that a bug 
that's since been fixed?  I'm on 1.0.1 and using 'yarn-cluster' as the master.  
'yarn-client' seems to pick up the values and works fine.

Greg

From: Nishkam Ravi nr...@cloudera.com
Date: Monday, September 22, 2014 3:30 PM
To: Greg greg.h...@rackspace.com
Cc: Andrew Or and...@databricks.com, user@spark.apache.org

Subject: Re: clarification for some spark on yarn configuration options

Greg, if you look carefully, the code is enforcing that the memoryOverhead be 
lower (and not higher) than spark.driver.memory.

Thanks,
Nishkam

On Mon, Sep 22, 2014 at 1:26 PM, Greg Hill greg.h...@rackspace.com wrote:
I thought I had this all figured out, but I'm getting some weird errors now 
that I'm attempting to deploy this on production-size servers.  It's 
complaining that I'm not allocating enough memory for the memoryOverhead 
values.  I tracked it down to this code:

https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70

Unless I'm reading it wrong, those checks are enforcing that you set 
spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but 
that makes no sense to me since that memory is just supposed to be what YARN 
needs on top of what you're allocating for Spark.  My understanding was that 
the overhead values should be quite a bit lower (and by default they are).
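
For concreteness, this is the kind of relationship I mean (illustrative values
only, not our actual config; the overhead settings take a plain number of MB in
this version, as far as I can tell):

spark.driver.memory                 4g
spark.yarn.driver.memoryOverhead    384
spark.executor.memory               8g
spark.yarn.executor.memoryOverhead  512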

Also, why must the executor be allocated less memory than the driver's memory 
overhead value?

What am I misunderstanding here?

Greg

From: Andrew Or and...@databricks.com
Date: Tuesday, September 9, 2014 5:49 PM
To: Greg greg.h...@rackspace.com
Cc: user@spark.apache.org
Subject: Re: clarification for some spark on yarn configuration options

Hi Greg,

SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The 
equivalent spark.executor.instances is just another way to set the same thing 
in your spark-defaults.conf. Maybe this should be documented. :)

spark.yarn.executor.memoryOverhead is just an additional margin added to 
spark.executor.memory for the container. In addition to the executor's 
memory, the container in which the executor is launched needs some extra memory 
for system processes, and this is what this overhead (somewhat of a misnomer) 
is for. The same goes for the driver equivalent.

spark.driver.memory behaves differently depending on which version of Spark 
you are using. If you are using Spark 1.1+ (this was released very recently), 
you can directly set spark.driver.memory and this will take effect. 
Otherwise, setting this doesn't actually do anything for client deploy mode, 
and you have two alternatives: (1) set the environment variable equivalent 
SPARK_DRIVER_MEMORY in spark-env.sh, and (2) if you are using Spark submit (or 
bin/spark-shell, or bin/pyspark, which go through bin/spark-submit), pass the 
--driver-memory command line argument.

If you want your PySpark application (driver) to pick up extra class path, you 
can pass the --driver-class-path to Spark submit. If you are using Spark 
1.1+, you may set spark.driver.extraClassPath in your spark-defaults.conf. 
There is also an environment variable you could set (SPARK_CLASSPATH), though 
this is now deprecated.

-Andrew


recommended values for spark driver memory?

2014-09-23 Thread Greg Hill
I know the recommendation is "it depends", but can people share what sort of 
memory allocations they're using for their driver processes?  I'd like to get 
an idea of what the range looks like so we can provide sensible defaults 
without necessarily knowing what the jobs will look like.  The customer can 
then tweak that if they need to for their particular job.

Thanks in advance.

Greg



Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Greg Hill
I thought I had this all figured out, but I'm getting some weird errors now 
that I'm attempting to deploy this on production-size servers.  It's 
complaining that I'm not allocating enough memory for the memoryOverhead 
values.  I tracked it down to this code:

https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70

Unless I'm reading it wrong, those checks are enforcing that you set 
spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but 
that makes no sense to me since that memory is just supposed to be what YARN 
needs on top of what you're allocating for Spark.  My understanding was that 
the overhead values should be quite a bit lower (and by default they are).

Also, why must the executor be allocated less memory than the driver's memory 
overhead value?

What am I misunderstanding here?

Greg

From: Andrew Or and...@databricks.com
Date: Tuesday, September 9, 2014 5:49 PM
To: Greg greg.h...@rackspace.com
Cc: user@spark.apache.org
Subject: Re: clarification for some spark on yarn configuration options

Hi Greg,

SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The 
equivalent spark.executor.instances is just another way to set the same thing 
in your spark-defaults.conf. Maybe this should be documented. :)

spark.yarn.executor.memoryOverhead is just an additional margin added to 
spark.executor.memory for the container. In addition to the executor's 
memory, the container in which the executor is launched needs some extra memory 
for system processes, and this is what this overhead (somewhat of a misnomer) 
is for. The same goes for the driver equivalent.

spark.driver.memory behaves differently depending on which version of Spark 
you are using. If you are using Spark 1.1+ (this was released very recently), 
you can directly set spark.driver.memory and this will take effect. 
Otherwise, setting this doesn't actually do anything for client deploy mode, 
and you have two alternatives: (1) set the environment variable equivalent 
SPARK_DRIVER_MEMORY in spark-env.sh, and (2) if you are using Spark submit (or 
bin/spark-shell, or bin/pyspark, which go through bin/spark-submit), pass the 
--driver-memory command line argument.
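
To make that concrete (the value is illustrative; pick whatever fits your job),
the three equivalents look like this:

spark.driver.memory 4g              # conf/spark-defaults.conf, Spark 1.1+
export SPARK_DRIVER_MEMORY=4g       # conf/spark-env.sh, older versions
spark-submit --driver-memory 4g ... # works for both client and cluster mode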

If you want your PySpark application (driver) to pick up extra class path, you 
can pass the --driver-class-path to Spark submit. If you are using Spark 
1.1+, you may set spark.driver.extraClassPath in your spark-defaults.conf. 
There is also an environment variable you could set (SPARK_CLASSPATH), though 
this is now deprecated.

Let me know if you have more questions about these options,
-Andrew


2014-09-08 6:59 GMT-07:00 Greg Hill greg.h...@rackspace.com:
Is SPARK_EXECUTOR_INSTANCES the total number of workers in the cluster or the 
workers per slave node?

Is spark.executor.instances an actual config option?  I found that in a commit, 
but it's not in the docs.

What is the difference between spark.yarn.executor.memoryOverhead and 
spark.executor.memory ?  Same question for the 'driver' variant, but I assume 
it's the same answer.

Is there a spark.driver.memory option that's undocumented or do you have to use 
the environment variable SPARK_DRIVER_MEMORY?

What config option or environment variable do I need to set to get pyspark 
interactive to pick up the yarn class path?  The ones that work for spark-shell 
and spark-submit don't seem to work for pyspark.

Thanks in advance.

Greg



Re: clarification for some spark on yarn configuration options

2014-09-22 Thread Greg Hill
Gah, ignore me again.  I was reading the logic backwards.  For some reason it 
isn't picking up my SPARK_DRIVER_MEMORY environment variable and is using the 
default of 512m.  Probably an environmental issue.

Greg

From: Greg greg.h...@rackspace.com
Date: Monday, September 22, 2014 3:26 PM
To: Andrew Or and...@databricks.com
Cc: user@spark.apache.org
Subject: Re: clarification for some spark on yarn configuration options

I thought I had this all figured out, but I'm getting some weird errors now 
that I'm attempting to deploy this on production-size servers.  It's 
complaining that I'm not allocating enough memory for the memoryOverhead 
values.  I tracked it down to this code:

https://github.com/apache/spark/blob/ed1980ffa9ccb87d76694ba910ef22df034bca49/yarn/common/src/main/scala/org/apache/spark/deploy/yarn/ClientBase.scala#L70

Unless I'm reading it wrong, those checks are enforcing that you set 
spark.yarn.driver.memoryOverhead to be higher than spark.driver.memory, but 
that makes no sense to me since that memory is just supposed to be what YARN 
needs on top of what you're allocating for Spark.  My understanding was that 
the overhead values should be quite a bit lower (and by default they are).

Also, why must the executor be allocated less memory than the driver's memory 
overhead value?

What am I misunderstanding here?

Greg

From: Andrew Or and...@databricks.com
Date: Tuesday, September 9, 2014 5:49 PM
To: Greg greg.h...@rackspace.com
Cc: user@spark.apache.org
Subject: Re: clarification for some spark on yarn configuration options

Hi Greg,

SPARK_EXECUTOR_INSTANCES is the total number of workers in the cluster. The 
equivalent spark.executor.instances is just another way to set the same thing 
in your spark-defaults.conf. Maybe this should be documented. :)

spark.yarn.executor.memoryOverhead is just an additional margin added to 
spark.executor.memory for the container. In addition to the executor's 
memory, the container in which the executor is launched needs some extra memory 
for system processes, and this is what this overhead (somewhat of a misnomer) 
is for. The same goes for the driver equivalent.

spark.driver.memory behaves differently depending on which version of Spark 
you are using. If you are using Spark 1.1+ (this was released very recently), 
you can directly set spark.driver.memory and this will take effect. 
Otherwise, setting this doesn't actually do anything for client deploy mode, 
and you have two alternatives: (1) set the environment variable equivalent 
SPARK_DRIVER_MEMORY in spark-env.sh, and (2) if you are using Spark submit (or 
bin/spark-shell, or bin/pyspark, which go through bin/spark-submit), pass the 
--driver-memory command line argument.

If you want your PySpark application (driver) to pick up extra class path, you 
can pass the --driver-class-path to Spark submit. If you are using Spark 
1.1+, you may set spark.driver.extraClassPath in your spark-defaults.conf. 
There is also an environment variable you could set (SPARK_CLASSPATH), though 
this is now deprecated.

Let me know if you have more questions about these options,
-Andrew


2014-09-08 6:59 GMT-07:00 Greg Hill greg.h...@rackspace.com:
Is SPARK_EXECUTOR_INSTANCES the total number of workers in the cluster or the 
workers per slave node?

Is spark.executor.instances an actual config option?  I found that in a commit, 
but it's not in the docs.

What is the difference between spark.yarn.executor.memoryOverhead and 
spark.executor.memory ?  Same question for the 'driver' variant, but I assume 
it's the same answer.

Is there a spark.driver.memory option that's undocumented or do you have to use 
the environment variable SPARK_DRIVER_MEMORY?

What config option or environment variable do I need to set to get pyspark 
interactive to pick up the yarn class path?  The ones that work for spark-shell 
and spark-submit don't seem to work for pyspark.

Thanks in advance.

Greg



Re: spark on yarn history server + hdfs permissions issue

2014-09-11 Thread Greg Hill
To answer my own question, in case someone else runs into this: the spark user 
needs to be in the same group on the namenode, and HDFS caches that group 
information for what seems like at least an hour.  It magically started working 
on its own.
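
If anyone needs to check this, something like the following should show which
groups the namenode currently maps a user to:

hdfs groups spark
hdfs groups lavaqe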

Greg

From: Greg greg.h...@rackspace.com
Date: Tuesday, September 9, 2014 2:30 PM
To: user@spark.apache.org
Subject: spark on yarn history server + hdfs permissions issue

I am running Spark on Yarn with the HDP 2.1 technical preview.  I'm having 
issues getting the spark history server permissions to read the spark event 
logs from hdfs.  Both sides are configured to write/read logs from:

hdfs:///apps/spark/events

The history server is running as user spark; the jobs are running as user 
lavaqe.  Both users are in the hdfs group on all the nodes in the cluster.

That root logs folder is globally writeable, but owned by the spark user:

drwxrwxrwx   - spark hdfs  0 2014-09-09 18:19 /apps/spark/events

All good so far.  Spark jobs create subfolders and put their event logs in 
there just fine.  The problem is that the history server, running as the spark 
user, cannot read those logs.  They're written as the user that initiates the 
job, but still in the same hdfs group:

drwxrwx---   - lavaqe hdfs  0 2014-09-09 19:24 
/apps/spark/events/spark-pi-1410290714996

The files are group readable/writable, but this is the error I get:

Permission denied: user=spark, access=READ_EXECUTE, 
inode=/apps/spark/events/spark-pi-1410290714996:lavaqe:hdfs:drwxrwx---

So, two questions, I guess:

1. Do group permissions just plain not work in hdfs or am I missing something?
2. Is there a way to tell Spark to log with more permissive permissions so the 
history server can read the generated logs?

Greg


spark on yarn history server + hdfs permissions issue

2014-09-09 Thread Greg Hill
I am running Spark on Yarn with the HDP 2.1 technical preview.  I'm having 
issues getting the spark history server permissions to read the spark event 
logs from hdfs.  Both sides are configured to write/read logs from:

hdfs:///apps/spark/events

The history server is running as user spark; the jobs are running as user 
lavaqe.  Both users are in the hdfs group on all the nodes in the cluster.

That root logs folder is globally writeable, but owned by the spark user:

drwxrwxrwx   - spark hdfs  0 2014-09-09 18:19 /apps/spark/events

All good so far.  Spark jobs create subfolders and put their event logs in 
there just fine.  The problem is that the history server, running as the spark 
user, cannot read those logs.  They're written as the user that initiates the 
job, but still in the same hdfs group:

drwxrwx---   - lavaqe hdfs  0 2014-09-09 19:24 
/apps/spark/events/spark-pi-1410290714996

The files are group readable/writable, but this is the error I get:

Permission denied: user=spark, access=READ_EXECUTE, 
inode=/apps/spark/events/spark-pi-1410290714996:lavaqe:hdfs:drwxrwx---

So, two questions, I guess:

1. Do group permissions just plain not work in hdfs or am I missing something?
2. Is there a way to tell Spark to log with more permissive permissions so the 
history server can read the generated logs?

Greg


Re: pyspark on yarn hdp hortonworks

2014-09-05 Thread Greg Hill
I'm running into a problem getting this working as well.  I have spark-submit 
and spark-shell working fine, but pyspark in interactive mode can't seem to 
find the lzo jar:

java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not 
found

This class is in /usr/lib/hadoop/lib/hadoop-lzo-0.6.0.jar, which is in my 
SPARK_CLASSPATH environment variable, but that doesn't seem to be picked up by 
pyspark.

Any ideas?  I can't find much in the way of docs on getting the environment 
right for pyspark.

Greg

From: Andrew Or and...@databricks.com
Date: Wednesday, September 3, 2014 4:19 PM
To: Oleg Ruchovets oruchov...@gmail.com
Cc: user@spark.apache.org
Subject: Re: pyspark on yarn hdp hortonworks

Hi Oleg,

There isn't much you need to do to set up a YARN cluster to run PySpark. You 
need to make sure all machines have Python installed, and... that's about it. 
Your assembly jar will be shipped to all containers along with all the pyspark 
and py4j files needed. One caveat, however, is that the jar needs to be built 
with Maven and not on a Red Hat-based OS:

http://spark.apache.org/docs/latest/building-with-maven.html#building-for-pyspark-on-yarn
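
For example, something along these lines (the exact profiles and versions
depend on your Hadoop distribution; this is just the general shape of the
command):

mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package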

In addition, it should be built with Java 6 because of a known issue with 
building jars with Java 7 and including python files in them 
(https://issues.apache.org/jira/browse/SPARK-1718). Lastly, if you have trouble 
getting it to work, you can follow the steps I have listed in a different 
thread to figure out what's wrong:

http://mail-archives.apache.org/mod_mbox/spark-user/201406.mbox/%3ccamjob8mr1+ias-sldz_rfrke_na2uubnmhrac4nukqyqnun...@mail.gmail.com%3e

Let me know if you can get it working,
-Andrew





2014-09-03 5:03 GMT-07:00 Oleg Ruchovets oruchov...@gmail.com:
Hi all.
   I have been trying to run PySpark on YARN for a couple of days now:

http://hortonworks.com/kb/spark-1-0-1-technical-preview-hdp-2-1-3/

I posted the exception in previous posts. It looks like I didn't do the 
configuration correctly.  I googled quite a lot and I can't find the steps that 
should be done to configure PySpark to run on YARN.

Can you please share the steps (critical points) that need to be configured to 
use PySpark on YARN (Hortonworks distribution):
  Environment variables.
  Classpath.
  Copying jars to all machines.
  Other configuration.

Thanks
Oleg.




spark history server trying to hit port 8021

2014-09-03 Thread Greg Hill
My Spark history server won't start because it's trying to hit the namenode on 
8021, but the namenode is on 8020 (the default).  How can I configure the 
history server to use the right port?  I can't find any relevant setting in the 
docs: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/monitoring.html

Greg


Re: spark history server trying to hit port 8021

2014-09-03 Thread Greg Hill
Never mind, PEBKAC.  I had put the wrong port in the $LOG_DIR environment 
variable.
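
In other words, the event log location just needed the namenode's real port.
Assuming, as in my case, that the start script picks the location up from
$LOG_DIR, the fix was along the lines of:

export LOG_DIR=hdfs://namenode-host:8020/apps/spark/events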

Greg

From: Greg greg.h...@rackspace.com
Date: Wednesday, September 3, 2014 1:56 PM
To: user@spark.apache.org
Subject: spark history server trying to hit port 8021

My Spark history server won't start because it's trying to hit the namenode on 
8021, but the namenode is on 8020 (the default).  How can I configure the 
history server to use the right port?  I can't find any relevant setting in the 
docs: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/monitoring.html

Greg


Spark on YARN question

2014-09-02 Thread Greg Hill
I'm working on setting up Spark on YARN using the HDP technical preview - 
http://hortonworks.com/kb/spark-1-0-1-technical-preview-hdp-2-1-3/

I have installed the Spark JARs on all the slave nodes and configured YARN to 
find the JARs.  It seems like everything is working.

Unless I'm misunderstanding, it seems like there isn't any configuration 
required on the YARN slave nodes at all, apart from telling YARN where to find 
the Spark JAR files.  Do the YARN processes even pick up local Spark 
configuration files on the slave nodes, or is that all just pulled in on the 
client and passed along to YARN?

Greg


Re: Spark on YARN question

2014-09-02 Thread Greg Hill
Thanks.  That sounds like how I was thinking it worked.  I did have to install 
the JARs on the slave nodes for yarn-cluster mode to work, FWIW.  It's probably 
just whichever node ends up spawning the application master that needs it, but 
it wasn't passed along from spark-submit.

Greg

From: Andrew Or and...@databricks.com
Date: Tuesday, September 2, 2014 11:05 AM
To: Matt Narrell matt.narr...@gmail.com
Cc: Greg greg.h...@rackspace.com, user@spark.apache.org
Subject: Re: Spark on YARN question

Hi Greg,

You should not even need to manually install Spark on each of the worker nodes 
or put it into HDFS yourself. Spark on Yarn will ship all necessary jars (i.e. 
the assembly + additional jars) to each of the containers for you. You can 
specify additional jars that your application depends on through the --jars 
argument if you are using spark-submit / spark-shell / pyspark. As for 
environment variables, you can specify SPARK_YARN_USER_ENV on the driver node 
(where your application is submitted) to specify environment variables to be 
observed by your executors. If you are using the spark-submit / spark-shell / 
pyspark scripts, then you can set Spark properties in the 
conf/spark-defaults.conf properties file, and these will be propagated to the 
executors. In other words, configurations on the slave nodes don't do anything.

For example,
$ vim conf/spark-defaults.conf // set a few properties
$ export SPARK_YARN_USER_ENV=YARN_LOCAL_DIR=/mnt,/mnt2
$ bin/spark-shell --master yarn --jars /local/path/to/my/jar1,/another/jar2

Best,
-Andrew