Re: View executor logs on YARN mode

2015-01-14 Thread Brett Meyer
You can view the logs for the particular containers on the YARN UI if you go
to the page for a specific node, and then from the Tools menu on the left,
select Local Logs.  There should be a userlogs directory which will contain
the specific application ids for each job that you run.  Inside the
directories for the applications, you can find the specific container logs.


From:  lonely Feb lonely8...@gmail.com
Date:  Wednesday, January 14, 2015 at 8:07 AM
To:  user@spark.apache.org user@spark.apache.org
Subject:  View executor logs on YARN mode

Hi all, I have sadly found that in YARN mode I cannot view executor logs on the YARN web
UI or on the Spark history web UI. On the YARN web UI I can only view the AppMaster
logs, and on the Spark history web UI I can only view application metric
information. If I want to see whether an executor is doing full GCs, I can only
use the yarn logs command, which is very unfriendly. Furthermore, the yarn logs command
requires a containerID option which I could not find on the YARN web UI
either; I have to grep for it in the AppMaster log.

I just wonder: how do you all handle this situation?
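For what it's worth, once an application has finished (and YARN log aggregation is enabled), yarn logs -applicationId dumps the logs of every container, executors included, so no containerID is needed up front. A minimal sketch of grepping those logs for GC activity from Python; the application ID below is made up, and you would take the real one from the YARN RM UI:

import subprocess

app_id = "application_1421200000000_0001"  # hypothetical; copy the real ID from the YARN RM UI

# "yarn logs -applicationId <id>" prints the aggregated logs of every container
# (AppMaster and executors) once the application has finished.
logs = subprocess.check_output(["yarn", "logs", "-applicationId", app_id],
                               universal_newlines=True)

# Scan for GC activity without needing to know any container ID up front.
for line in logs.splitlines():
    if "Full GC" in line:
        print(line)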






Re: Spark executors resources. Blocking?

2015-01-14 Thread Brett Meyer
I can't speak to Mesos solutions, but for YARN you can define queues in
which to run your jobs, and you can customize the amount of resources each
queue consumes.  When deploying your Spark job, you can specify the --queue
queue_name option to schedule the job to a particular queue.  Here are
some links for reference:

http://hadoop.apache.org/docs/r2.5.1/hadoop-yarn/hadoop-yarn-site/CapacityScheduler.html
http://lucene.472066.n3.nabble.com/Capacity-Scheduler-on-YARN-td4081242.html
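If you would rather set the queue from code than on the spark-submit command line, the same thing can be expressed through the Spark configuration via spark.yarn.queue (the programmatic equivalent of --queue). A minimal sketch; the queue name is hypothetical:

from pyspark import SparkConf, SparkContext

# Submit to a specific YARN capacity-scheduler queue from PySpark.
conf = (SparkConf()
        .setAppName("queued-job")
        .set("spark.yarn.queue", "analytics"))   # hypothetical queue name
sc = SparkContext(conf=conf)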



From:  Luis Guerra luispelay...@gmail.com
Date:  Tuesday, January 13, 2015 at 3:19 AM
To:  Buttler, David buttl...@llnl.gov, user user@spark.apache.org
Subject:  Re: Spark executors resources. Blocking?

Thanks for your answer David,

It is as I thought, then. When you write that there are approaches to solve
this using Yarn or Mesos, could you give an idea of what those might be? Or
better yet, is there a site with documentation about this issue?
Currently we are launching our jobs using Yarn, but we still do not know
how to schedule them properly to get the highest utilization out of our
cluster.

Best

On Tue, Jan 13, 2015 at 2:12 AM, Buttler, David buttl...@llnl.gov wrote:
 Spark has a built-in cluster manager (the Spark standalone cluster), but it
 is not designed for multi-tenancy -- either multiple people using the system or
 multiple tasks sharing resources.  It is a first-in, first-out queue of tasks
 where tasks will block until the previous tasks are finished (as you
 described).  If you want higher utilization of your cluster, then you
 should use either Yarn or Mesos to schedule the system.  The same issues will
 come up, but they have a much broader range of approaches that you can take to
 solve the problem.
  
 Dave
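One YARN-specific approach along these lines (available as of Spark 1.2) is dynamic executor allocation, which lets a job release idle executors so a concurrently running job can pick them up. A hedged sketch of the relevant settings only; the numbers are illustrative, and the external shuffle service must also be installed on each NodeManager:

from pyspark import SparkConf, SparkContext

# Enable dynamic allocation on YARN (Spark 1.2+) so execution A gives back
# executors it is not using and execution B can grab them.
conf = (SparkConf()
        .setAppName("elastic-job")
        .set("spark.dynamicAllocation.enabled", "true")
        .set("spark.dynamicAllocation.minExecutors", "2")    # illustrative
        .set("spark.dynamicAllocation.maxExecutors", "20")   # illustrative
        .set("spark.shuffle.service.enabled", "true"))       # external shuffle service required
sc = SparkContext(conf=conf)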
  
 From: Luis Guerra [mailto:luispelay...@gmail.com]
 Sent: Monday, January 12, 2015 8:36 AM
 To: user
 Subject: Spark executors resources. Blocking?
 
  
 
 Hello all,
 
  
 
 I have a naive question regarding how Spark uses the executors in a cluster of
 machines. Imagine a scenario in which I do not know the input size of my
 data for execution A, so I set Spark to use 20 of my 25 nodes, for
 instance. At the same time, I also launch a second execution B, setting Spark
 to use 10 nodes for it.
 
  
 
 Assuming a huge input size for execution A, which implies an execution time of
 30 minutes, for example (using all of its resources), and a constant execution
 time of 10 minutes for B, the two executions will take 40 minutes in total (I
 assume that B cannot be launched until 10 resources are completely available,
 i.e. when A finishes).
 
  
 
 Now, assuming a very small input size for execution A, running for 5 minutes on
 only 2 of the 20 planned resources, I would like execution B to be launched at
 that time, so that both executions together take only 10 minutes (and 12 resources).
 However, since execution A has told Spark to use 20 resources, execution B has to
 wait until A has finished, so the total execution time is 15 minutes.
 
  
 
 Is this right? If so, how can I handle this kind of scenario? If I am wrong,
 what would be the correct interpretation?
 
  
 
 Thanks in advance,
 
  
 
 Best







Location of logs in local mode

2015-01-06 Thread Brett Meyer
I'm submitting a script using spark-submit in local mode for testing, and
I'm having trouble figuring out where the logs are stored.  The
documentation indicates that they should be in the work folder in the
directory in which Spark lives on my system, but I see no such folder there.
I've set the SPARK_LOCAL_DIRS and SPARK_LOG_DIR environment variables in
spark-env.sh, but there doesn't seem to be any log output generated in the
locations I've specified there either.  I'm just using spark-submit with
--master local; I haven't run any of the standalone cluster scripts, so I'm
not sure if there's something I'm missing here as far as a default output
location for logging.

Thanks,
Brett






Location of logs in local mode

2014-12-30 Thread Brett Meyer
I'm submitting a script using spark-submit in local mode for testing, and
I'm having trouble figuring out where the logs are stored.  The
documentation indicates that they should be in the work folder in the
directory in which Spark lives on my system, but I see no such folder there.
I've set the SPARK_LOCAL_DIRS and SPARK_LOG_DIR environment variables in
spark-env.sh, but there doesn't seem to be any log output generated in the
locations I've specified there either.  I'm just using spark-submit with
--master local; I haven't run any of the standalone cluster scripts, so I'm
not sure if there's something I'm missing here as far as a default output
location for logging.

Thanks,
Brett






How to pass options to KeyConverter using PySpark

2014-12-29 Thread Brett Meyer
I'm running PySpark on YARN, and I'm reading in SequenceFiles for which I
have a custom KeyConverter class.  My KeyConverter needs to have some
configuration options passed to it, but I am unable to find a way to get the
options to that class without modifying the Spark source.  Is there a
currently built-in method by which this can be done?  The sequenceFile()
method does not take any arguments that get passed along to the KeyConverter, and the
SparkContext doesn't seem to be accessible from my KeyConverter class, so
I'm at a loss as to how to accomplish this.
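For context, a minimal sketch of how the converter is wired up today in PySpark: sequenceFile() accepts the converter's class name, but, as noted above, nothing that gets handed to the converter itself. The path and class names below are hypothetical:

from pyspark import SparkContext

sc = SparkContext(appName="seqfile-example")

# The key converter is named by class, but there is no parameter for passing
# options to it -- that is the gap described above.
rdd = sc.sequenceFile(
    "s3://my-bucket/input/",                        # hypothetical path
    keyClass="org.apache.hadoop.io.Text",
    valueClass="org.apache.hadoop.io.BytesWritable",
    keyConverter="com.example.MyKeyConverter")      # hypothetical converter class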

Thanks,
Brett






Many retries for Python job

2014-11-21 Thread Brett Meyer
I'm running a Python script with spark-submit on top of YARN on an EMR
cluster with 30 nodes.  The script reads in approximately 3.9 TB of data
from S3 and then does some transformations and filtering, followed by some
aggregate counts.  During Stage 2 of the job, everything appears to complete
just fine with no executor failures or resubmissions, but when Stage 3
starts up, many Stage 2 tasks have to be rerun due to FetchFailure errors.
I usually see at least 3-4 retries of Stage 2 before Stage 3 can
successfully start.  The whole application eventually completes, but all of
the retries add roughly an hour or more of overhead.

I'm trying to determine why there were FetchFailure exceptions, since
anything computed in the job that could not fit in the available memory
cache should by default be spilled to disk for later retrieval.  I also
see some "java.net.ConnectException: Connection refused" and
"java.io.IOException: sendMessageReliably failed without being ACK'd" errors
in the logs after a CancelledKeyException followed by a
ClosedChannelException, but I have no idea why the nodes in the EMR cluster
would suddenly stop being able to communicate.

If anyone has ideas as to why the data needs to be rerun several times in
this job, please let me know as I am fairly bewildered about this behavior.
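One knob that is sometimes suggested for the "failed without being ACK'd" symptom is the connection ACK timeout (spark.core.connection.ack.wait.timeout, a Spark 1.x property that defaults to 60 seconds). A hedged sketch of raising it; the 600-second value is only illustrative, and raising timeouts treats the symptom rather than whatever is stalling the shuffle servers:

from pyspark import SparkConf, SparkContext

# Raise the block-transfer ACK timeout so a long GC pause on a shuffle server
# does not immediately surface as a FetchFailure. 600s is illustrative only.
conf = (SparkConf()
        .setAppName("shuffle-heavy-job")
        .set("spark.core.connection.ack.wait.timeout", "600"))
sc = SparkContext(conf=conf)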






Re: Many retries for Python job

2014-11-21 Thread Brett Meyer
According to the web UI I don't see any executors dying during Stage 2.  I
looked over the YARN logs and didn't see anything suspicious, but I may not
have been looking closely enough.  Stage 2 seems to complete just fine; it's
just when it enters Stage 3 that the results from the previous stage seem to
be missing in many cases, resulting in FetchFailure errors.  I should
probably also mention that I have spark.storage.memoryFraction set to
0.2.

From:  Sandy Ryza sandy.r...@cloudera.com
Date:  Friday, November 21, 2014 at 1:41 PM
To:  Brett Meyer brett.me...@crowdstrike.com
Cc:  user@spark.apache.org user@spark.apache.org
Subject:  Re: Many retries for Python job

Hi Brett, 

Are you noticing executors dying?  Are you able to check the YARN
NodeManager logs and see whether YARN is killing them for exceeding memory
limits?

-Sandy

On Fri, Nov 21, 2014 at 9:47 AM, Brett Meyer brett.me...@crowdstrike.com
wrote:
 I'm running a Python script with spark-submit on top of YARN on an EMR cluster
 with 30 nodes.  The script reads in approximately 3.9 TB of data from S3 and
 then does some transformations and filtering, followed by some aggregate
 counts.  During Stage 2 of the job, everything appears to complete just fine
 with no executor failures or resubmissions, but when Stage 3 starts up, many
 Stage 2 tasks have to be rerun due to FetchFailure errors.  I usually see at
 least 3-4 retries of Stage 2 before Stage 3 can successfully start.  The whole
 application eventually completes, but all of the retries add roughly an hour
 or more of overhead.
 
 I'm trying to determine why there were FetchFailure exceptions, since anything
 computed in the job that could not fit in the available memory cache should by
 default be spilled to disk for later retrieval.  I also see some
 "java.net.ConnectException: Connection refused" and "java.io.IOException:
 sendMessageReliably failed without being ACK'd" errors in the logs after a
 CancelledKeyException followed by a ClosedChannelException, but I have no idea
 why the nodes in the EMR cluster would suddenly stop being able to
 communicate.
 
 If anyone has ideas as to why the data needs to be rerun several times in this
 job, please let me know as I am fairly bewildered about this behavior.







Failed jobs showing as SUCCEEDED on web UI

2014-11-11 Thread Brett Meyer
I'm running a Python script using spark-submit on YARN in an EMR cluster,
and if I have a job that fails due to an ExecutorLostFailure or if I kill the
job, it still shows up on the web UI with a FinalStatus of SUCCEEDED.  Is
this due to PySpark, or is there potentially some other issue with the job
failure status not propagating to the logs?
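One thing worth ruling out first (a guess, not a confirmed cause): if the Python driver catches the error and exits with status 0, YARN has no failure to report. A minimal sketch of making the driver's exit code explicit; main() is a hypothetical stand-in for the actual driver logic:

import sys
import traceback

def main():
    # hypothetical placeholder for the real PySpark driver logic
    pass

if __name__ == "__main__":
    try:
        main()
    except Exception:
        traceback.print_exc()
        sys.exit(1)   # exit non-zero so a driver failure is not reported as success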



