Just an FYI - I can submit the SparkPi app to YARN in cluster mode on a 1-node m3.xlarge EC2 instance, and the app finishes running successfully in about 40 seconds. I just figured the 30 - 40 sec run time was normal b/c of the submission overhead that Andrew mentioned.
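If it helps to put a number on that, below is a minimal sketch of a shell helper that reports the wall-clock time of any command. The name `time_cmd` is hypothetical (it is not part of Spark or YARN). It assumes spark-submit stays in the foreground until the application completes, so it measures the whole submission plus run, not just the ACCEPTED to RUNNING gap.

```shell
# Hypothetical helper: run a command and print its wall-clock duration.
# Resolution is whole seconds, which is enough for a 40s-vs-30min comparison.
time_cmd() {
  _start=$(date +%s)
  _status=0
  "$@" || _status=$?          # run the command, remember its exit status
  _end=$(date +%s)
  echo "elapsed: $((_end - _start))s (exit status $_status)"
  return "$_status"
}

# Usage with the SparkPi job from this thread (the jar path is cluster-specific):
#   time_cmd spark-submit --class org.apache.spark.examples.SparkPi \
#     --deploy-mode cluster --master yarn \
#     /opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/jars/spark-examples-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar 10
```

Running the same job this way on standalone and on YARN should give two directly comparable numbers.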
Denny, you can maybe also try to run SparkPi against YARN as a speed check.

    spark-submit --class org.apache.spark.examples.SparkPi \
      --deploy-mode cluster --master yarn \
      /opt/cloudera/parcels/CDH-5.2.1-1.cdh5.2.1.p0.12/jars/spark-examples-1.1.0-cdh5.2.1-hadoop2.5.0-cdh5.2.1.jar \
      10

On Fri, Dec 5, 2014 at 2:32 PM, Denny Lee <denny.g....@gmail.com> wrote:

> My submissions of Spark on YARN (CDH 5.2) resulted in a few thousand
> steps. If I was running this on standalone cluster mode the query finished
> in 55s but on YARN, the query was still running 30min later. Would the hard
> coded sleeps potentially be in play here?
>
> On Fri, Dec 5, 2014 at 11:23 Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
>> Hi Tobias,
>>
>> What version are you using? In some recent versions, we had a couple of
>> large hardcoded sleeps on the Spark side.
>>
>> -Sandy
>>
>> On Fri, Dec 5, 2014 at 11:15 AM, Andrew Or <and...@databricks.com> wrote:
>>
>>> Hey Tobias,
>>>
>>> As you suspect, the reason why it's slow is because the resource manager
>>> in YARN takes a while to grant resources. This is because YARN needs to
>>> first set up the application master container, and then this AM needs to
>>> request more containers for Spark executors. I think this accounts for most
>>> of the overhead. The remaining source probably comes from how our own YARN
>>> integration code polls application (every second) and cluster resource
>>> states (every 5 seconds IIRC). I haven't explored in detail whether there
>>> are optimizations there that can speed this up, but I believe most of the
>>> overhead comes from YARN itself.
>>>
>>> In other words, no I don't know of any quick fix on your end that you
>>> can do to speed this up.
>>>
>>> -Andrew
>>>
>>>
>>> 2014-12-03 20:10 GMT-08:00 Tobias Pfeiffer <t...@preferred.jp>:
>>>
>>>> Hi,
>>>>
>>>> I am using spark-submit to submit my application to YARN in
>>>> "yarn-cluster" mode. I have both the Spark assembly jar file as well as my
>>>> application jar file put in HDFS and can see from the logging output that
>>>> both files are used from there. However, it still takes about 10 seconds
>>>> for my application's yarnAppState to switch from ACCEPTED to RUNNING.
>>>>
>>>> I am aware that this is probably not a Spark issue, but some YARN
>>>> configuration setting (or YARN-inherent slowness), I was just wondering if
>>>> anyone has an advice for how to speed this up.
>>>>
>>>> Thanks
>>>> Tobias