[ https://issues.apache.org/jira/browse/SPARK-5162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292069#comment-14292069 ]
Vladimir Grigor commented on SPARK-5162:
----------------------------------------

I second [~jared.holmb...@orchestro.com]

[~lianhuiwang] thank you! I'm going to try your PR.

Related issue

Even with this PR, there will be a problem using YARN in cluster mode on Amazon EMR.

Normally one submits YARN "jobs" via the API or the aws command-line utility, so paths to files are evaluated later on some remote host, and hence the files are not found. Currently Spark does not support non-local files. One idea would be to add support for non-local (Python) files, e.g. if a file is not local, it is downloaded and made available locally. Something similar to the "Distributed Cache" described at http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-input-distributed-cache.html

So the following code would work:
{code}
aws emr add-steps --cluster-id "j-XYWIXMD234" \
--steps Name=SparkPi,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://mybucketat.amazonaws.com/tasks/main.py,main.py,param1],ActionOnFailure=CONTINUE
{code}

What do you think? How do you run batch Python Spark scripts on YARN on Amazon?

> Python yarn-cluster mode
> ------------------------
>
>                 Key: SPARK-5162
>                 URL: https://issues.apache.org/jira/browse/SPARK-5162
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, YARN
>            Reporter: Dana Klassen
>              Labels: cluster, python, yarn
>
> Running pyspark in yarn is currently limited to ‘yarn-client’ mode. It would
> be great to be able to submit python applications to the cluster and (just
> like java classes) have the resource manager set up an AM on any node in the
> cluster. Does anyone know the issues blocking this feature? I was snooping
> around with enabling python apps:
> Removing the logic stopping python and yarn-cluster from SparkSubmit.scala
> ...
>     // The following modes are not supported or applicable
>     (clusterManager, deployMode) match {
>       ...
>       case (_, CLUSTER) if args.isPython =>
>         printErrorAndExit("Cluster deploy mode is currently not supported for python applications.")
>       ...
>     }
> ...
> and submitting application via:
> HADOOP_CONF_DIR={{insert conf dir}} ./bin/spark-submit --master yarn-cluster --num-executors 2 --py-files {{insert location of egg here}} --executor-cores 1 ../tools/canary.py
> Everything looks to run alright, pythonRunner is picked up as main class,
> resources get set up, yarn client gets launched but falls flat on its face:
> 2015-01-08 18:48:03,444 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: DEBUG: FAILED { {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py, 1420742868009, FILE, null }, Resource {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py changed on src filesystem (expected 1420742868009, was 1420742869284
> and
> 2015-01-08 18:48:03,446 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource {{redacted}}/.sparkStaging/application_1420594669313_4687/canary.py(->/data/4/yarn/nm/usercache/klassen/filecache/11/canary.py) transitioned from DOWNLOADING to FAILED
> Tracked this down to the Apache Hadoop code (FSDownload.java, line 249) related
> to container localization of files upon downloading. At this point I thought it
> would be best to raise the issue here and get input.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
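
For illustration only, the "download non-local files and make them available locally" idea from the comment above could be sketched roughly as follows. This is not Spark code and not the PR under discussion. The function name `localize`, the `fetchers` parameter, and the temp-directory layout are all hypothetical. Only http(s) is handled out of the box; an `s3://` URI would need a fetcher supplied by the caller (for example one built on an S3 client library, which is an assumption and is not shown here).

```python
import os
import shutil
import tempfile
import urllib.request
from urllib.parse import urlparse


def _fetch_http(uri, target):
    """Download an http(s) URI to the local file `target`."""
    with urllib.request.urlopen(uri) as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)


# Hypothetical default registry: scheme -> function(uri, target_path).
DEFAULT_FETCHERS = {"http": _fetch_http, "https": _fetch_http}


def localize(uri, dest_dir=None, fetchers=None):
    """Return a local path for `uri`, downloading it first if it is remote.

    Plain paths and file:// URIs are treated as already local. For any other
    scheme a matching fetcher must be registered, else ValueError is raised.
    """
    parsed = urlparse(uri)
    scheme = parsed.scheme.lower()
    if scheme == "":
        return os.path.abspath(uri)
    if scheme == "file":
        return os.path.abspath(parsed.path)

    handlers = dict(DEFAULT_FETCHERS)
    handlers.update(fetchers or {})
    if scheme not in handlers:
        raise ValueError("no fetcher registered for scheme %r" % scheme)

    # Keep the original file name so the script can still be referred to
    # by name (e.g. main.py) after it has been fetched.
    dest_dir = dest_dir or tempfile.mkdtemp(prefix="pyfiles-")
    target = os.path.join(dest_dir, os.path.basename(parsed.path))
    handlers[scheme](uri, target)
    return target
```

A submit wrapper could call `localize()` on each `--py-files` entry and on the main script on the host where the paths are finally evaluated, then pass the returned local paths on unchanged.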