Hi all!

I ran into this problem when trying to run a Python application on Amazon's
EMR YARN cluster.

Running the bundled example applications on EMR works, but I cannot figure
out how to run a slightly more complex Python application that depends on
other Python scripts. I tried adding those files with '--py-files'; that
works fine in local mode, but when run on EMR it fails with the following
message:
"Error: Only local python files are supported:
s3://pathtomybucket/mylibrary.py".

The simplest way to reproduce it locally:
bin/spark-submit --py-files s3://whatever.path.com/library.py main.py
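(For comparison, the same submit goes through when the dependency is given
as a plain local path, e.g., assuming library.py sits next to main.py:
bin/spark-submit --py-files library.py main.py
so it really does seem to be the s3:// path itself that spark-submit rejects.)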

The actual commands to run it on EMR:
#launch cluster
aws emr create-cluster --name SparkCluster --ami-version 3.3.1
--instance-type m1.medium --instance-count 2  --ec2-attributes
KeyName=key20141114 --log-uri s3://pathtomybucket/cluster_logs
--enable-debugging --use-default-roles  --bootstrap-action
Name=Spark,Path=s3://pathtomybucket/bootstrap-actions/spark/install-spark,Args=["-s","http://pathtomybucket/bootstrap-actions/spark","-l","WARN","-v","1.2","-b","2014121700","-x"]
#{
#   "ClusterId": "j-2Y58DME79MPQJ"
#}

#run application
aws emr add-steps --cluster-id "j-2Y58DME79MPQJ" --steps
ActionOnFailure=CONTINUE,Name=SparkPy,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://pathtomybucket/tasks/demo/main.py,main.py]
#{
#    "StepIds": [
#        "s-2UP4PP75YX0KU"
#    ]
#}
In the stderr of that step I get "Error: Only local python files are
supported: s3://pathtomybucket/tasks/demo/main.py".

What is the workaround, or the correct way to do this? Should I use Hadoop's
distcp to copy the dependency files from S3 to the nodes in a separate
pre-step, something like the sketch below?
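Something along these lines is what I have in mind (untested; I am guessing
that the hadoop binary lives at /home/hadoop/bin/hadoop and just using
/home/hadoop as the landing directory):

#pre-step: copy the python files from S3 onto the master node
aws emr add-steps --cluster-id "j-2Y58DME79MPQJ" --steps
ActionOnFailure=CONTINUE,Name=CopyPyFiles,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/bin/hadoop,fs,-copyToLocal,s3://pathtomybucket/mylibrary.py,s3://pathtomybucket/tasks/demo/main.py,/home/hadoop/]

#then submit with local paths only
aws emr add-steps --cluster-id "j-2Y58DME79MPQJ" --steps
ActionOnFailure=CONTINUE,Name=SparkPy,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,/home/hadoop/mylibrary.py,/home/hadoop/main.py]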

Regards, Vladimir
