Thank you, Andrew, for your reply! I am very interested in having this feature. It is possible to run PySpark on AWS EMR in client mode (https://aws.amazon.com/articles/4926593393724923), but that defeats the whole idea of running batch jobs on EMR with PySpark.
Could you please (help to) create a task (with some details of a possible implementation) for this feature? I'd like to implement it, but I'm too new to Spark to know how to do it in a good way...

-Vladimir

On Tue, Jan 20, 2015 at 8:40 PM, Andrew Or <and...@databricks.com> wrote:

> Hi Vladimir,
>
> Yes, as the error message suggests, PySpark currently only supports local
> files. This does not mean it only runs in local mode, however; you can
> still run PySpark on any cluster manager (though only in client mode). All
> this means is that your python files must be on your local file system.
> Until this is supported, the straightforward workaround is to just
> copy the files to your local machine.
>
> -Andrew
>
> 2015-01-20 7:38 GMT-08:00 Vladimir Grigor <vladi...@kiosked.com>:
>
>> Hi all!
>>
>> I found this problem when I tried running a python application on
>> Amazon's EMR yarn cluster.
>>
>> It is possible to run the bundled example applications on EMR, but I
>> cannot figure out how to run a slightly more complex python application
>> which depends on some other python scripts. I tried adding those files
>> with '--py-files' and it works fine in local mode, but it fails with the
>> following message when run in EMR:
>> "Error: Only local python files are supported:
>> s3://pathtomybucket/mylibrary.py".
>>
>> Simplest way to reproduce locally:
>> bin/spark-submit --py-files s3://whatever.path.com/library.py main.py
>>
>> Actual commands to run it in EMR:
>> # launch cluster
>> aws emr create-cluster --name SparkCluster --ami-version 3.3.1
>> --instance-type m1.medium --instance-count 2 --ec2-attributes
>> KeyName=key20141114 --log-uri s3://pathtomybucket/cluster_logs
>> --enable-debugging --use-default-roles --bootstrap-action
>> Name=Spark,Path=s3://pathtomybucket/bootstrap-actions/spark/install-spark,Args=["-s","http://pathtomybucket/bootstrap-actions/spark","-l","WARN","-v","1.2","-b","2014121700","-x"]
>> # {
>> #   "ClusterId": "j-2Y58DME79MPQJ"
>> # }
>>
>> # run application
>> aws emr add-steps --cluster-id "j-2Y58DME79MPQJ" --steps
>> ActionOnFailure=CONTINUE,Name=SparkPy,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://pathtomybucket/tasks/demo/main.py,main.py]
>> # {
>> #   "StepIds": [
>> #     "s-2UP4PP75YX0KU"
>> #   ]
>> # }
>>
>> In the stderr of that step I get "Error: Only local python files are
>> supported: s3://pathtomybucket/tasks/demo/main.py".
>>
>> What is the workaround or correct way to do it? Using hadoop's distcp to
>> copy dependency files from s3 to the nodes as another pre-step?
>>
>> Regards, Vladimir
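P.S. To make the copy-to-local workaround concrete, here is a rough, untested sketch of the pre-step wrapper script I have in mind. The s3 paths and file names are just the placeholders from my example above, and I'm assuming the hadoop client on the EMR master node can read s3:// URIs directly:

#!/bin/bash
# Untested sketch: fetch the python dependencies from S3 onto the
# master node's local filesystem, then submit in client mode with
# local paths only, since that is all PySpark accepts right now.
hadoop fs -copyToLocal s3://pathtomybucket/mylibrary.py /home/hadoop/mylibrary.py
hadoop fs -copyToLocal s3://pathtomybucket/tasks/demo/main.py /home/hadoop/main.py
/home/hadoop/spark/bin/spark-submit \
  --master yarn-client \
  --py-files /home/hadoop/mylibrary.py \
  /home/hadoop/main.py

Would registering a script like this via script-runner.jar be a reasonable interim approach, or is there a cleaner way?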