Thank you, Andrew, for your reply!

I am very interested in having this feature. It is possible to run PySpark on
AWS EMR in client mode (https://aws.amazon.com/articles/4926593393724923),
but that defeats the whole purpose of running PySpark batch jobs on EMR.
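
For reference, what does work today is roughly the following, i.e. submitting
in client mode with the dependencies already on the local file system (a
sketch only; all paths below are placeholders):

#client mode, local --py-files (placeholder paths)
/home/hadoop/spark/bin/spark-submit --master yarn-client \
  --py-files /home/hadoop/library.py /home/hadoop/main.py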

Could you please (help to) create a task (with some details of a possible
implementation) for this feature? I would like to implement it, but I am too
new to Spark to know how to do it well...
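
In the meantime I will go with the workaround you suggested: copy the
dependencies from S3 to the local file system in a preceding step and then
point --py-files at the local copies. Roughly (a sketch only; the bucket and
paths are placeholders):

#pre-step: pull the dependencies from S3 onto the master node
aws s3 cp s3://pathtomybucket/mylibrary.py /home/hadoop/mylibrary.py
aws s3 cp s3://pathtomybucket/tasks/demo/main.py /home/hadoop/main.py
#then submit with local paths only
/home/hadoop/spark/bin/spark-submit --deploy-mode cluster --master yarn-cluster \
  --py-files /home/hadoop/mylibrary.py /home/hadoop/main.py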

-Vladimir

On Tue, Jan 20, 2015 at 8:40 PM, Andrew Or <and...@databricks.com> wrote:

> Hi Vladimir,
>
> Yes, as the error message suggests, PySpark currently only supports local
> files. This does not mean it only runs in local mode, however; you can
> still run PySpark on any cluster manager (though only in client mode). All
> this means is that your Python files must be on your local file system.
> Until this is supported, the straightforward workaround is to simply copy
> the files to your local machine.
>
> -Andrew
>
> 2015-01-20 7:38 GMT-08:00 Vladimir Grigor <vladi...@kiosked.com>:
>
>> Hi all!
>>
>> I ran into this problem when I tried to run a Python application on
>> Amazon's EMR YARN cluster.
>>
>> It is possible to run the bundled example applications on EMR, but I cannot
>> figure out how to run a slightly more complex Python application that
>> depends on some other Python scripts. I tried adding those files with
>> '--py-files'; it works fine in local mode, but it fails with the
>> following message when run on EMR:
>> "Error: Only local python files are supported:
>> s3://pathtomybucket/mylibrary.py".
>>
>> The simplest way to reproduce this locally:
>> bin/spark-submit --py-files s3://whatever.path.com/library.py main.py
>>
>> The actual commands to run it on EMR:
>> #launch cluster
>> aws emr create-cluster --name SparkCluster --ami-version 3.3.1
>> --instance-type m1.medium --instance-count 2 --ec2-attributes
>> KeyName=key20141114 --log-uri s3://pathtomybucket/cluster_logs
>> --enable-debugging --use-default-roles --bootstrap-action
>> Name=Spark,Path=s3://pathtomybucket/bootstrap-actions/spark/install-spark,Args=["-s","http://pathtomybucket/bootstrap-actions/spark","-l","WARN","-v","1.2","-b","2014121700","-x"]
>> #{
>> #   "ClusterId": "j-2Y58DME79MPQJ"
>> #}
>>
>> #run application
>> aws emr add-steps --cluster-id "j-2Y58DME79MPQJ" --steps
>> ActionOnFailure=CONTINUE,Name=SparkPy,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://pathtomybucket/tasks/demo/main.py,main.py]
>> #{
>> #    "StepIds": [
>> #        "s-2UP4PP75YX0KU"
>> #    ]
>> #}
>> In the stderr of that step I get "Error: Only local python files are
>> supported: s3://pathtomybucket/tasks/demo/main.py".
>>
>> What is the workaround or the correct way to do this? Should I use Hadoop's
>> distcp to copy the dependency files from S3 to the nodes in a preceding step?
>>
>> Regards, Vladimir
>>
>
>
