A quick thought on this: I think this is distro-dependent as well, right? We ran into a similar issue in https://issues.apache.org/jira/browse/BIGTOP-1546, where it looked like the Python libraries might be overwritten on launch.
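
One pattern that has worked for me on a cluster is to zip the whole packages (not the individual .py files) and ship the zip with --py-files, then use absolute package imports everywhere. A rough sketch below, assuming the project/foo and project/bar layout from the message that follows; the Foo/Bar class names, the deps.zip filename, and the parallelize example are just placeholders, not a tested recipe:

    # Hypothetical layout, taken from the message below:
    #   project/foo/__init__.py, project/foo/foo.py
    #   project/bar/__init__.py, project/bar/bar.py
    #   project/execute/execute.py
    #
    # Build a zip from the project root so the package paths survive, then submit:
    #   cd project && zip -r deps.zip foo bar
    #   spark-submit --py-files deps.zip execute/execute.py
    #
    # execute/execute.py
    from pyspark import SparkContext

    from foo.foo import Foo   # placeholder class names; use absolute package
    from bar.bar import Bar   # imports rather than "from foo.foo import *"

    if __name__ == "__main__":
        sc = SparkContext(appName="execute")
        # the same zip can also be shipped programmatically instead of --py-files:
        # sc.addPyFile("deps.zip")
        rdd = sc.parallelize(range(10)).map(lambda x: str(Foo(x)))  # placeholder usage
        print(rdd.collect())
        sc.stop()

Whether you launch with pyspark (the interactive shell) or spark-submit (batch jobs), --py-files should put the zip on the Python path of the executors, which is usually the piece that's missing when a job works locally but not on the cluster.
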
On Tue, Nov 25, 2014 at 3:09 PM, Chengi Liu <chengi.liu...@gmail.com> wrote:
> Hi,
> I have written a few data structures as classes, like the following.
>
> So, here is my code structure:
>
> project/foo/foo.py, __init__.py
>        /bar/bar.py, __init__.py      (bar.py imports foo as: from foo.foo import *)
>        /execute/execute.py           (execute.py imports bar as: from bar.bar import *)
>
> Ultimately I am executing execute.py as
>
>     pyspark execute.py
>
> and this works fine locally, but as soon as I submit it on the cluster I see
> missing-module errors.
> I tried to send each and every file using the --py-files flag (foo.py, bar.py,
> and other helper files), but even then it complains that the module is not
> found. So, the question is: when one is building a library which is supposed
> to execute on top of Spark, how should the imports and the library be
> structured so that it works fine on Spark?
> When should one use pyspark and when should one use spark-submit to execute
> Python scripts/modules?
> Bonus points if someone can point to an example library and how to run it :)
> Thanks

--
jay vyas