A quick thought on this: I think this is distro-dependent as well, right?
We ran into a similar issue in
https://issues.apache.org/jira/browse/BIGTOP-1546, where it looked like the
Python libraries might be overwritten on launch.
On Tue, Nov 25, 2014 at 3:09 PM, Chengi Liu chengi.liu...@gmail.com wrote:
Hi,
I have written a few data structures as classes, like the following.
So, here is my code structure:
project/foo/foo.py, __init__.py
       /bar/bar.py, __init__.py      (bar.py imports foo as: from foo.foo import *)
       /execute/execute.py           (execute.py imports bar as: from bar.bar import *)
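A runnable sketch of the layout described above (the file contents, the greet/shout functions, and the use of a temp directory are all invented for illustration). Note it uses an explicit absolute import instead of `from foo.foo import *`; explicit package-qualified imports tend to resolve the same way on the driver and on the executors:

```python
# Hypothetical sketch: recreate the described package layout in a temp
# directory and check that explicit package imports resolve.
import os
import sys
import tempfile

root = tempfile.mkdtemp()
pkg = os.path.join(root, "project")

# project/, project/foo/, project/bar/ each get an __init__.py so they
# are importable as packages.
for sub in ("foo", "bar"):
    os.makedirs(os.path.join(pkg, sub))
    open(os.path.join(pkg, sub, "__init__.py"), "w").close()
open(os.path.join(pkg, "__init__.py"), "w").close()

with open(os.path.join(pkg, "foo", "foo.py"), "w") as f:
    f.write("def greet():\n    return 'hello from foo'\n")

with open(os.path.join(pkg, "bar", "bar.py"), "w") as f:
    # Explicit absolute import instead of `from foo.foo import *`.
    f.write(
        "from project.foo.foo import greet\n"
        "def shout():\n"
        "    return greet().upper()\n"
    )

# The directory *containing* project/ must be on sys.path; on a cluster
# this is what shipping the package as one archive accomplishes.
sys.path.insert(0, root)

from project.bar.bar import shout
print(shout())  # HELLO FROM FOO
```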
Ultimately I am executing execute.py as
pyspark execute.py
and this works fine locally, but as soon as I submit it on the cluster I see
"module missing" errors.
I tried to send each and every file using the --py-files flag (foo.py, bar.py)
and the other helper files, but even then it complains that the module is not
found. So, the question is: when one is building a library that is supposed to
execute on top of Spark, how should the imports and the library be structured
so that it works correctly on Spark?
When should one use pyspark, and when spark-submit, to execute Python
scripts/modules?
Bonus points if one can point to an example library and how to run it :)
Thanks
--
jay vyas