I managed to put together a first Spark application on top of my existing
codebase.
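
For concreteness, the application boils down to something like the sketch
below. legacy_module and its process() function are placeholders for a
coarse-grained entry point in my real code, and the HDFS paths are made up:

    # Rough shape of the application; legacy_module.process stands in
    # for a coarse-grained function in my existing codebase.
    from pyspark import SparkContext

    import legacy_module  # my existing package, importable on the driver

    sc = SparkContext(appName="legacy-app")  # master comes from the usual config

    records = sc.textFile("hdfs:///data/input")    # placeholder input path
    results = records.map(lambda line: legacy_module.process(line))
    results.saveAsTextFile("hdfs:///data/output")  # placeholder output path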

But I am still puzzled by the best way to deploy legacy Python code.

Can't I just put my codebase in some directory on the slave machines?

Existing solutions:

1. Rewrite everything in terms of Spark primitives (map, reduce, filter,
count, etc.). But I have a large legacy codebase, and I need to call some
coarse-grained functionality inside it (as in the sketch above). Rewriting
would take too much development time. Also, complex functionality is most
easily expressed, and runs fastest, as ordinary code rather than as a
combination of hundreds of Spark primitives.

2. Bundle it as a zip and use the pyFiles parameter on SparkContext (see the
first sketch after this list). But some of the code loads resources relative
to its own code path, so ordinary file access fails when the source code is
inside a zip file. In any case, it seems inefficient to transmit a large
codebase with every distributed invocation.

3. Put all the code in Python's site-packages directory. But that directory
is better suited to pip-installed packages than to my own code.

4. Set PYTHONPATH on the slaves (see the second sketch after this list). But
the Python worker processes do not seem to pick it up.
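
For reference, here is roughly how I am passing the zip in option 2. The
archive name and master URL are placeholders:

    from pyspark import SparkContext

    sc = SparkContext(
        master="spark://master:7077",     # placeholder master URL
        appName="legacy-app",
        pyFiles=["legacy_codebase.zip"],  # shipped to each worker, put on sys.path
    )

    # Code inside the zip that opens files relative to __file__, e.g.
    #   open(os.path.join(os.path.dirname(__file__), "resource.dat"))
    # fails on the workers, because __file__ then points into the archive.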
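
And this is what I mean by option 4: the codebase is unpacked at
/opt/legacy_codebase on every slave (a placeholder path), and PYTHONPATH is
exported to point at it in each slave's environment (e.g. in
conf/spark-env.sh). I assume something like the following SparkConf setting
is the programmatic equivalent, though I do not know whether the Python
workers honour it either:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("spark://master:7077")   # placeholder
            .setAppName("legacy-app")
            .set("spark.executorEnv.PYTHONPATH", "/opt/legacy_codebase"))
    sc = SparkContext(conf=conf)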

What is the best approach for deploying a large legacy Python codebase?

Thanks,

Joshua
