I managed to put together a first Spark application on top of my existing codebase.
But I am still puzzled about the best way to deploy legacy Python code. Can't I just put my codebase in some directory on the slave machines?

Existing solutions:

1. Rewrite everything in terms of Spark primitives (map, reduce, filter, count, etc.). But I have a large legacy codebase, and I need to call some coarse-grained functionality inside it; rewriting would take too much development time. Also, complex functionality is most easily expressed, and runs fastest, as ordinary code rather than as a combination of hundreds of Spark primitives.

2. Bundle the code as a zip and pass it via the pyFiles parameter on SparkContext. But some of the code loads resources relative to its own code path, so ordinary file access fails when the source code is inside a zip. In any case, it seems inefficient to retransmit a large codebase on every distributed invocation.

3. Put all the code in Python's site-packages directory. But that directory is better suited to pip-installed packages than to my own code.

4. Set PYTHONPATH on the slaves. But the Python worker processes do not seem to pick this up.

What is the best approach for large legacy Python codebases?

Thanks,
Joshua
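P.S. To make the failure in point 2 concrete, here is a small self-contained sketch (package and resource names are made up for illustration). Path-based file access relative to `__file__` breaks once the package is imported from a zip, while `pkgutil.get_data`, which goes through the package's loader, works in both layouts:

```python
import os
import pkgutil
import sys
import tempfile
import zipfile

# Build a toy package inside a zip, roughly the way pyFiles ships code.
# "legacypkg" and "resources/config.txt" are illustrative names only.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "codebase.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("legacypkg/__init__.py", "")
    zf.writestr("legacypkg/resources/config.txt", "threshold=42\n")

sys.path.insert(0, zip_path)
import legacypkg

# Plain filesystem access fails: the "path" points inside the zip archive,
# so it does not exist as a real file on disk.
res_path = os.path.join(
    os.path.dirname(legacypkg.__file__), "resources", "config.txt"
)
print(os.path.exists(res_path))  # False when imported from a zip

# pkgutil.get_data asks the package's loader for the bytes, so it works
# whether the package lives in a directory or in a zip.
data = pkgutil.get_data("legacypkg", "resources/config.txt")
print(data)  # b'threshold=42\n'
```

So the zip approach seems workable only for code that already uses loader-aware resource access, which my legacy codebase does not.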