I have two Python modules/scripts, task.py and runner.py. The first one (task.py) is a small Spark job and works perfectly well on its own. However, when it is called from runner.py with exactly the same arguments, it fails with nothing but a useless message, both in the terminal and in the worker logs.
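For context, both scripts have a pyspark shebang and are launched directly, roughly like this:

    ./task.py      # works fine, prints the line count
    ./runner.py    # fails with the exception below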
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)

Below is the code for both task.py and runner.py:

task.py
-----------

#!/usr/bin/env pyspark

from __future__ import print_function
from pyspark import SparkContext


def process(line):
    return line.strip()


def main(spark_master, path):
    sc = SparkContext(spark_master, 'My Job')
    rdd = sc.textFile(path)
    rdd = rdd.map(process)  # this line causes trouble when called from runner.py
    count = rdd.count()
    print(count)


if __name__ == '__main__':
    main('spark://spark-master-host:7077',
         'hdfs://hdfs-namenode-host:8020/path/to/file.log')

runner.py
-------------

#!/usr/bin/env pyspark

import task

if __name__ == '__main__':
    task.main('spark://spark-master-host:7077',
              'hdfs://hdfs-namenode-host:8020/path/to/file.log')

-------------------------------------------------------------------------------------------

So, what is the difference between calling a PySpark-enabled script directly and calling it as a Python module? What are good rules for writing multi-module Python programs with Spark?

Thanks,
Andrei