GitHub user mattf opened a pull request: https://github.com/apache/spark/pull/2313
[SPARK-927] detect numpy at time of use

It is possible for numpy to be installed on the driver node but not on
the worker nodes. In such a case, detecting numpy in the RDDSampler's
constructor leads to failures on the workers, as they cannot import
numpy (see below). The solution here is to detect numpy right before it
is used on the workers (a sketch of this pattern follows the commit
summary at the end of this message).

Example code & error:

    yum install -y numpy
    pyspark
    >>> sc.parallelize(range(10000)).sample(False, .1).collect()
    ...
    14/09/07 10:50:01 WARN TaskSetManager: Lost task 4.0 in stage 0.0 (TID 4, node4): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/home/test/spark/dist/python/pyspark/worker.py", line 79, in main
        serializer.dump_stream(func(split_index, iterator), outfile)
      File "/home/test/spark/dist/python/pyspark/serializers.py", line 196, in dump_stream
        self.serializer.dump_stream(self._batched(iterator), stream)
      File "/home/test/spark/dist/python/pyspark/serializers.py", line 127, in dump_stream
        for obj in iterator:
      File "/home/test/spark/dist/python/pyspark/serializers.py", line 185, in _batched
        for item in iterator:
      File "/home/test/spark/dist/python/pyspark/rddsampler.py", line 115, in func
        if self.getUniformSample(split) <= self._fraction:
      File "/home/test/spark/dist/python/pyspark/rddsampler.py", line 57, in getUniformSample
        self.initRandomGenerator(split)
      File "/home/test/spark/dist/python/pyspark/rddsampler.py", line 42, in initRandomGenerator
        import numpy
    ImportError: No module named numpy

        org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
        org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
        org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
        org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        org.apache.spark.scheduler.Task.run(Task.scala:54)
        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        java.lang.Thread.run(Thread.java:744)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mattf/spark SPARK-927

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2313.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2313

----

commit 7e8a11ce99bdb32bc86de96497036b731bd41e47
Author: Matthew Farrellee <m...@redhat.com>
Date: 2014-09-07T15:26:55Z

    [SPARK-927] detect numpy at time of use
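The time-of-use detection described above can be sketched as follows.
This is a minimal, illustrative version only: the RDDSampler,
initRandomGenerator, and getUniformSample names follow
pyspark/rddsampler.py, but the method bodies (in particular the
per-split seeding via XOR) are simplified assumptions, not the merged
diff.

    # Minimal sketch of "detect numpy at time of use"; names mirror
    # pyspark/rddsampler.py, bodies are illustrative assumptions.
    import random


    class RDDSampler(object):
        def __init__(self, withReplacement, fraction, seed=None):
            # Deliberately no numpy probe here: __init__ runs on the
            # driver, so a driver-side check proves nothing about the
            # workers.
            self._seed = seed if seed is not None else random.randint(0, 2 ** 32 - 1)
            self._withReplacement = withReplacement
            self._fraction = fraction
            self._random = None

        def initRandomGenerator(self, split):
            # Probe for numpy on the worker that actually runs the task.
            # A worker without numpy falls back to the stdlib generator
            # instead of dying with the ImportError in the traceback
            # above. (Seeding by XOR-ing the split index is just an
            # illustrative way to get a per-partition stream.)
            try:
                import numpy
                self._random = numpy.random.RandomState(self._seed ^ split)
            except ImportError:
                self._random = random.Random(self._seed ^ split)

        def getUniformSample(self, split):
            if self._random is None:
                self.initRandomGenerator(split)
            # RandomState and random.Random expose the same uniform
            # [0, 1) draw under different names.
            if hasattr(self._random, "random_sample"):
                return self._random.random_sample()
            return self._random.random()

On a cluster where every worker does have numpy, the probe succeeds on
the first draw per partition and behavior is unchanged; where numpy is
missing, sampling still completes, just with the stdlib generator
instead.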