GitHub user mattf opened a pull request:

    https://github.com/apache/spark/pull/2313

    [SPARK-927] detect numpy at time of use

    It is possible for numpy to be installed on the driver node but not on the
    worker nodes. In that case, detecting numpy in the RDDSampler constructor
    leads to failures on the workers, which cannot import numpy (see below).
    
    The fix is to detect numpy right before it is used on the workers.
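    
    For illustration, a minimal sketch of the lazy-detection idea (the
    LazySampler name and the seed handling here are simplifications, not the
    actual rddsampler.py diff) - the numpy probe moves out of the constructor,
    which runs on the driver, and into the method a worker calls right before
    sampling:
    
    import random
    
    class LazySampler(object):
        def __init__(self, seed=0):
            # no numpy probe here: __init__ runs on the driver, where numpy
            # may be present even though the workers lack it
            self._seed = seed
            self._rng = None
    
        def initRandomGenerator(self, split):
            # probe for numpy on the worker, right before sampling starts
            try:
                import numpy.random
                self._rng = numpy.random.RandomState(self._seed + split).random_sample
            except ImportError:
                self._rng = random.Random(self._seed + split).random
    
        def getUniformSample(self, split):
            if self._rng is None:
                self.initRandomGenerator(split)
            return self._rng()
    
    With the probe on the worker, a node without numpy falls back to the
    stdlib generator instead of failing the task.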
    
    Example code & error:
    
    yum install -y numpy    # on the driver node only; workers are left without numpy
    pyspark
    >>> sc.parallelize(range(10000)).sample(False, .1).collect()
    ...
    14/09/07 10:50:01 WARN TaskSetManager: Lost task 4.0 in stage 0.0 (TID 4, node4): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/home/test/spark/dist/python/pyspark/worker.py", line 79, in main
        serializer.dump_stream(func(split_index, iterator), outfile)
      File "/home/test/spark/dist/python/pyspark/serializers.py", line 196, in dump_stream
        self.serializer.dump_stream(self._batched(iterator), stream)
      File "/home/test/spark/dist/python/pyspark/serializers.py", line 127, in dump_stream
        for obj in iterator:
      File "/home/test/spark/dist/python/pyspark/serializers.py", line 185, in _batched
        for item in iterator:
      File "/home/test/spark/dist/python/pyspark/rddsampler.py", line 115, in func
        if self.getUniformSample(split) <= self._fraction:
      File "/home/test/spark/dist/python/pyspark/rddsampler.py", line 57, in getUniformSample
        self.initRandomGenerator(split)
      File "/home/test/spark/dist/python/pyspark/rddsampler.py", line 42, in initRandomGenerator
        import numpy
    ImportError: No module named numpy
            org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
            org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
            org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
            org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
            org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
            org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
            org.apache.spark.scheduler.Task.run(Task.scala:54)
            org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
            java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
            java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
            java.lang.Thread.run(Thread.java:744)
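    
    One way to check whether the workers themselves can import numpy,
    independent of this change (assumes a running SparkContext bound to `sc`,
    as in the shell session above):
    
    >>> sc.parallelize(range(4), 4).map(lambda _: __import__("numpy").__version__).collect()
    
    On a cluster where only the driver has numpy, this raises the same
    "ImportError: No module named numpy" from the workers.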

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mattf/spark SPARK-927

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2313.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2313
    
----
commit 7e8a11ce99bdb32bc86de96497036b731bd41e47
Author: Matthew Farrellee <m...@redhat.com>
Date:   2014-09-07T15:26:55Z

    [SPARK-927] detect numpy at time of use
    

----

