Python couldn't find your module on the worker that ran the task. Is the tables package installed on each worker node? It needs to be present on every worker, not just on the driver.
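
A quick way to see which workers are missing it is to run a small job that attempts the import on the executors. A minimal sketch, assuming a live SparkContext named sc (probe_import is a hypothetical helper written for this check, not a PySpark API):

    def probe_import(_):
        # Runs on whichever executor picks up the task: try the import
        # there and report the host name plus whether it succeeded.
        import socket
        try:
            import tables  # the PyTables package the job needs
            ok = True
        except ImportError:
            ok = False
        return (socket.gethostname(), ok)

    # Use many small partitions so the probe lands on several executors.
    results = sc.parallelize(range(100), 20).map(probe_import).collect()
    print(sorted(set(results)))

Any host that comes back with False is missing the library.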
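
As for getting it onto the workers: as Davies notes below, pure-Python dependencies can be shipped with the job via spark-submit --py-files, but that route cannot carry compiled extensions, and tables (PyTables) wraps the HDF5 C library, so it really does have to be installed on every node. A rough sketch of the two options, with deps.zip as a placeholder archive of pure-Python modules:

    # Option 1 (required for C-extension packages such as PyTables):
    # install the library on every worker node, e.g. by running on each
    # node (command illustrative):
    #     pip install tables
    #
    # Option 2 (pure-Python dependencies only): ship them with the job,
    # either at submit time:
    #     spark-submit --py-files deps.zip my_job.py
    # or at runtime from the driver:
    sc.addPyFile("deps.zip")  # placeholder name; its modules become importable in tasks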
--- Original Message ---
From: "Davies Liu" <dav...@databricks.com>
Sent: January 22, 2015 9:12 PM
To: "Mohit Singh" <mohit1...@gmail.com>
Cc: user@spark.apache.org
Subject: Re: Using third party libraries in pyspark

You need to install these libraries on all the slaves, or submit them via spark-submit:

    spark-submit --py-files xxx

On Thu, Jan 22, 2015 at 11:23 AM, Mohit Singh <mohit1...@gmail.com> wrote:
> Hi,
> I might be asking something very trivial, but what's the recommended way
> of using third-party libraries?
> I am using tables to read an hdf5 format file, and here is the error
> trace:
>
>     print rdd.take(2)
>   File "/tmp/spark/python/pyspark/rdd.py", line 1111, in take
>     res = self.context.runJob(self, takeUpToNumLeft, p, True)
>   File "/tmp/spark/python/pyspark/context.py", line 818, in runJob
>     it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd,
>       javaPartitions, allowLocal)
>   File "/tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py",
>       line 538, in __call__
>   File "/tmp/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py",
>       line 300, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3
> in stage 0.0 (TID 3, srv-108-23.720.rdio):
> org.apache.spark.api.python.PythonException:
> Traceback (most recent call last):
>   File "/hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/worker.py",
>       line 90, in main
>     command = pickleSer._read_with_length(infile)
>   File "/hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py",
>       line 151, in _read_with_length
>     return self.loads(obj)
>   File "/hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/serializers.py",
>       line 396, in loads
>     return cPickle.loads(obj)
>   File "/hadoop/disk3/mapred/local/filecache/540/spark-assembly-1.2.0-hadoop2.3.0-cdh5.1.0.jar/pyspark/cloudpickle.py",
>       line 825, in subimport
>     __import__(name)
> ImportError: ('No module named tables', <function subimport at 0x47e1398>,
> ('tables',))
>
> Though import tables works fine in the local Python shell, it seems like
> everything is being pickled. Are we expected to send all the files as
> helper files? That doesn't seem right.
> Thanks
>
> --
> Mohit
>
> "When you want success as badly as you want the air, then you will get
> it. There is no other secret of success."
> -Socrates

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org