Hi all,

how can I make a module/class visible to sc.pickleFile? It seems to be missing from the environment even after an import in the driver PySpark context.

The module is available when writing, but reading in a SparkContext other than the one that wrote the file fails. The imports are the same in both. Any ideas how I can point Spark to the module, apart from the global import?

How I create it:

from scipy.sparse import csr, csr_matrix
import numpy as np

def get_csr(y):
   ...
   ..
   return csr_matrix(data, (row, col))

rdd = rdd1.map(lambda x: get_csr(x))

rdd.take(2)
[<1x150498 sparse matrix of type '<type 'numpy.float64'>' with 62 stored elements 
in Compressed Sparse Row format>,
<1x150498 sparse matrix of type '<type 'numpy.float64'>' with 84 stored elements 
in Compressed Sparse Row format>]

rdd.saveAsPickleFile(..)
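
Reading it back in a fresh SparkContext looks roughly like this (a minimal sketch; the path is just a placeholder for the one I actually use):

# new driver / new SparkContext, with the same imports as on the write side
from scipy.sparse import csr, csr_matrix
import numpy as np

rdd2 = sc.pickleFile("/some/path/csr_rdd")  # placeholder path
rdd2.take(2)  # this is where the ImportError below is raised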


Reading in a new SparkContext causes a "No module named scipy.sparse.csr" error (see below).
Loading the file in the same SparkContext where it was written works.
PYTHONPATH is set on all workers to the same local Anaconda distribution, and the local Anaconda of the particular worker that throws the error definitely has the module available.
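
A quick check along these lines (just a sketch, not verbatim what I ran) should show whether the executors can actually import the module the unpickler is looking for:

def try_import(_):
    # runs on the executors, not the driver
    try:
        from scipy.sparse import csr  # the module named in the ImportError
        return ["ok"]
    except ImportError as e:
        return [str(e)]

# spread a dummy range over several partitions so every executor is hit at least once
sc.parallelize(range(100), 10).mapPartitions(try_import).distinct().collect()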

File "/usr/local/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length return self.loads(obj) File "/usr/local/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads return pickle.loads(obj) ImportError: No module named scipy.sparse.csr



Thanks,
Fabian
