Hi all,

how can I make a module/class visible to sc.pickleFile? It seems to be missing from the environment even after an import in the driver PySpark context.

The module is available when writing, but reading in a SparkContext other than the one that wrote the file fails. The imports are the same in both. Any ideas how I can point Spark to the module, apart from the global import?

How I create it:

from scipy.sparse import csr, csr_matrix
import numpy as np

def get_csr(y):
   ...
   ..
   return csr_matrix(data, (row, col))

rdd = rdd1.map(lambda x: get_csr(x))

rdd.take(2)
[<1x150498 sparse matrix of type '<type 'numpy.float64'>' with 62 stored elements 
in Compressed Sparse Row format>,
<1x150498 sparse matrix of type '<type 'numpy.float64'>' with 84 stored elements 
in Compressed Sparse Row format>]

rdd.saveAsPickleFile(..)
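
Reading it back in a fresh SparkContext looks roughly like this (a minimal sketch; the path is just a placeholder for the one I actually use):

# new driver / new SparkContext, with the same imports as on the write side
from scipy.sparse import csr, csr_matrix
import numpy as np

rdd2 = sc.pickleFile("/some/path/csr_rdd")  # placeholder path
rdd2.take(2)  # this is where the ImportError below is raised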


Reading in a new SparkContext causes a "No module named scipy.sparse.csr" error (see below).
Loading the file in the same SparkContext where it was written works.
PYTHONPATH is set on all workers to the same local Anaconda distribution, and the local Anaconda of the particular worker that throws the error definitely has the module available.
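
A quick check along these lines (just a sketch, not verbatim what I ran) should show whether the executors can actually import the module the unpickler is looking for:

def try_import(_):
    # runs on the executors, not the driver
    try:
        from scipy.sparse import csr  # the module named in the ImportError
        return ["ok"]
    except ImportError as e:
        return [str(e)]

# spread a dummy range over several partitions so every executor is hit at least once
sc.parallelize(range(100), 10).mapPartitions(try_import).distinct().collect()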

File "/usr/local/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length return self.loads(obj) File "/usr/local/spark-1.6.0-bin-hadoop2.6/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads return pickle.loads(obj) ImportError: No module named scipy.sparse.csr



Thanks,
Fabian
