Hello fellow users,

1) Is there documentation, or are there guidelines, for understanding in which situations PySpark decides to pickle the functions I pass to the map method?
2) Are there best practices for avoiding pickling when sharing variables, etc.?
I have a situation where I want to pass methods to map; however, those methods use C++ libraries underneath, and PySpark decides to pickle the entire object and fails when trying to do so. I tried to use broadcast, but the moment I change my function to take additional parameters that must be passed through map, Spark decides to create an object and tries to serialize that. I could probably create a dummy function that just shares the variables and initializes things locally, and chain that to the map method, but it would be pretty awkward if I have to resort to that.

Here is my situation in code:

class Model(object):
    __metaclass__ = Singleton
    model_loaded = False
    mod = None

    @staticmethod
    def load(args):
        # load model
        ...

    @staticmethod
    def predict(input, args):
        if not Model.model_loaded:
            Model.load(args)
        return Model.mod.predict(input)

def spark_main():
    args = parse_args()
    lines = read()
    rdd = sc.parallelize(lines)
    # fails here with:
    # pickle.PicklingError: Could not serialize object:
    # TypeError: can't pickle thread.lock objects
    rdd = rdd.map(lambda x: Model.predict(x, args))

Thanks,
Naveen
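P.S. For concreteness, here is a minimal sketch of the kind of workaround I was describing, assuming a hypothetical load_model() wrapper around the C++ library (parse_args() and read() are the same placeholders as in my code above). The idea is to keep the loaded model in a worker-side module global, broadcast only the small args, and pass a top-level function to map, so that only the function and the broadcast handle get pickled, never the C++ object itself:

from pyspark import SparkContext

_model = None  # worker-side global; never pickled

def get_model(args):
    global _model
    if _model is None:
        # Hypothetical loader wrapping the C++ library; the object
        # (and its thread locks) is created on the executor, never
        # shipped from the driver.
        _model = load_model(args)
    return _model

def predict_one(x, args):
    return get_model(args).predict(x)

def spark_main():
    sc = SparkContext()
    args = parse_args()           # placeholder, as above
    args_bc = sc.broadcast(args)  # args are small and picklable
    lines = read()                # placeholder, as above
    rdd = sc.parallelize(lines)
    # Only predict_one and the broadcast handle end up in the pickled
    # closure; the C++ model stays local to each executor process.
    return rdd.map(lambda x: predict_one(x, args_bc.value)).collect()

It works in my quick tests, but it feels like boilerplate I shouldn't have to write, hence my question about best practices.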