Hello, I have a PySpark computation that relies on Pandas and NumPy. Currently, my inner loop iterates 2,000 times. I’m seeing the following show up in my profiling:

    74804/29102    0.204    0.000    2.173    0.000 <frozen importlib._bootstrap>:2234(_find_and_load)
    74804/29102    0.145    0.000    1.867    0.000 <frozen importlib._bootstrap>:2207(_find_and_load_unlocked)
    45704/29102    0.021    0.000    1.820    0.000 <frozen importlib._bootstrap>:313(_call_with_frames_removed)
    45702/29100    0.048    0.000    1.793    0.000 {built-in method __import__}

That is, there are over 10 apparently import-related calls for each iteration of my inner loop. Commenting out the content of my loop removes most of the calls, and the number of them seems to scale with the number of inner loop iterations, so I’m pretty sure these calls are indeed coming from there.

Further examination of the profile shows that the callers of these functions are inside Pandas, e.g. tseries.period.__getitem__(), which reads as follows:

    def __getitem__(self, key):
        getitem = self._data.__getitem__
        if np.isscalar(key):
            val = getitem(key)
            return Period(ordinal=val, freq=self.freq)
        else:
            if com.is_bool_indexer(key):
                key = np.asarray(key)

            result = getitem(key)
            if result.ndim > 1:
                # MPL kludge
                # values = np.asarray(list(values), dtype=object)
                # return values.reshape(result.shape)

                return PeriodIndex(result, name=self.name, freq=self.freq)

            return PeriodIndex(result, name=self.name, freq=self.freq)

Note that there are no import statements here, nor any calls to the functions above. My guess is that somehow PySpark’s pickle stuff is inserting them, e.g., around the self._data access.

This is single-node testing currently. At this scale, about 1/3 of the time is spent in these import functions.

Pandas and other modules are available on all workers either via the virtualenv or PYTHONPATH. I am not using --py-files.

Since the inner loop is performance-critical, I can’t have imports happening there. My question is: why are these import functions being called, and how can I avoid them?

Thanks for any help.

Reid
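
P.S. For anyone who wants to check the same thing, here is a minimal sketch of how the callers of an import-related function can be listed from a cProfile run. It is not my actual job: hypothetical_inner_step and hypothetical_inner_loop are made-up stand-ins, and the explicit __import__("json") call is only there so that something import-related shows up in the profile. The only library calls used are cProfile.Profile and pstats.Stats.print_callers from the standard library.

```python
import cProfile
import io
import json  # imported up front so the loop below only hits the module cache
import pstats


def hypothetical_inner_step(i):
    # Stand-in (assumption) for library code that triggers an import per call.
    module = __import__("json")
    return module.dumps(i)


def hypothetical_inner_loop(n):
    # Stand-in (assumption) for the performance-critical inner loop.
    return [hypothetical_inner_step(i) for i in range(n)]


profiler = cProfile.Profile()
profiler.enable()
hypothetical_inner_loop(2000)
profiler.disable()

# Collect the report in a string instead of writing straight to stdout.
buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
# The argument is a regex restriction on the profiled function's name;
# the report lists each matching function with the functions that called it.
stats.print_callers("__import__")
report = buf.getvalue()
print(report)
```

In the real job the same print_callers call, pointed at _find_and_load or __import__, is the kind of thing that shows which functions the import calls come from.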