Could you share a short script that reproduces this?

On Wed, Sep 2, 2015 at 2:10 PM, Priedhorsky, Reid <rei...@lanl.gov> wrote:
> Hello,
>
> I have a PySpark computation that relies on Pandas and NumPy. Currently, my
> inner loop iterates 2,000 times. I’m seeing the following show up in my
> profiling:
>
> 74804/29102    0.204    0.000    2.173    0.000 <frozen importlib._bootstrap>:2234(_find_and_load)
> 74804/29102    0.145    0.000    1.867    0.000 <frozen importlib._bootstrap>:2207(_find_and_load_unlocked)
> 45704/29102    0.021    0.000    1.820    0.000 <frozen importlib._bootstrap>:313(_call_with_frames_removed)
> 45702/29100    0.048    0.000    1.793    0.000 {built-in method __import__}
>
> That is, there are over 10 apparently import-related calls for each
> iteration of my inner loop. Commenting out the content of my loop removes
> most of the calls, and the number of them seems to scale with the number of
> inner loop iterations, so I’m pretty sure these calls are indeed coming from
> there.
>
> Further examination of the profile shows that the callers of these functions
> are inside Pandas, e.g. tseries.period.__getitem__(), which reads as
> follows:
>
> def __getitem__(self, key):
>     getitem = self._data.__getitem__
>     if np.isscalar(key):
>         val = getitem(key)
>         return Period(ordinal=val, freq=self.freq)
>     else:
>         if com.is_bool_indexer(key):
>             key = np.asarray(key)
>
>         result = getitem(key)
>         if result.ndim > 1:
>             # MPL kludge
>             # values = np.asarray(list(values), dtype=object)
>             # return values.reshape(result.shape)
>
>             return PeriodIndex(result, name=self.name, freq=self.freq)
>
>         return PeriodIndex(result, name=self.name, freq=self.freq)
>
> Note that there are no import statements here, nor any calls to the
> functions above. My guess is that somehow PySpark’s pickle stuff is
> inserting them, e.g., around the self._data access.
>
> This is single-node testing currently. At this scale, about 1/3 of the time
> is spent in these import functions.
>
> Pandas and other modules are available on all workers either via the
> virtualenv or PYTHONPATH. I am not using --py-files.
>
> Since the inner loop is performance-critical, I can’t have imports happening
> there. My question is, why are these import functions being called and how
> can I avoid them?
>
> Thanks for any help.
>
> Reid
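
For reference, here is a minimal sketch of the kind of reproduction script asked for above. It uses only the standard library (cProfile and pstats), so it runs without Spark or Pandas. The names hot_loop, import_related_calls and profile_loop are hypothetical, and the importlib.import_module() call inside the loop is only a stand-in for whatever triggers the imports in the real job; it is an assumption, not a claim about what Pandas or PySpark actually do. Replacing the loop body with the real Pandas code should show whether the import-related call count scales with the number of iterations.

```python
import cProfile
import importlib
import pstats


def hot_loop(iterations):
    """Hypothetical stand-in for the real inner loop."""
    total = 0
    for i in range(iterations):
        # Assumption: something in the loop body imports lazily on every
        # call. Swap this for the real Pandas code when reproducing.
        mod = importlib.import_module("json")
        total += len(mod.dumps(i))
    return total


def import_related_calls(stats):
    """Sum the call counts of profile entries in the import machinery."""
    count = 0
    # stats.stats maps (filename, lineno, funcname) to
    # (primitive calls, total calls, tottime, cumtime, callers).
    for (filename, _lineno, funcname), entry in stats.stats.items():
        if "importlib" in filename or "__import__" in funcname:
            count += entry[1]
    return count


def profile_loop(iterations):
    """Profile hot_loop() and return the collected pstats.Stats."""
    profiler = cProfile.Profile()
    profiler.enable()
    hot_loop(iterations)
    profiler.disable()
    return pstats.Stats(profiler)


if __name__ == "__main__":
    stats = profile_loop(2000)
    print("import-related calls:", import_related_calls(stats))
    # List the functions that call into the import machinery.
    stats.print_callers("_find_and_load")
```

Running it with two different iteration counts and comparing the totals is a quick way to check whether the imports really come from the loop body, and the print_callers() output names the functions that trigger them.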