Hello,

I have a PySpark computation that relies on Pandas and NumPy. Currently, my 
inner loop iterates 2,000 times. I’m seeing the following show up in my 
profiling:

74804/29102    0.204    0.000    2.173    0.000 <frozen importlib._bootstrap>:2234(_find_and_load)
74804/29102    0.145    0.000    1.867    0.000 <frozen importlib._bootstrap>:2207(_find_and_load_unlocked)
45704/29102    0.021    0.000    1.820    0.000 <frozen importlib._bootstrap>:313(_call_with_frames_removed)
45702/29100    0.048    0.000    1.793    0.000 {built-in method __import__}

That is, there are over 10 apparently import-related calls for each iteration 
of my inner loop. Commenting out the content of my loop removes most of the 
calls, and the number of them seems to scale with the number of inner loop 
iterations, so I’m pretty sure these calls are indeed coming from there.

Further examination of the profile shows that the callers of these functions 
are inside Pandas, e.g. tseries.period.__getitem__(), which reads as follows:

    def __getitem__(self, key):
        getitem = self._data.__getitem__
        if np.isscalar(key):
            val = getitem(key)
            return Period(ordinal=val, freq=self.freq)
        else:
            if com.is_bool_indexer(key):
                key = np.asarray(key)

            result = getitem(key)
            if result.ndim > 1:
                # MPL kludge
                # values = np.asarray(list(values), dtype=object)
                # return values.reshape(result.shape)

                return PeriodIndex(result, name=self.name, freq=self.freq)

            return PeriodIndex(result, name=self.name, freq=self.freq)

Note that there are no import statements here, nor any calls to the functions above. 
My guess is that somehow PySpark’s pickle stuff is inserting them, e.g., around 
the self._data access.
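
For anyone who wants to reproduce the caller lookup described above, here is a minimal sketch using only the standard-library cProfile and pstats modules. It is not my actual job: loop_body is a hypothetical stand-in that calls importlib.import_module explicitly, so that an import-related function is guaranteed to appear in the profile, and callers_report is a helper name made up for this example. In the real profile, the pattern would be something like "_find_and_load" instead.

```python
import cProfile
import importlib
import io
import pstats


def loop_body():
    # Hypothetical stand-in for the real inner loop. The explicit
    # import_module call goes through the import machinery on every
    # call, even though "json" is already cached in sys.modules.
    return importlib.import_module("json")


def callers_report(func, pattern, iterations=2000):
    """Profile `iterations` calls of `func` and return the pstats
    callers listing for functions whose name matches `pattern`."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(iterations):
        func()
    profiler.disable()

    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    # print_callers lists, for each matching function, who called it.
    stats.print_callers(pattern)
    return out.getvalue()


report = callers_report(loop_body, "import_module")
print(report)
```

The right-hand side of each row in the listing names the calling function, which is how the import calls in my profile were traced back to Pandas.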

This is single-node testing currently. At this scale, about 1/3 of the time is 
spent in these import functions.

Pandas and other modules are available on all workers either via the virtualenv 
or PYTHONPATH. I am not using --py-files.

Since the inner loop is performance-critical, I can’t have imports happening 
there. My question is, why are these import functions being called and how can 
I avoid them?

Thanks for any help.

Reid
