Could you share a short script that reproduces this?
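Something self-contained along these lines would help narrow it down
(a hypothetical sketch -- the PeriodIndex indexing is just a stand-in
for whatever your inner-loop body does):

    import cProfile

    import pandas as pd

    # Stand-in workload: repeatedly index a PeriodIndex, which is
    # where the import machinery shows up in the profile below.
    idx = pd.period_range('2000-01', periods=100, freq='M')

    def inner_loop(n=2000):
        total = 0
        for i in range(n):
            total += idx[i % len(idx)].ordinal
        return total

    cProfile.run('inner_loop()', sort='cumulative')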

On Wed, Sep 2, 2015 at 2:10 PM, Priedhorsky, Reid <rei...@lanl.gov> wrote:
> Hello,
>
> I have a PySpark computation that relies on Pandas and NumPy. Currently, my
> inner loop iterates 2,000 times. I’m seeing the following show up in my
> profiling:
>
> 74804/29102    0.204    0.000    2.173    0.000 <frozen importlib._bootstrap>:2234(_find_and_load)
> 74804/29102    0.145    0.000    1.867    0.000 <frozen importlib._bootstrap>:2207(_find_and_load_unlocked)
> 45704/29102    0.021    0.000    1.820    0.000 <frozen importlib._bootstrap>:313(_call_with_frames_removed)
> 45702/29100    0.048    0.000    1.793    0.000 {built-in method __import__}
>
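> For anyone wanting to reproduce the measurement: PySpark's built-in
> Python worker profiler produces cProfile output like the above. A
> minimal sketch of enabling it (this may differ from my exact setup):
>
>     from pyspark import SparkConf, SparkContext
>
>     conf = SparkConf().set("spark.python.profile", "true")
>     sc = SparkContext(conf=conf)
>
>     # Run the job containing the inner loop.
>     sc.parallelize(range(1000)).map(lambda x: x + 1).count()
>
>     # Dump the accumulated cProfile stats for each RDD.
>     sc.show_profiles()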
>
> That is, there are over 10 apparently import-related calls per
> iteration of my inner loop. Commenting out the body of the loop
> removes most of these calls, and their number scales with the number
> of iterations, so I'm fairly sure the calls really do come from
> there.
>
> Further examination of the profile shows that the callers of these
> functions are inside Pandas, e.g. PeriodIndex.__getitem__() in
> pandas.tseries.period, which reads as follows:
>
>     def __getitem__(self, key):
>         getitem = self._data.__getitem__
>         if np.isscalar(key):
>             val = getitem(key)
>             return Period(ordinal=val, freq=self.freq)
>         else:
>             if com.is_bool_indexer(key):
>                 key = np.asarray(key)
>
>             result = getitem(key)
>             if result.ndim > 1:
>                 # MPL kludge
>                 # values = np.asarray(list(values), dtype=object)
>                 # return values.reshape(result.shape)
>
>                 return PeriodIndex(result, name=self.name, freq=self.freq)
>
>             return PeriodIndex(result, name=self.name, freq=self.freq)
>
>
> Note that there are no import statements here, nor any calls to the
> functions above. My guess is that PySpark's pickling machinery is
> somehow inserting them, e.g., around the self._data access.
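>
> One way to confirm where the imports originate (a diagnostic sketch I
> haven't run inside the Spark job itself): wrap __import__ and log a
> short stack trace on every call.
>
>     import builtins
>     import traceback
>
>     _real_import = builtins.__import__
>
>     def noisy_import(name, *args, **kwargs):
>         # Extremely verbose; for debugging only.
>         print("import of %r from:" % name)
>         traceback.print_stack(limit=5)
>         return _real_import(name, *args, **kwargs)
>
>     builtins.__import__ = noisy_import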
>
> This is currently single-node testing. At this scale, about one third
> of the total time is spent in these import functions.
>
> Pandas and other modules are available on all workers either via the
> virtualenv or PYTHONPATH. I am not using --py-files.
>
> Since the inner loop is performance-critical, I can't have imports
> happening there. My question is: why are these import functions being
> called, and how can I avoid them?
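>
> The obvious workaround would be to hoist everything import-sensitive
> out of the loop, e.g. pulling the raw int64 ordinals out of the
> PeriodIndex once so the loop indexes a plain ndarray (untested
> sketch), but I'd rather understand the root cause:
>
>     import pandas as pd
>
>     idx = pd.period_range('2000-01', periods=100, freq='M')
>
>     # Hoist the ordinals out once; indexing a plain ndarray inside
>     # the loop avoids PeriodIndex.__getitem__ entirely.
>     ordinals = idx.asi8
>
>     total = 0
>     for i in range(2000):
>         total += ordinals[i % len(ordinals)]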
>
> Thanks for any help.
>
> Reid
>
