Re: large number of import-related function calls in PySpark profile

2015-09-03 Thread Priedhorsky, Reid

On Sep 2, 2015, at 11:31 PM, Davies Liu wrote:

Could you have a short script to reproduce this?

Good point. Here you go. This is Python 3.4.3 on Ubuntu 15.04.

import pandas as pd  # must be in default path for interpreter
import pyspark

LEN = 260
ITER_CT = 1

conf = pyspark.SparkConf()
conf.set('spark.python.profile', 'true')
sc = pyspark.SparkContext(conf=conf)

a = sc.broadcast(pd.Series(range(LEN)))

def map_(i):
    b = pd.Series(range(LEN))
    for i in range(ITER_CT):
        b.corr(a.value)
    return None

shards = sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
data = shards.map(map_)
data.collect()

sc.show_profiles()

Run as:

$ spark-submit --master local[4] demo.py
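
(Aside: if it helps to keep the stats around for offline inspection rather than only printing them with show_profiles(), the Python profiler can also dump per-RDD stats to a directory. A minimal sketch, with an illustrative path:)

```
import pyspark

conf = pyspark.SparkConf()
conf.set('spark.python.profile', 'true')
# Also dump the per-RDD cProfile stats to a directory before the driver
# exits, so they can be reloaded later with pstats; the path is illustrative.
conf.set('spark.python.profile.dump', '/tmp/pyspark-profile')
sc = pyspark.SparkContext(conf=conf)
```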

Here’s a profile excerpt:


Profile of 

Re: large number of import-related function calls in PySpark profile

2015-09-03 Thread Priedhorsky, Reid

On Sep 3, 2015, at 12:39 PM, Davies Liu wrote:

I think this is not a problem of PySpark; you will also see this if you
profile this script:

```
list(map(map_, range(sc.defaultParallelism)))
```

81777/80874    0.086    0.000    0.360    0.000 <frozen importlib._bootstrap>:2264(_handle_fromlist)

Thanks. Yes, I think you’re right; they seem to be coming from Pandas. Plain 
NumPy calculations do not generate the numerous import-related calls.

That said, I’m still not sure why the time consumed in my real program is so 
much more (~20% rather than ~1%). I will see if I can figure out a better test 
program, or maybe try a different approach.
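
(For reference, a minimal sketch of the non-Spark comparison suggested above, using cProfile directly. The constants and the body of map_ are taken from the reproduction script earlier in the thread; replacing the broadcast variable with a plain local Series is an assumption of this sketch:)

```
import cProfile
import pandas as pd

LEN = 260
ITER_CT = 1

# Stand-in for the broadcast variable a.value in the Spark version.
a_value = pd.Series(range(LEN))

def map_(i):
    b = pd.Series(range(LEN))
    for _ in range(ITER_CT):
        b.corr(a_value)
    return None

# The same importlib._bootstrap entries (_handle_fromlist, _find_and_load,
# etc.) should show up in this profile even though Spark is not involved.
cProfile.run('list(map(map_, range(4)))', sort='cumulative')
```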

Reid


large number of import-related function calls in PySpark profile

2015-09-02 Thread Priedhorsky, Reid
Hello,

I have a PySpark computation that relies on Pandas and NumPy. Currently, my 
inner loop iterates 2,000 times. I’m seeing the following show up in my 
profiling:

     ncalls  tottime  percall  cumtime  percall filename:lineno(function)
74804/29102    0.204    0.000    2.173    0.000 <frozen importlib._bootstrap>:2234(_find_and_load)
74804/29102    0.145    0.000    1.867    0.000 <frozen importlib._bootstrap>:2207(_find_and_load_unlocked)
45704/29102    0.021    0.000    1.820    0.000 <frozen importlib._bootstrap>:313(_call_with_frames_removed)
45702/29100    0.048    0.000    1.793    0.000 {built-in method __import__}

That is, there are over 10 apparently import-related calls for each iteration 
of my inner loop. Commenting out the content of my loop removes most of the 
calls, and the number of them seems to scale with the number of inner loop 
iterations, so I’m pretty sure these calls are indeed coming from there.

Further examination of the profile shows that the callers of these functions 
are inside Pandas, e.g. tseries.period.__getitem__(), which reads as follows:

def __getitem__(self, key):
    getitem = self._data.__getitem__
    if np.isscalar(key):
        val = getitem(key)
        return Period(ordinal=val, freq=self.freq)
    else:
        if com.is_bool_indexer(key):
            key = np.asarray(key)

        result = getitem(key)
        if result.ndim > 1:
            # MPL kludge
            # values = np.asarray(list(values), dtype=object)
            # return values.reshape(result.shape)

            return PeriodIndex(result, name=self.name, freq=self.freq)

        return PeriodIndex(result, name=self.name, freq=self.freq)

Note that there are no import statements here, nor any calls to the import 
functions above. My guess is that PySpark’s pickling machinery is somehow 
inserting them, e.g., around the self._data access.
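
(As an aside on method: one way to reproduce this kind of caller analysis offline is to load dumped profile stats with pstats. This is a hedged sketch that assumes spark.python.profile.dump was set to a directory, as in the earlier sketch; the file name below is illustrative:)

```
import pstats

# Load one of the per-RDD profile files dumped by the PySpark profiler
# (spark.python.profile.dump); the path and file name are illustrative.
stats = pstats.Stats('/tmp/pyspark-profile/rdd_1.pstats')
stats.sort_stats('cumulative')

# Show which functions call into the import machinery; with this workload
# the callers point into pandas internals such as tseries.period.
stats.print_callers('_find_and_load')
stats.print_callers('__import__')
```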

This is single-node testing currently. At this scale, about 1/3 of the time is 
spent in these import functions.

Pandas and other modules are available on all workers either via the virtualenv 
or PYTHONPATH. I am not using --py-files.

Since the inner loop is performance-critical, I can’t have imports happening 
there. My question is, why are these import functions being called and how can 
I avoid them?

Thanks for any help.

Reid