On Sep 3, 2015, at 12:39 PM, Davies Liu <dav...@databricks.com> wrote:

I think this is not a problem with PySpark; you would also see this if you
profile this script:

```
list(map(map_, range(sc.defaultParallelism)))
```

```
81777/80874    0.086    0.000    0.360    0.000 <frozen importlib._bootstrap>:2264(_handle_fromlist)
```
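For reference, here is a self-contained sketch of this reproduction that runs without Spark. The body of `map_` is an assumption standing in for the one defined earlier in the thread, and 8 stands in for `sc.defaultParallelism`:

```
import cProfile

# Hypothetical stand-in for the map_ defined earlier in the thread; a
# `from ... import` inside the function body runs the import machinery
# (and hence _handle_fromlist) on every call, even with pandas already
# cached in sys.modules.
def map_(i):
    from pandas import Series
    return Series(range(100)).sum()

# 8 stands in for sc.defaultParallelism; no SparkContext is needed to see
# the <frozen importlib._bootstrap> entries in the output.
cProfile.run("list(map(map_, range(8)))", sort="cumulative")
```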

Thanks. Yes, I think you’re right: the import-related calls seem to be coming 
from Pandas. Plain NumPy calculations do not generate these numerous calls.
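A quick way to check this is to profile comparable Pandas and NumPy loops side by side and look for the importlib entries. A sketch; whether and how often `_handle_fromlist` appears will depend on the Pandas version:

```
import cProfile

import numpy as np
import pandas as pd

# Compare the two profiles: look for <frozen importlib._bootstrap> lines
# (e.g. _handle_fromlist) in the Pandas run that are absent from the
# plain-NumPy run.
cProfile.run("for _ in range(1000): pd.Series(np.arange(10)).sum()")
cProfile.run("for _ in range(1000): np.arange(10).sum()")
```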

That said, I’m still not sure why these calls consume so much more time in my 
real program (~20% rather than the ~1% here). I will see if I can put together 
a better test program, or maybe try a different approach.

Reid
