On Sep 3, 2015, at 12:39 PM, Davies Liu <dav...@databricks.com> wrote:
> I think this is not a problem of PySpark; you also see this if you profile this script:
>
>     list(map(map_, range(sc.defaultParallelism)))
>
>     81777/80874  0.086  0.000  0.360  0.000  <frozen importlib._bootstrap>:2264(_handle_fromlist)

Thanks. Yes, I think you’re right; the calls seem to be coming from Pandas. Plain NumPy calculations do not generate the numerous import-related calls. That said, I’m still not sure why the time consumed in my real program is so much larger (~20% rather than ~1%). I will see if I can figure out a better test program, or maybe try a different approach.

Reid
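For anyone wanting to reproduce this outside PySpark, here is a minimal sketch of the profiling approach discussed above. It uses plain `cProfile` with a stand-in `map_` function and a stdlib `from ... import ...` (standing in for the Pandas imports) to show where `_handle_fromlist` calls come from; the range size `8` is a hypothetical substitute for `sc.defaultParallelism`.

```python
import cProfile
import io
import pstats

def map_(i):
    # A "from X import Y" executed inside the mapped function goes through
    # the import machinery on every call, even when X is already imported;
    # this is what produces the _handle_fromlist entries in the profile.
    from os import path  # stand-in for a pandas-style import
    return i

prof = cProfile.Profile()
prof.enable()
result = list(map(map_, range(8)))  # 8 stands in for sc.defaultParallelism
prof.disable()

# Show only the import-machinery frames in the profile output.
buf = io.StringIO()
pstats.Stats(prof, stream=buf).sort_stats("cumulative").print_stats("importlib")
print(buf.getvalue())
print(result)
```

Running this shows the same `<frozen importlib._bootstrap>` lines in the stats output, confirming the overhead is Python's import machinery rather than anything PySpark-specific.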