The slowness in PySpark may be related to the extra search-path entries added by PySpark. Could you show your sys.path?
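A quick way to compare the search path is a sketch like the following; the executor-side snippet in the comment is hypothetical and assumes an active SparkContext named `sc`:

```python
import sys

# PySpark prepends its own entries (e.g. pyspark.zip, py4j-*.zip) to the
# module search path; long or archive-based entries can slow every import.
# This dumps the driver-side path for comparison with a plain Python run.
for entry in sys.path:
    print(entry)

# Hypothetical executor-side check (assumes an active SparkContext `sc`):
#   sc.parallelize([0], 1).map(lambda _: sys.path).first()
```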
On Thu, Sep 3, 2015 at 1:38 PM, Priedhorsky, Reid <rei...@lanl.gov> wrote:
>
> On Sep 3, 2015, at 12:39 PM, Davies Liu <dav...@databricks.com> wrote:
>
> > I think this is not a problem of PySpark; you would also see this if you
> > profiled this script:
> >
> > ```
> > list(map(map_, range(sc.defaultParallelism)))
> > ```
> >
> > 81777/80874  0.086  0.000  0.360  0.000  <frozen importlib._bootstrap>:2264(_handle_fromlist)
>
> Thanks. Yes, I think you’re right; they seem to be coming from Pandas. Plain
> NumPy calculations do not generate the numerous import-related calls.
>
> That said, I’m still not sure why the time consumed in my real program is so
> much more (~20% rather than ~1%). I will see if I can figure out a better
> test program, or maybe try a different approach.
>
> Reid