I think this is not a problem of PySpark, you also saw this if you profile this script:
``` list(map(map_, range(sc.defaultParallelism))) ``` 81777/80874 0.086 0.000 0.360 0.000 <frozen importlib._bootstrap>:2264(_handle_fromlist) On Thu, Sep 3, 2015 at 11:16 AM, Priedhorsky, Reid <rei...@lanl.gov> wrote: > > On Sep 2, 2015, at 11:31 PM, Davies Liu <dav...@databricks.com> wrote: > > Could you have a short script to reproduce this? > > > Good point. Here you go. This is Python 3.4.3 on Ubuntu 15.04. > > import pandas as pd # must be in default path for interpreter > import pyspark > > LEN = 260 > ITER_CT = 10000 > > conf = pyspark.SparkConf() > conf.set('spark.python.profile', 'true') > sc = pyspark.SparkContext(conf=conf) > > a = sc.broadcast(pd.Series(range(LEN))) > > def map_(i): > b = pd.Series(range(LEN)) > for i in range(ITER_CT): > b.corr(a.value) > return None > > shards = sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism) > data = shards.map(map_) > data.collect() > > sc.show_profiles() > > > Run as: > > $ spark-submit --master local[4] demo.py > > > Here’s a profile excerpt: > > ============================================================ > Profile of RDD<id=1> > ============================================================ > 7160748 function calls (7040732 primitive calls) in 7.075 seconds > > Ordered by: internal time, cumulative time > > ncalls tottime percall cumtime percall filename:lineno(function) > 40000 0.551 0.000 1.908 0.000 function_base.py:1780(cov) > 120000 0.527 0.000 0.527 0.000 {method 'reduce' of > 'numpy.ufunc' objects} > 440020 0.493 0.000 0.493 0.000 {built-in method array} > [...] > 80012 0.069 0.000 0.095 0.000 <frozen > importlib._bootstrap>:2264(_handle_fromlist) > > > The import-related stuff is not quite the same: different function, and the > total amount of time it’s using is less. However, the number of calls still > seems quite high (80,000, which is 10× the number of loop iterations). > > Much appreciate your help. Let me know what other information would be > useful. > > Thanks, > Reid --------------------------------------------------------------------- To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional commands, e-mail: user-h...@spark.apache.org