On Sep 2, 2015, at 11:31 PM, Davies Liu <dav...@databricks.com> wrote:
> Could you have a short script to reproduce this?

Good point. Here you go. This is Python 3.4.3 on Ubuntu 15.04.

import pandas as pd  # must be in the default path for the interpreter
import pyspark

LEN = 260
ITER_CT = 10000

conf = pyspark.SparkConf()
conf.set('spark.python.profile', 'true')  # enable the PySpark worker profiler
sc = pyspark.SparkContext(conf=conf)

a = sc.broadcast(pd.Series(range(LEN)))

def map_(i):
    # Correlate a fresh local Series against the broadcast one many times,
    # so any per-call overhead dominates the profile.
    b = pd.Series(range(LEN))
    for _ in range(ITER_CT):
        b.corr(a.value)
    return None

shards = sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
data = shards.map(map_)
data.collect()
sc.show_profiles()

Run as:

$ spark-submit --master local[4] demo.py

Here’s a profile excerpt:

============================================================
Profile of RDD<id=1>
============================================================
         7160748 function calls (7040732 primitive calls) in 7.075 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    40000    0.551    0.000    1.908    0.000 function_base.py:1780(cov)
   120000    0.527    0.000    0.527    0.000 {method 'reduce' of 'numpy.ufunc' objects}
   440020    0.493    0.000    0.493    0.000 {built-in method array}
   [...]
    80012    0.069    0.000    0.095    0.000 <frozen importlib._bootstrap>:2264(_handle_fromlist)

The import-related entry isn’t quite what I saw before: it’s a different function, and it accounts for less total time. However, the call count still seems quite high: 80,000 works out to about 2 per corr() call, given the 40,000 total loop iterations (ITER_CT × 4 partitions), which matches the 40,000 cov() calls above. A standalone repro without Spark is in the P.S. below.

I appreciate your help. Let me know what other information would be useful.

Thanks,
Reid
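P.S. To help rule PySpark itself in or out, here’s a minimal sketch (my own addition, not verified on this exact setup) that profiles the same corr() loop under plain cProfile, with no Spark involved. If _handle_fromlist shows up here with a similar calls-per-iteration ratio, the import machinery is being hit inside the pandas/NumPy corr() path rather than by the Spark worker.

# Standalone repro sketch: profile the same corr() loop without Spark.
# Assumes only pandas is installed; numbers will differ from the Spark run.
import cProfile
import pstats

import pandas as pd

LEN = 260
ITER_CT = 10000

a = pd.Series(range(LEN))
b = pd.Series(range(LEN))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(ITER_CT):
    b.corr(a)
profiler.disable()

# Show the hottest entries; look for _handle_fromlist in the output.
pstats.Stats(profiler).sort_stats('tottime').print_stats(15)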