On Sep 2, 2015, at 11:31 PM, Davies Liu <dav...@databricks.com> wrote:

Could you have a short script to reproduce this?

Good point. Here you go. This is Python 3.4.3 on Ubuntu 15.04.

import pandas as pd  # must be in default path for interpreter
import pyspark

LEN = 260
ITER_CT = 10000

conf = pyspark.SparkConf()
conf.set('spark.python.profile', 'true')
sc = pyspark.SparkContext(conf=conf)

a = sc.broadcast(pd.Series(range(LEN)))

def map_(i):
    # repeatedly correlate a local Series against the broadcast one
    b = pd.Series(range(LEN))
    for _ in range(ITER_CT):
        b.corr(a.value)
    return None

shards = sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
data = shards.map(map_)
data.collect()

sc.show_profiles()

Run as:

$ spark-submit --master local[4] demo.py
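
(In case it helps to compare runs, I believe the profile stats can also be dumped to disk rather than printed, by pointing the spark.python.profile.dump setting at a directory; the resulting pstats files can then be inspected offline. Something like

conf.set('spark.python.profile.dump', '/tmp/spark_profiles')  # directory name is just an example

should do it, if I have the option name right.)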

Here’s a profile excerpt:

============================================================
Profile of RDD<id=1>
============================================================
         7160748 function calls (7040732 primitive calls) in 7.075 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    40000    0.551    0.000    1.908    0.000 function_base.py:1780(cov)
   120000    0.527    0.000    0.527    0.000 {method 'reduce' of 'numpy.ufunc' objects}
   440020    0.493    0.000    0.493    0.000 {built-in method array}
[...]
    80012    0.069    0.000    0.095    0.000 <frozen importlib._bootstrap>:2264(_handle_fromlist)

The import-related entry is not quite the same as before: it's a different function, and it accounts for less total time. However, the number of calls still seems quite high: 80,012, which works out to about two calls per corr() iteration (4 tasks × 10,000 iterations = 40,000 iterations in total).
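
If it would help narrow things down, the same loop can also be profiled outside Spark with plain cProfile, to check whether the _handle_fromlist calls come from pandas/numpy themselves rather than from the PySpark worker. A minimal standalone sketch (not part of the repro above):

import cProfile
import pandas as pd

LEN = 260
ITER_CT = 10000

a = pd.Series(range(LEN))

def loop():
    # same work as a single task in the Spark job
    b = pd.Series(range(LEN))
    for _ in range(ITER_CT):
        b.corr(a)

cProfile.run('loop()', sort='tottime')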

I much appreciate your help. Let me know what other information would be useful.

Thanks,
Reid
