I think this is not a problem in PySpark; you will see the same thing if you
profile this call locally, outside Spark:

```
list(map(map_, range(sc.defaultParallelism)))
```

```
81777/80874    0.086    0.000    0.360    0.000 <frozen importlib._bootstrap>:2264(_handle_fromlist)
```
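
For a self-contained local run (no Spark), a sketch like the following should
reproduce it; this reuses map_/LEN/ITER_CT from your script below, with a
plain Series standing in for the broadcast value and 4 standing in for
sc.defaultParallelism:

```
# sketch: profile the same work without Spark (assumes pandas is
# installed and that this is run as a standalone script)
import cProfile
import pandas as pd

LEN = 260
ITER_CT = 10000

a_value = pd.Series(range(LEN))  # stands in for the broadcast value

def map_(i):
    b = pd.Series(range(LEN))
    for _ in range(ITER_CT):
        b.corr(a_value)

# 4 stands in for sc.defaultParallelism
cProfile.run('list(map(map_, range(4)))', sort='tottime')
```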



On Thu, Sep 3, 2015 at 11:16 AM, Priedhorsky, Reid <rei...@lanl.gov> wrote:
>
> On Sep 2, 2015, at 11:31 PM, Davies Liu <dav...@databricks.com> wrote:
>
> Could you have a short script to reproduce this?
>
>
> Good point. Here you go. This is Python 3.4.3 on Ubuntu 15.04.
>
> import pandas as pd  # must be in default path for interpreter
> import pyspark
>
> LEN = 260
> ITER_CT = 10000
>
> conf = pyspark.SparkConf()
> conf.set('spark.python.profile', 'true')
> sc = pyspark.SparkContext(conf=conf)
>
> a = sc.broadcast(pd.Series(range(LEN)))
>
> def map_(i):
>     b = pd.Series(range(LEN))
>     for _ in range(ITER_CT):  # '_' avoids shadowing the task index argument
>         b.corr(a.value)
>     return None
>
> shards = sc.parallelize(range(sc.defaultParallelism), sc.defaultParallelism)
> data = shards.map(map_)
> data.collect()
>
> sc.show_profiles()
>
>
> Run as:
>
> $ spark-submit --master local[4] demo.py
>
>
> Here’s a profile excerpt:
>
> ============================================================
> Profile of RDD<id=1>
> ============================================================
>          7160748 function calls (7040732 primitive calls) in 7.075 seconds
>
>    Ordered by: internal time, cumulative time
>
>    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
>     40000    0.551    0.000    1.908    0.000 function_base.py:1780(cov)
>    120000    0.527    0.000    0.527    0.000 {method 'reduce' of 'numpy.ufunc' objects}
>    440020    0.493    0.000    0.493    0.000 {built-in method array}
> [...]
>     80012    0.069    0.000    0.095    0.000 <frozen importlib._bootstrap>:2264(_handle_fromlist)
>
>
> The import-related entry is not quite the same: it is a different function,
> and it accounts for less total time. However, the call count still seems
> quite high (80,012, which is roughly two per corr() call given 4 tasks ×
> 10,000 loop iterations each).
>
> I much appreciate your help. Let me know what other information would be
> useful.
>
> Thanks,
> Reid
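
For what it's worth, _handle_fromlist is what CPython invokes for every
`from X import Y` statement each time it executes, even when X is already
cached in sys.modules; so if library code executes such imports inside
functions called from a hot loop, the count grows with the iteration count.
A minimal sketch, unrelated to Spark or pandas:

```
# sketch: a from-import executed per call shows up as one
# _handle_fromlist call per execution in the profile
import cProfile

def hot():
    # runs importlib._bootstrap._handle_fromlist every time, even
    # though os.path is already imported
    from os.path import join
    return join

cProfile.run('for _ in range(10000): hot()', sort='ncalls')
```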
