There is some serialization overhead. You can try
https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107
. -Xiangrui
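
For example, something along these lines (a minimal sketch, assuming vals is the
RDD of SparseVectors from the snippet below):

from pyspark.mllib.stat import Statistics

# colStats computes the column-wise summary statistics on the JVM side in a
# single pass over the RDD, instead of combining vectors record-by-record in Python.
summary = Statistics.colStats(vals)
means = summary.mean()  # per-column means, returned as a dense vector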

On Wed, Jan 7, 2015 at 9:42 AM, rok <rokros...@gmail.com> wrote:
> I have an RDD of SparseVectors and I'd like to calculate the means returning
> a dense vector. I've tried doing this with the following (using pyspark,
> spark v1.2.0):
>
> import numpy as np
>
> def aggregate_partition_values(vec1, vec2) :
>     vec1[vec2.indices] += vec2.values
>     return vec1
>
> def aggregate_combined_vectors(vec1, vec2) :
>     if all(vec1 == vec2) :
>         # then the vector came from only one partition
>         return vec1
>     else:
>         return vec1 + vec2
>
> means = vals.aggregate(np.zeros(vec_len), aggregate_partition_values, aggregate_combined_vectors)
> means = means / nvals
>
> This turns out to be really slow, and the time doesn't seem to depend on how
> many vectors there are, so there seems to be some overhead somewhere that I'm
> not understanding. Is there a better way of doing this?
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
