I have an RDD of SparseVectors and I'd like to calculate the mean, returning
a dense vector. I've tried doing this with the following (using pyspark,
spark v1.2.0):

import numpy as np

def aggregate_partition_values(vec1, vec2):
    # vec1 is the running dense sum for this partition; vec2 is the next
    # SparseVector, so only its non-zero positions need updating
    vec1[vec2.indices] += vec2.values
    return vec1

def aggregate_combined_vectors(vec1, vec2):
    # merge per-partition sums; adding a partition's zero vector is a
    # no-op, so a plain elementwise sum is always correct
    return vec1 + vec2

means = vals.aggregate(np.zeros(vec_len), aggregate_partition_values,
                       aggregate_combined_vectors)
means = means / nvals
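
For concreteness, vec_len is the known dimension of the vectors and nvals
the number of vectors in the RDD. A minimal, self-contained setup I can
reproduce this with (the three sample vectors are just made-up toy data):

from pyspark import SparkContext
from pyspark.mllib.linalg import SparseVector
import numpy as np

sc = SparkContext(appName="sparse-mean")

vec_len = 5
vals = sc.parallelize([
    SparseVector(vec_len, {0: 1.0, 3: 2.0}),   # dict of index -> value
    SparseVector(vec_len, {1: 4.0}),
    SparseVector(vec_len, {0: 3.0, 4: 6.0}),
])
nvals = vals.count()

# running the aggregate above on this toy data should give
# [4/3, 4/3, 0, 2/3, 2]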

This turns out to be really slow, and the run time doesn't seem to depend on
how many vectors there are, so there must be some fixed overhead somewhere
that I'm not understanding. Is there a better way of doing this?
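
For comparison, the only alternatives I've come up with so far are
densifying every vector up front, or handing the whole thing to mllib's
column statistics. A sketch of both (assuming Statistics.colStats is
available in this pyspark version and accepts an RDD of SparseVectors):

from pyspark.mllib.stat import Statistics

# option 1: densify each vector and let RDD.sum() add the numpy arrays
means = vals.map(lambda v: v.toArray()).sum() / nvals

# option 2: let mllib compute the column means directly
summary = Statistics.colStats(vals)
means = summary.mean()

Neither feels obviously right -- option 1 materializes a full dense array
per element, and I don't know how option 2 behaves on sparse input.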


