I have an RDD of SparseVectors and I'd like to calculate their mean, returning a dense vector. I've tried doing this with the following (using pyspark, Spark v1.2.0):
    import numpy as np

    def aggregate_partition_values(vec1, vec2):
        # vec1 is the running dense accumulator (an np.ndarray);
        # vec2 is the next SparseVector in the partition
        vec1[vec2.indices] += vec2.values
        return vec1

    def aggregate_combined_vectors(vec1, vec2):
        if all(vec1 == vec2):
            # then the vector came from only one partition
            return vec1
        else:
            return vec1 + vec2

    # vals is the RDD of SparseVectors, vec_len their dimensionality,
    # and nvals the total number of vectors
    means = vals.aggregate(np.zeros(vec_len),
                           aggregate_partition_values,
                           aggregate_combined_vectors)
    means = means / nvals

This turns out to be really slow, and the runtime doesn't seem to depend on how many vectors there are, so there must be some fixed overhead somewhere that I'm not understanding. Is there a better way of doing this?
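For comparison, here is a sketch that sidesteps the custom aggregation entirely by using pyspark.mllib.stat.Statistics.colStats (assuming it is available in this Spark version); it computes the per-dimension means in a single pass over the RDD:

    from pyspark.mllib.stat import Statistics

    # colStats accepts an RDD of Vectors (sparse or dense) and returns
    # a MultivariateStatisticalSummary; mean() gives the mean per dimension
    summary = Statistics.colStats(vals)
    means = summary.mean()

This keeps the averaging inside MLlib rather than in Python-side aggregation functions, which may avoid some of the per-record overhead.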