Thanks for the suggestion -- however, it looks like this is even slower: with
the small data set I'm using, my aggregate function takes ~9 seconds while
colStats.mean() takes ~1 minute. I also can't get it to run with the Kryo
serializer -- I get the error:

com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5,
required: 8

Is there an easy/obvious fix?
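
(In case it's relevant, here's roughly what I was going to try next -- raising
the Kryo buffer limit. I'm guessing at the config keys for 1.2.0, so this is
just a sketch of the idea, not something I've verified:

from pyspark import SparkConf, SparkContext

conf = SparkConf()
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
# let the Kryo serialization buffer grow beyond its default maximum
conf.set("spark.kryoserializer.buffer.max.mb", "128")
sc = SparkContext(conf=conf)

Is that the right knob, or is the overflow coming from somewhere else?)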


On Wed, Jan 7, 2015 at 7:30 PM, Xiangrui Meng <men...@gmail.com> wrote:

> There is some serialization overhead. You can try
>
> https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107
> . -Xiangrui
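>
> (A minimal sketch of that approach, assuming the RDD of SparseVectors is the
> "vals" from the snippet below:
>
> from pyspark.mllib.stat import Statistics
>
> # compute column summary statistics over the RDD in a single pass
> summary = Statistics.colStats(vals)
> means = summary.mean()  # per-column means, returned as a dense array
> )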
>
> On Wed, Jan 7, 2015 at 9:42 AM, rok <rokros...@gmail.com> wrote:
> > I have an RDD of SparseVectors and I'd like to calculate the means,
> > returning a dense vector. I've tried doing this with the following (using
> > pyspark, spark v1.2.0):
> >
> > import numpy as np
> >
> > def aggregate_partition_values(vec1, vec2):
> >     # add a SparseVector (vec2) into the running dense array (vec1)
> >     vec1[vec2.indices] += vec2.values
> >     return vec1
> >
> > def aggregate_combined_vectors(vec1, vec2):
> >     if all(vec1 == vec2):
> >         # then the vector came from only one partition
> >         return vec1
> >     else:
> >         return vec1 + vec2
> >
> > # vec_len is the vector dimension, nvals the number of vectors in the RDD
> > means = vals.aggregate(np.zeros(vec_len), aggregate_partition_values,
> >                        aggregate_combined_vectors)
> > means = means / nvals
> >
> > This turns out to be really slow -- and it doesn't seem to depend on how
> > many vectors there are, so there seems to be some overhead somewhere that
> > I'm not understanding. Is there a better way of doing this?
> >
