colStats() computes the mean along with several other summary
statistics, which makes it slower. How is the performance if you don't
use Kryo? -Xiangrui
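For context on why colStats() costs more than a bare mean: a colStats-style summarizer maintains several running statistics per column in a single pass. A rough stdlib-Python sketch, with no Spark involved and illustrative function names (not MLlib's actual implementation):

```python
def col_summary(rows):
    """One pass over dense rows, tracking mean, M2 (for variance), and
    nonzero counts per column -- roughly the bookkeeping a colStats-style
    summarizer does on top of the mean."""
    n = 0
    width = len(rows[0])
    mean = [0.0] * width
    m2 = [0.0] * width
    nnz = [0] * width
    for row in rows:
        n += 1
        for j, x in enumerate(row):
            delta = x - mean[j]
            mean[j] += delta / n            # Welford's online mean update
            m2[j] += delta * (x - mean[j])  # running sum of squared deviations
            if x != 0.0:
                nnz[j] += 1
    variance = [v / (n - 1) if n > 1 else 0.0 for v in m2]
    return {"mean": mean, "variance": variance, "numNonzeros": nnz}

def col_mean_only(rows):
    """The cheaper computation: column sums divided by the row count."""
    n = len(rows)
    sums = [0.0] * len(rows[0])
    for row in rows:
        for j, x in enumerate(row):
            sums[j] += x
    return [s / n for s in sums]
```

The extra accumulators are why asking colStats for just the mean still pays for the full summary.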

On Fri, Jan 9, 2015 at 3:46 AM, Rok Roskar <rokros...@gmail.com> wrote:
> thanks for the suggestion -- however, it looks like this is even slower.
> With the small data set I'm using, my aggregate function takes ~9 seconds
> and colStats.mean() takes ~1 minute. However, I can't get it to run with
> the Kryo serializer -- I get the error:
>
> com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5,
> required: 8
>
> is there an easy/obvious fix?
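A common fix for this Kryo buffer overflow is to raise the serializer's maximum buffer size. A configuration sketch, assuming Spark 1.2 where the setting is spelled spark.kryoserializer.buffer.max.mb (later Spark versions renamed it to spark.kryoserializer.buffer.max and take values like "128m"); the value 128 here is illustrative:

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set("spark.serializer",
             "org.apache.spark.serializer.KryoSerializer")
        # Raise the per-object Kryo buffer ceiling, in MB on Spark 1.x;
        # increase further if the overflow persists.
        .set("spark.kryoserializer.buffer.max.mb", "128"))
sc = SparkContext(conf=conf)
```

This only needs to be set before the SparkContext is created.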
>
>
> On Wed, Jan 7, 2015 at 7:30 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> There is some serialization overhead. You can try
>>
>> https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107
>> . -Xiangrui
>>
>> On Wed, Jan 7, 2015 at 9:42 AM, rok <rokros...@gmail.com> wrote:
>> > I have an RDD of SparseVectors and I'd like to calculate the means
>> > returning
>> > a dense vector. I've tried doing this with the following (using pyspark,
>> > spark v1.2.0):
>> >
>> > def aggregate_partition_values(vec1, vec2):
>> >     vec1[vec2.indices] += vec2.values
>> >     return vec1
>> >
>> > def aggregate_combined_vectors(vec1, vec2):
>> >     if all(vec1 == vec2):
>> >         # then the vector came from only one partition
>> >         return vec1
>> >     else:
>> >         return vec1 + vec2
>> >
>> > means = vals.aggregate(np.zeros(vec_len), aggregate_partition_values,
>> >                        aggregate_combined_vectors)
>> > means = means / nvals
>> >
>> > This turns out to be really slow -- and the time doesn't seem to depend
>> > on how many vectors there are, so there seems to be some overhead
>> > somewhere that I'm not understanding. Is there a better way of doing
>> > this?
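One pitfall in the snippet above is that the seqOp mutates its accumulator in place, which is what the all(vec1 == vec2) guard in the combiner appears to compensate for; if the functions never mutate their inputs, the combiner becomes plain addition. A stdlib-only sketch that simulates aggregate() over partitions locally (SparseVec is a hypothetical stand-in for MLlib's SparseVector, not the real class):

```python
from collections import namedtuple
from functools import reduce

# Illustrative stand-in for pyspark.mllib.linalg.SparseVector.
SparseVec = namedtuple("SparseVec", ["size", "indices", "values"])

def seq_op(acc, vec):
    # Copy-then-add: never mutate the zero value handed to each partition.
    out = list(acc)
    for i, v in zip(vec.indices, vec.values):
        out[i] += v
    return out

def comb_op(a, b):
    # With a non-mutating seq_op, combining is elementwise addition --
    # no need to test whether the two sides are identical.
    return [x + y for x, y in zip(a, b)]

def simulated_aggregate(partitions, vec_len):
    # Mimics rdd.aggregate(zero, seq_op, comb_op) on a local list of lists.
    zero = [0.0] * vec_len
    partials = [reduce(seq_op, part, zero) for part in partitions]
    return reduce(comb_op, partials, zero)

# Two partitions of sparse vectors of length 4.
parts = [
    [SparseVec(4, [0, 2], [1.0, 2.0]), SparseVec(4, [2], [3.0])],
    [SparseVec(4, [1, 3], [4.0, 5.0])],
]
total = simulated_aggregate(parts, 4)
nvals = 3
means = [t / nvals for t in total]
```

The same seq_op/comb_op pair should drop into vals.aggregate() unchanged, with np.zeros(vec_len) as the zero value.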
>> >
>> >
>> >
>> > --
>> > View this message in context:
>> > http://apache-spark-user-list.1001560.n3.nabble.com/calculating-the-mean-of-SparseVector-RDD-tp21019.html
>> > Sent from the Apache Spark User List mailing list archive at Nabble.com.
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> > For additional commands, e-mail: user-h...@spark.apache.org
>> >
>
>
