No, colStats() computes all summary statistics in one pass and stores the values. It is not lazy.
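For reference, here is a minimal sketch of that colStats() path, assuming the pyspark.mllib API as of Spark 1.2; the RDD contents, vec_len, and the app name are placeholders:

from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

sc = SparkContext(appName="sparse-mean-sketch")

# A small RDD of SparseVectors; vec_len and the entries are made up.
vec_len = 5
vals = sc.parallelize([
    Vectors.sparse(vec_len, {0: 1.0, 3: 2.0}),
    Vectors.sparse(vec_len, {1: 4.0}),
])

# colStats() makes a single pass over the RDD and materializes the summary
# statistics eagerly; .mean() just returns the already-computed column means.
summary = Statistics.colStats(vals)
print(summary.mean())  # dense numpy array of per-column means

The same pass also produces variance(), numNonzeros(), and so on, which is why it can be slower than an aggregate that only computes the mean.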
On Mon, Jan 12, 2015 at 4:42 AM, Rok Roskar <rokros...@gmail.com> wrote:
> This was without using Kryo -- if I use Kryo, I get errors about buffer
> overflows (see above):
>
> com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5,
> required: 8
>
> Just calling colStats doesn't actually compute those statistics, does it?
> It looks like the computation is only carried out once you call the
> .mean() method.
>
> On Sat, Jan 10, 2015 at 7:04 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> colStats() computes the mean values along with several other summary
>> statistics, which makes it slower. How is the performance if you don't
>> use Kryo? -Xiangrui
>>
>> On Fri, Jan 9, 2015 at 3:46 AM, Rok Roskar <rokros...@gmail.com> wrote:
>> > thanks for the suggestion -- however, it looks like this is even
>> > slower. With the small data set I'm using, my aggregate function
>> > takes ~ 9 seconds and colStats().mean() takes ~ 1 minute. However,
>> > I can't get it to run with the Kryo serializer -- I get the error:
>> >
>> > com.esotericsoftware.kryo.KryoException: Buffer overflow.
>> > Available: 5, required: 8
>> >
>> > is there an easy/obvious fix?
>> >
>> > On Wed, Jan 7, 2015 at 7:30 PM, Xiangrui Meng <men...@gmail.com> wrote:
>> >>
>> >> There is some serialization overhead. You can try
>> >> https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107
>> >> -Xiangrui
>> >>
>> >> On Wed, Jan 7, 2015 at 9:42 AM, rok <rokros...@gmail.com> wrote:
>> >> > I have an RDD of SparseVectors and I'd like to calculate the mean,
>> >> > returning a dense vector. I've tried doing this with the following
>> >> > (using pyspark, spark v1.2.0):
>> >> >
>> >> > import numpy as np
>> >> >
>> >> > def aggregate_partition_values(vec1, vec2):
>> >> >     # accumulate the sparse values into the dense accumulator
>> >> >     vec1[vec2.indices] += vec2.values
>> >> >     return vec1
>> >> >
>> >> > def aggregate_combined_vectors(vec1, vec2):
>> >> >     if all(vec1 == vec2):
>> >> >         # then the vector came from only one partition
>> >> >         return vec1
>> >> >     else:
>> >> >         return vec1 + vec2
>> >> >
>> >> > means = vals.aggregate(np.zeros(vec_len),
>> >> >                        aggregate_partition_values,
>> >> >                        aggregate_combined_vectors)
>> >> > means = means / nvals
>> >> >
>> >> > This turns out to be really slow -- and it doesn't seem to depend
>> >> > on how many vectors there are, so there seems to be some overhead
>> >> > somewhere that I'm not understanding. Is there a better way of
>> >> > doing this?
>> >> >
>> >> > --
>> >> > View this message in context:
>> >> > http://apache-spark-user-list.1001560.n3.nabble.com/calculating-the-mean-of-SparseVector-RDD-tp21019.html
>> >> > Sent from the Apache Spark User List mailing list archive at
>> >> > Nabble.com.
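A note on the Kryo buffer overflow reported above: the usual workaround is to enlarge the Kryo serializer buffers when building the SparkConf. This is only a sketch; the key names below are assumed for the Spark 1.x line (sizes in MB), and newer releases renamed them to spark.kryoserializer.buffer / spark.kryoserializer.buffer.max with size suffixes such as "128m":

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kryo-buffer-sketch")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Assumed Spark 1.2-era keys: raise the maximum buffer so large
        # objects no longer overflow during Kryo serialization.
        .set("spark.kryoserializer.buffer.mb", "8")
        .set("spark.kryoserializer.buffer.max.mb", "128"))

sc = SparkContext(conf=conf)

If the overflow persists, raising the maximum buffer size further is the usual next step.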