No, colStats() computes all summary statistics in one pass and stores the values. It is not lazy.
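For reference, here is a minimal sketch of that colStats() path, assuming the pyspark.mllib API as of Spark 1.2; the RDD contents, vec_len, and the app name are placeholders:

from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.stat import Statistics

sc = SparkContext(appName="sparse-mean-sketch")

# A small RDD of SparseVectors; vec_len and the entries are made up.
vec_len = 5
vals = sc.parallelize([
    Vectors.sparse(vec_len, {0: 1.0, 3: 2.0}),
    Vectors.sparse(vec_len, {1: 4.0}),
])

# colStats() makes a single pass over the RDD and materializes the summary
# statistics eagerly; .mean() just returns the already-computed column means.
summary = Statistics.colStats(vals)
print(summary.mean())  # dense numpy array of per-column means

The same pass also produces variance(), numNonzeros(), and so on, which is why it can be slower than an aggregate that only computes the mean.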
On Mon, Jan 12, 2015 at 4:42 AM, Rok Roskar <rokros...@gmail.com> wrote:
> This was without using Kryo -- if I use Kryo, I get errors about buffer
> overflows (see above):
>
> com.esotericsoftware.kryo.KryoException: Buffer overflow. Available: 5,
> required: 8
>
> Just calling colStats doesn't actually compute those statistics, does it?
> It looks like the computation is only carried out once you call the
> .mean() method.
>
> On Sat, Jan 10, 2015 at 7:04 AM, Xiangrui Meng <men...@gmail.com> wrote:
>>
>> colStats() computes the mean values along with several other summary
>> statistics, which makes it slower. How is the performance if you don't
>> use Kryo? -Xiangrui
>>
>> On Fri, Jan 9, 2015 at 3:46 AM, Rok Roskar <rokros...@gmail.com> wrote:
>> > thanks for the suggestion -- however, it looks like this is even
>> > slower. With the small data set I'm using, my aggregate function
>> > takes ~ 9 seconds and colStats().mean() takes ~ 1 minute. However,
>> > I can't get it to run with the Kryo serializer -- I get the error:
>> >
>> > com.esotericsoftware.kryo.KryoException: Buffer overflow.
>> > Available: 5, required: 8
>> >
>> > is there an easy/obvious fix?
>> >
>> > On Wed, Jan 7, 2015 at 7:30 PM, Xiangrui Meng <men...@gmail.com> wrote:
>> >>
>> >> There is some serialization overhead. You can try
>> >> https://github.com/apache/spark/blob/master/python/pyspark/mllib/stat.py#L107
>> >> -Xiangrui
>> >>
>> >> On Wed, Jan 7, 2015 at 9:42 AM, rok <rokros...@gmail.com> wrote:
>> >> > I have an RDD of SparseVectors and I'd like to calculate the mean,
>> >> > returning a dense vector. I've tried doing this with the following
>> >> > (using pyspark, spark v1.2.0):
>> >> >
>> >> > import numpy as np
>> >> >
>> >> > def aggregate_partition_values(vec1, vec2):
>> >> >     # accumulate the sparse values into the dense accumulator
>> >> >     vec1[vec2.indices] += vec2.values
>> >> >     return vec1
>> >> >
>> >> > def aggregate_combined_vectors(vec1, vec2):
>> >> >     if all(vec1 == vec2):
>> >> >         # then the vector came from only one partition
>> >> >         return vec1
>> >> >     else:
>> >> >         return vec1 + vec2
>> >> >
>> >> > means = vals.aggregate(np.zeros(vec_len),
>> >> >                        aggregate_partition_values,
>> >> >                        aggregate_combined_vectors)
>> >> > means = means / nvals
>> >> >
>> >> > This turns out to be really slow -- and it doesn't seem to depend
>> >> > on how many vectors there are, so there seems to be some overhead
>> >> > somewhere that I'm not understanding. Is there a better way of
>> >> > doing this?
>> >> >
>> >> > --
>> >> > View this message in context:
>> >> > http://apache-spark-user-list.1001560.n3.nabble.com/calculating-the-mean-of-SparseVector-RDD-tp21019.html
>> >> > Sent from the Apache Spark User List mailing list archive at
>> >> > Nabble.com.
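A note on the Kryo buffer overflow reported above: the usual workaround is to enlarge the Kryo serializer buffers when building the SparkConf. This is only a sketch; the key names below are assumed for the Spark 1.x line (sizes in MB), and newer releases renamed them to spark.kryoserializer.buffer / spark.kryoserializer.buffer.max with size suffixes such as "128m":

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("kryo-buffer-sketch")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Assumed Spark 1.2-era keys: raise the maximum buffer so large
        # objects no longer overflow during Kryo serialization.
        .set("spark.kryoserializer.buffer.mb", "8")
        .set("spark.kryoserializer.buffer.max.mb", "128"))

sc = SparkContext(conf=conf)

If the overflow persists, raising the maximum buffer size further is the usual next step.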