[ https://issues.apache.org/jira/browse/SPARK-21680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-21680.
-------------------------------
    Resolution: Fixed
    Fix Version/s: 2.3.0

Issue resolved by pull request 18899
[https://github.com/apache/spark/pull/18899]

> ML/MLLIB Vector compressed optimization
> ---------------------------------------
>
>                 Key: SPARK-21680
>                 URL: https://issues.apache.org/jira/browse/SPARK-21680
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, MLlib
>    Affects Versions: 2.3.0
>            Reporter: Peng Meng
>             Fix For: 2.3.0
>
>
> When Vector.compressed is used to convert a Vector to a SparseVector, it
> is much slower than Vector.toSparse.
> This is because Vector.compressed scans the values three times, whereas
> Vector.toSparse needs only two scans.
> When the vector is long, the performance difference between the two
> methods is significant.
> Code of Vector.compressed:
> {code:java}
>   def compressed: Vector = {
>     val nnz = numNonzeros
>     // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs 12 * nnz + 20 bytes.
>     if (1.5 * (nnz + 1.0) < size) {
>       toSparse
>     } else {
>       toDense
>     }
>   }
> {code}
> I propose changing it to:
> {code:java}
>   // Build the SparseVector directly from nnz instead of calling toSparse.
>   def compressed: Vector = {
>     val nnz = numNonzeros
>     // A dense vector needs 8 * size + 8 bytes, while a sparse vector needs 12 * nnz + 20 bytes.
>     if (1.5 * (nnz + 1.0) < size) {
>       val ii = new Array[Int](nnz)
>       val vv = new Array[Double](nnz)
>       var k = 0
>       foreachActive { (i, v) =>
>         if (v != 0) {
>           ii(k) = i
>           vv(k) = v
>           k += 1
>         }
>       }
>       new SparseVector(size, ii, vv)
>     } else {
>       toDense
>     }
>   }
> {code}
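For readers without a Spark checkout, the proposal can be tried in isolation with the minimal sketch below. It is not the code merged in pull request 18899: Vec, Dense, Sparse and CompressSketch are hypothetical stand-ins for Spark's vector classes, and a plain Array[Double] replaces foreachActive. It assumes the extra scan in the original comes from toSparse counting the nonzeros again, which the issue implies but does not spell out. The threshold follows from the byte counts in the comment: 12 * nnz + 20 < 8 * size + 8 reduces to 1.5 * (nnz + 1.0) < size.

```scala
// Hypothetical stand-ins for Spark's vector types (not org.apache.spark.ml.linalg).
sealed trait Vec { def size: Int }
final case class Dense(values: Array[Double]) extends Vec {
  def size: Int = values.length
}
final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

object CompressSketch {
  // Scan 1: count the nonzero entries.
  def numNonzeros(values: Array[Double]): Int = {
    var nnz = 0
    var i = 0
    while (i < values.length) {
      if (values(i) != 0.0) nnz += 1
      i += 1
    }
    nnz
  }

  // Dense storage: 8 * size + 8 bytes; sparse storage: 12 * nnz + 20 bytes.
  // 12 * nnz + 20 < 8 * size + 8  <=>  1.5 * (nnz + 1.0) < size
  def sparseIsSmaller(nnz: Int, size: Int): Boolean =
    1.5 * (nnz + 1.0) < size

  def compressed(values: Array[Double]): Vec = {
    val nnz = numNonzeros(values)
    if (sparseIsSmaller(nnz, values.length)) {
      // Scan 2: fill the index and value arrays, reusing nnz from scan 1
      // instead of counting the nonzeros a second time.
      val ii = new Array[Int](nnz)
      val vv = new Array[Double](nnz)
      var k = 0
      var i = 0
      while (i < values.length) {
        val v = values(i)
        if (v != 0.0) {
          ii(k) = i
          vv(k) = v
          k += 1
        }
        i += 1
      }
      Sparse(values.length, ii, vv)
    } else {
      Dense(values)
    }
  }
}
```

With 8 entries of which 2 are nonzero, 1.5 * (2 + 1.0) = 4.5 < 8, so CompressSketch.compressed returns a Sparse value. With 3 entries of which 2 are nonzero, 4.5 < 3 is false and the input stays Dense.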