[ https://issues.apache.org/jira/browse/SPARK-40920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625053#comment-17625053 ]
Leonard Papenmeier commented on SPARK-40920:
--------------------------------------------

Using .repartition(10), .toIndexedRowMatrix(), and .sortBy(lambda r: r.index) produces the correct U; without sorting, the row order is mixed up. I agree that this behavior is unexpected; the rows should be in the right order.

> SVD: matrix U has wrong row order
> ---------------------------------
>
>                 Key: SPARK-40920
>                 URL: https://issues.apache.org/jira/browse/SPARK-40920
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 3.3.0
>         Environment: Python 3.10, multi-core machine, no cluster
>            Reporter: Leonard Papenmeier
>            Priority: Major
>         Attachments: image-2022-10-26-13-58-52-998.png,
> image-2022-10-26-13-59-04-608.png, image-2022-10-26-13-59-13-425.png
>
>
> When performing SVD on a RowMatrix, the matrix U has the wrong row order, and
> the original matrix cannot be correctly restored from the returned U.
>
> Consider the following code:
> {code:java}
> x_np = np.random.random((14, 3))  # the size matters, it works for smaller sizes
> x = ctx.parallelize(x_np).zipWithIndex().map(
>     lambda r: [MatrixEntry(r[1], i, r[0][i]) for i in range(len(r[0]))])
> x = CoordinateMatrix(x.flatMap(lambda x: x))
> x_inv = matrix_inverse(x) {code}
> with
> {code:java}
> def matrix_inverse(matrix: CoordinateMatrix) -> DenseMatrix:
>     mtrx = matrix.toRowMatrix()
>     svd = matrix.toRowMatrix().computeSVD(k=mtrx.numCols(), computeU=True,
>                                           rCond=1e-15)  # do the SVD
>     s_inv = 1 / svd.s
>     mtrx_orig = matrix.toBlockMatrix().blocks.first()[1].toArray()
>     u_dense = mtrx_orig @ (svd.V.toArray() * s_inv[np.newaxis, :])
>     cov_inv = np.matmul(svd.V.toArray(),
>                         np.multiply(s_inv[:, np.newaxis], u_dense.T))
>     u_from_spark = np.array(svd.U.rows.map(lambda x: x.toArray()).collect())
>     return DenseMatrix(numRows=cov_inv.shape[0], numCols=cov_inv.shape[1],
>                        values=cov_inv.ravel(order="F"))  # return inverse as dense matrix {code}
> Then, u_dense is the correct U but differs from the U produced by Spark. In
> particular, the U in Spark does not yield the correct pseudoinverse, and
> U @ S @ V.T does not reproduce the input matrix.
>
> With the following input matrix x
> !image-2022-10-26-13-58-52-998.png!
> I get the following u_dense
> !image-2022-10-26-13-59-04-608.png!
> but the following u_from_spark
> !image-2022-10-26-13-59-13-425.png!
>
> On careful inspection, it seems that the row order is wrong.
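
To illustrate why sorting by row index restores the correct U, here is a minimal NumPy-only sketch. It is not Spark code and does not reproduce the reported bug: it only simulates rows of U coming back out of order, each still tagged with its original index, which is an assumption about what the unsorted collection looks like. The names indexed_rows and reconstruct are hypothetical helpers introduced for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((14, 3))  # same shape as in the report

# Thin SVD, so that x == u @ diag(s) @ vt.
u, s, vt = np.linalg.svd(x, full_matrices=False)

# Assumption for illustration: rows of U arrive in a rotated order,
# but every row still carries its original row index.
perm = np.roll(np.arange(x.shape[0]), 3)
indexed_rows = [(int(i), u[i]) for i in perm]


def reconstruct(u_rows):
    """Rebuild the input matrix from a list of U rows and the fixed s, vt."""
    return np.asarray(u_rows) @ np.diag(s) @ vt


# Taking the rows as they arrive gives a row-permuted reconstruction.
u_unsorted = [row for _, row in indexed_rows]
# Sorting by the carried index first puts the rows back in place.
u_sorted = [row for _, row in sorted(indexed_rows, key=lambda r: r[0])]

unsorted_ok = np.allclose(reconstruct(u_unsorted), x)
sorted_ok = np.allclose(reconstruct(u_sorted), x)
print("unsorted reconstructs x:", unsorted_ok)
print("sorted reconstructs x:  ", sorted_ok)
```

In this simulation the unsorted U reproduces the rows of x in permuted order rather than x itself, which matches the symptom described in the issue; sorting by index before collecting is the same idea as the .sortBy(lambda r: r.index) step in the comment above.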