[ 
https://issues.apache.org/jira/browse/SPARK-40920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625053#comment-17625053
 ] 

Leonard Papenmeier commented on SPARK-40920:
--------------------------------------------

Using .repartition(10), .toIndexedRowMatrix(), and .sortBy(lambda r: r.index) 
produces the correct U; without the sort, the rows come back in mixed-up order. 

I agree that this behavior is unexpected; the rows should be in the right 
order. 
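
To illustrate why sorting by index matters, here is a minimal NumPy-only 
sketch. It does not call Spark; it only simulates rows of U coming back in a 
mixed-up order, each tagged with its original index. The helper name 
restore_row_order and the (index, row) pairs are hypothetical stand-ins for 
the indexed rows, not part of any Spark API.

```python
import numpy as np


def restore_row_order(indexed_rows):
    """Sort (index, row) pairs by index and stack the rows into a matrix.

    Hypothetical helper: the pairs stand in for indexed rows of U.
    """
    ordered = sorted(indexed_rows, key=lambda pair: pair[0])
    return np.array([row for _, row in ordered])


rng = np.random.default_rng(0)
x = rng.random((14, 3))

# Reference SVD: x == u @ diag(s) @ vt
u, s, vt = np.linalg.svd(x, full_matrices=False)

# Simulate rows arriving out of order (odd-indexed rows first, then even),
# each still carrying its original row index.
perm = np.concatenate([np.arange(1, 14, 2), np.arange(0, 14, 2)])
shuffled = [(int(i), u[i]) for i in perm]

u_unsorted = np.array([row for _, row in shuffled])
u_sorted = restore_row_order(shuffled)

# Only the index-sorted U reproduces the input matrix.
reconstructs_unsorted = np.allclose(u_unsorted @ np.diag(s) @ vt, x)
reconstructs_sorted = np.allclose(u_sorted @ np.diag(s) @ vt, x)
print(reconstructs_unsorted, reconstructs_sorted)
```

The same check (does U times diag(s) times V.T give back the input?) is a 
quick way to tell whether the rows of U are in the expected order. 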

> SVD: matrix U has wrong row order
> ---------------------------------
>
>                 Key: SPARK-40920
>                 URL: https://issues.apache.org/jira/browse/SPARK-40920
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, PySpark
>    Affects Versions: 3.3.0
>         Environment: Python 3.10, multi-core machine, no cluster
>            Reporter: Leonard Papenmeier
>            Priority: Major
>         Attachments: image-2022-10-26-13-58-52-998.png, 
> image-2022-10-26-13-59-04-608.png, image-2022-10-26-13-59-13-425.png
>
>
> When performing SVD on a RowMatrix, the matrix U has the wrong row order, so 
> the original matrix cannot be reconstructed from the returned factors. 
>  
> Consider the following code:
> {code:java}
> x_np = np.random.random((14, 3))  # the size matters, it works for smaller sizes
> x = ctx.parallelize(x_np).zipWithIndex().map(
>     lambda r: [MatrixEntry(r[1], i, r[0][i]) for i in range(len(r[0]))])
> x = CoordinateMatrix(x.flatMap(lambda x: x))
> x_inv = matrix_inverse(x) {code}
> with 
> {code:java}
> def matrix_inverse(matrix: CoordinateMatrix) -> DenseMatrix:
>     mtrx = matrix.toRowMatrix()
>     svd = matrix.toRowMatrix().computeSVD(k=mtrx.numCols(), computeU=True, rCond=1e-15)  # do the SVD
>     s_inv = 1 / svd.s
>     mtrx_orig = matrix.toBlockMatrix().blocks.first()[1].toArray()
>     u_dense = mtrx_orig @ (svd.V.toArray() * s_inv[np.newaxis, :])
>     cov_inv = np.matmul(svd.V.toArray(), np.multiply(s_inv[:, np.newaxis], u_dense.T))
>     u_from_spark = np.array(svd.U.rows.map(lambda x: x.toArray()).collect())
>     return DenseMatrix(numRows=cov_inv.shape[0], numCols=cov_inv.shape[1],
>                        values=cov_inv.ravel(order="F"))  # return inverse as dense matrix {code}
> Then, u_dense is the correct U but differs from the U produced by Spark. In 
> particular, the U in Spark does not return the correct pseudoinverse and 
> U@S@V.T does not reproduce the input matrix. 
>  
> With the following input matrix x
> !image-2022-10-26-13-58-52-998.png!
> I get the following u_dense
> !image-2022-10-26-13-59-04-608.png!
> but the following u_from_spark
> !image-2022-10-26-13-59-13-425.png!
>  
> On careful inspection, it seems that the row order is wrong.
>  



