[ https://issues.apache.org/jira/browse/SPARK-38816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nikolay updated SPARK-38816:
----------------------------
    Attachment: image-2022-04-18-15-54-06-679.png

> Wrong comment in random matrix generator in spark-als algorithm
> ----------------------------------------------------------------
>
>                Key: SPARK-38816
>                URL: https://issues.apache.org/jira/browse/SPARK-38816
>            Project: Spark
>         Issue Type: Improvement
>         Components: ML
>   Affects Versions: 3.1.1, 3.1.2, 3.2.1
>           Reporter: Nikolay
>           Assignee: Sean R. Owen
>           Priority: Minor
>            Fix For: 3.1.3, 3.3.0, 3.2.2
>
>  Original Estimate: 24h
> Remaining Estimate: 24h
>
> In the Spark ALS algorithm, we need to initialize nonnegative factor
> matrices for users and items.
> In ALS:
>
> {code:java}
> private def initialize[ID](
>     inBlocks: RDD[(Int, InBlock[ID])],
>     rank: Int,
>     seed: Long): RDD[(Int, FactorBlock)] = {
>   // Choose a unit vector uniformly at random from the unit sphere, but from the
>   // "first quadrant" where all elements are nonnegative. This can be done by choosing
>   // elements distributed as Normal(0,1) and taking the absolute value, and then normalizing.
>   // This appears to create factorizations that have a slightly better reconstruction
>   // (<1%) compared picking elements uniformly at random in [0,1].
>   inBlocks.mapPartitions({ iter =>
>     iter.map {
>       case (srcBlockId, inBlock) =>
>         val random: XORShiftRandom = new XORShiftRandom(byteswap64(seed ^ srcBlockId))
>         val factors: Array[Array[Float]] = Array.fill(inBlock.srcIds.length) {
>           val factor = Array.fill(rank)(random.nextGaussian().toFloat)
>           val nrm: Float = blas.snrm2(rank, factor, 1)
>           blas.sscal(rank, 1.0f / nrm, factor, 1)
>           factor
>         }
>         (srcBlockId, factors)
>     }
>   }, preservesPartitioning = true)
> } {code}
> The comment says that the generated factor vectors contain only nonnegative
> elements. The code, however, fills them with random.nextGaussian().toFloat,
> and the documentation of the nextGaussian method shows that it also returns
> negative numbers:
>
> {code:java}
> /**
>  * @return the next pseudorandom, Gaussian ("normally") distributed
>  *         {@code double} value with mean {@code 0.0} and
>  *         standard deviation {@code 1.0} from this random number
>  *         generator's sequence
>  */
> synchronized public double nextGaussian() {
>     // See Knuth, ACP, Section 3.4.1 Algorithm C.
>     if (haveNextNextGaussian) {
>         haveNextNextGaussian = false;
>         return nextNextGaussian;
>     } else {
>         double v1, v2, s;
>         do {
>             v1 = 2 * nextDouble() - 1; // between -1 and 1
>             v2 = 2 * nextDouble() - 1; // between -1 and 1
>             s = v1 * v1 + v2 * v2;
>         } while (s >= 1 || s == 0);
>         double multiplier = StrictMath.sqrt(-2 * StrictMath.log(s)/s);
>         nextNextGaussian = v2 * multiplier;
>         haveNextNextGaussian = true;
>         return v1 * multiplier;
>     }
> }
> {code}
>
> The result is a matrix that contains negative values, because the code never
> takes the absolute value that the comment describes.
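
The mismatch between the comment and the code can be reproduced outside Spark. The following is a minimal sketch, not a proposed patch: it is written in Java rather than Scala, it uses java.util.Random as a stand-in for Spark's XORShiftRandom, and the class and method names (FactorInitSketch, gaussianUnitVector, nonnegativeUnitVector) are hypothetical. The first method mirrors what the quoted code does; the second does what the comment describes, assuming "taking the absolute value" is applied to each element before normalizing.

```java
import java.util.Random;

// Hypothetical sketch; java.util.Random stands in for Spark's XORShiftRandom.
class FactorInitSketch {

    // Mirrors the quoted code: Normal(0,1) elements, then L2 normalization.
    static float[] gaussianUnitVector(Random random, int rank) {
        float[] factor = new float[rank];
        for (int i = 0; i < rank; i++) {
            factor[i] = (float) random.nextGaussian();
        }
        return normalize(factor);
    }

    // What the comment describes: |Normal(0,1)| elements, then L2 normalization.
    static float[] nonnegativeUnitVector(Random random, int rank) {
        float[] factor = new float[rank];
        for (int i = 0; i < rank; i++) {
            factor[i] = (float) Math.abs(random.nextGaussian());
        }
        return normalize(factor);
    }

    // Scales the vector in place to unit Euclidean length (the snrm2 + sscal step).
    static float[] normalize(float[] v) {
        double sumSq = 0.0;
        for (float x : v) {
            sumSq += (double) x * x;
        }
        float nrm = (float) Math.sqrt(sumSq);
        for (int i = 0; i < v.length; i++) {
            v[i] /= nrm;
        }
        return v;
    }

    static int countNegative(float[] v) {
        int count = 0;
        for (float x : v) {
            if (x < 0.0f) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int rank = 10;
        float[] asCoded = gaussianUnitVector(new Random(42L), rank);
        float[] asCommented = nonnegativeUnitVector(new Random(42L), rank);
        System.out.println("negative elements, as coded:     " + countNegative(asCoded));
        System.out.println("negative elements, as commented: " + countNegative(asCommented));
    }
}
```

Both variants return unit vectors; only the second is restricted to the "first quadrant". Whether the comment or the code should change is a separate question that this sketch does not answer.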