[ https://issues.apache.org/jira/browse/SPARK-38816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikolay updated SPARK-38816:
----------------------------
    Attachment: image-2022-04-18-15-54-06-679.png

> Wrong comment in random matrix generator in spark-als algorithm 
> ----------------------------------------------------------------
>
>                 Key: SPARK-38816
>                 URL: https://issues.apache.org/jira/browse/SPARK-38816
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.1.1, 3.1.2, 3.2.1
>            Reporter: Nikolay
>            Assignee: Sean R. Owen
>            Priority: Minor
>             Fix For: 3.1.3, 3.3.0, 3.2.2
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In the Spark ALS algorithm, we need to initialize nonnegative factor matrices
> for users and items.
> In ALS:
>  
> {code:java}
> private def initialize[ID](
>     inBlocks: RDD[(Int, InBlock[ID])],
>     rank: Int,
>     seed: Long): RDD[(Int, FactorBlock)] = {
>   // Choose a unit vector uniformly at random from the unit sphere, but from the
>   // "first quadrant" where all elements are nonnegative. This can be done by choosing
>   // elements distributed as Normal(0,1) and taking the absolute value, and then normalizing.
>   // This appears to create factorizations that have a slightly better reconstruction
>   // (<1%) compared picking elements uniformly at random in [0,1].
>   inBlocks.mapPartitions({ iter =>
>     iter.map {
>       case (srcBlockId, inBlock) =>
>         val random: XORShiftRandom = new XORShiftRandom(byteswap64(seed ^ srcBlockId))
>         val factors: Array[Array[Float]] = Array.fill(inBlock.srcIds.length) {
>           val factor = Array.fill(rank)(random.nextGaussian().toFloat)
>           val nrm: Float = blas.snrm2(rank, factor, 1)
>           blas.sscal(rank, 1.0f / nrm, factor, 1)
>           factor
>         }
>         (srcBlockId, factors)
>     }
>   }, preservesPartitioning = true)
> } {code}
> The comment says that the factor matrix is filled with nonnegative numbers,
> obtained by taking the absolute value of Normal(0,1) samples. The code,
> however, uses random.nextGaussian().toFloat directly and never takes the
> absolute value. The documentation of the nextGaussian method shows that it
> also returns negative numbers:
>  
> {code:java}
> /**
>  * @return the next pseudorandom, Gaussian ("normally") distributed
>  *         {@code double} value with mean {@code 0.0} and
>  *         standard deviation {@code 1.0} from this random number
>  *         generator's sequence
>  */
> synchronized public double nextGaussian() {
>     // See Knuth, ACP, Section 3.4.1 Algorithm C.
>     if (haveNextNextGaussian) {
>         haveNextNextGaussian = false;
>         return nextNextGaussian;
>     } else {
>         double v1, v2, s;
>         do {
>             v1 = 2 * nextDouble() - 1; // between -1 and 1
>             v2 = 2 * nextDouble() - 1; // between -1 and 1
>             s = v1 * v1 + v2 * v2;
>         } while (s >= 1 || s == 0);
>         double multiplier = StrictMath.sqrt(-2 * StrictMath.log(s)/s);
>         nextNextGaussian = v2 * multiplier;
>         haveNextNextGaussian = true;
>         return v1 * multiplier;
>     }
> }
>  {code}
>  
> As a result, the initialized factor matrix contains negative values, which
> contradicts the comment.
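
The mismatch described in the issue can be reproduced outside Spark. The following is a minimal sketch, not the Spark implementation: it assumes java.util.Random in place of Spark's XORShiftRandom and a plain loop in place of the BLAS calls. The class FactorInitSketch and its methods are hypothetical names introduced only for illustration. It builds normalized Gaussian rows with and without the absolute value that the comment mentions, and counts the negative entries in each case.

```java
import java.util.Random;

public class FactorInitSketch {

    // Hypothetical helper: builds one factor row the way the quoted initialize()
    // does (Gaussian samples, then L2 normalization). Assumption: java.util.Random
    // stands in for Spark's XORShiftRandom. If takeAbs is true, the absolute value
    // described in the comment is applied before normalizing.
    static float[] gaussianUnitVector(Random random, int rank, boolean takeAbs) {
        float[] factor = new float[rank];
        double sumSq = 0.0;
        for (int i = 0; i < rank; i++) {
            float g = (float) random.nextGaussian();
            factor[i] = takeAbs ? Math.abs(g) : g;
            sumSq += (double) factor[i] * factor[i];
        }
        float nrm = (float) Math.sqrt(sumSq);
        for (int i = 0; i < rank; i++) {
            factor[i] /= nrm; // scale to unit L2 norm
        }
        return factor;
    }

    // Builds a block of `rows` factor rows from a single seeded generator.
    static float[][] factorBlock(long seed, int rows, int rank, boolean takeAbs) {
        Random random = new Random(seed);
        float[][] factors = new float[rows][];
        for (int r = 0; r < rows; r++) {
            factors[r] = gaussianUnitVector(random, rank, takeAbs);
        }
        return factors;
    }

    // Counts strictly negative entries in the block.
    static int countNegatives(float[][] factors) {
        int count = 0;
        for (float[] row : factors) {
            for (float v : row) {
                if (v < 0.0f) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        float[][] asInCode = factorBlock(42L, 100, 10, false);
        float[][] asInComment = factorBlock(42L, 100, 10, true);
        System.out.println("negatives without abs: " + countNegatives(asInCode));
        System.out.println("negatives with abs:    " + countNegatives(asInComment));
    }
}
```

Without the absolute value, roughly half of the entries are expected to be negative, since Normal(0,1) is symmetric about zero. With it, every entry is nonnegative by construction. Either the code or the comment would need to change for the two to agree.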



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org