[ https://issues.apache.org/jira/browse/SPARK-24300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley updated SPARK-24300:
--------------------------------------
    Shepherd: Joseph K. Bradley

> generateLDAData in ml.cluster.LDASuite didn't set seed correctly
> ----------------------------------------------------------------
>
>                 Key: SPARK-24300
>                 URL: https://issues.apache.org/jira/browse/SPARK-24300
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Xiangrui Meng
>            Assignee: Lu Wang
>            Priority: Minor
>
> [https://github.com/apache/spark/blob/0d63eb8888d17df747fb41d7ba254718bb7af3ae/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala]
>
> generateLDAData uses the same RNG in all partitions to generate random data.
> This causes either duplicate rows in cluster mode or nondeterministic behavior
> in local mode:
> {code:java}
> scala> val rng = new java.util.Random(10)
> rng: java.util.Random = java.util.Random@78c5ef58
> scala> sc.parallelize(1 to 10).map { i => Seq.fill(10)(rng.nextInt(10)) }.collect().mkString("\n")
> res12: String =
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8){code}
> We should create one RNG per partition to make it safe.
>
> cc: [~lu.DB] [~josephkb]
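
For illustration, one way the "one RNG per partition" suggestion could look is sketched below. This is not the patch that was applied to LDASuite; the object name `PerPartitionRng`, the method `generate`, and the `seed + partitionIndex` derivation are all assumptions made for this example. The idea is to construct the RNG inside `mapPartitionsWithIndex`, so that each partition gets its own generator with a seed derived from the base seed and the partition index, instead of capturing a single driver-side RNG in the closure.

```scala
import java.util.Random

import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

// Hypothetical helper, not part of Spark or LDASuite.
object PerPartitionRng {

  // Generates 10 rows of 10 random ints in [0, 10), creating a separate
  // RNG inside each partition rather than sharing one across the closure.
  def generate(sc: SparkContext, seed: Long, numPartitions: Int): Array[Seq[Int]] = {
    sc.parallelize(1 to 10, numSlices = numPartitions)
      .mapPartitionsWithIndex { (partitionIndex, iter) =>
        // Assumed seeding scheme: base seed offset by the partition index.
        // Any deterministic per-partition derivation would serve the same purpose.
        val rng = new Random(seed + partitionIndex)
        iter.map(_ => Seq.fill(10)(rng.nextInt(10)))
      }
      .collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[4]")
      .appName("per-partition-rng")
      .getOrCreate()
    try {
      generate(spark.sparkContext, seed = 10L, numPartitions = 4).foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```

With a fixed seed and a fixed number of partitions, repeated calls to `generate` should return the same rows, since each partition's generator starts from a state that depends only on the seed and the partition index. Changing the partition count changes the data, so a test relying on this would need to pin `numPartitions` as well.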