spark git commit: Log warnings for numIterations * miniBatchFraction < 1.0
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 f7158c482 -> 0064a4dcb


Log warnings for numIterations * miniBatchFraction < 1.0

## What changes were proposed in this pull request?

Add a warning log for the case that `numIterations * miniBatchFraction < 1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and `numIterations = 3`. Then 3 iterations will occur, each sampling approximately 20 examples. Even in the best case, where every sampled example is distinct, at most 60 of the 100 examples are ever used. This may be counter-intuitive to most users and led to an issue during the development of another Spark ML model: https://github.com/zhengruifeng/spark-libFM/issues/11. If a user actually does not require the full training data set, it would be easier and more intuitive to downsample it explicitly with `RDD.sample`.

## How was this patch tested?

`build/mvn -DskipTests clean package` — the build succeeds.

Author: Gio Borje

Closes #13265 from Hydrotoast/master.

(cherry picked from commit 589cce93c821ac28e9090a478f6e7465398b7c30)
Signed-off-by: Sean Owen

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0064a4dc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0064a4dc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0064a4dc

Branch: refs/heads/branch-2.0
Commit: 0064a4dcbed1d91732a29c2cede464b8d148aeca
Parents: f7158c4
Author: Gio Borje
Authored: Wed May 25 16:52:31 2016 -0500
Committer: Sean Owen
Committed: Wed May 25 16:52:48 2016 -0500

----------------------------------------------------------------------
 .../org/apache/spark/mllib/optimization/GradientDescent.scala | 5 +++++
 1 file changed, 5 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/0064a4dc/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala b/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
index a67ea83..735e780 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
@@ -197,6 +197,11 @@ object GradientDescent extends Logging {
         "< 1.0 can be unstable because of the stochasticity in sampling.")
     }
 
+    if (numIterations * miniBatchFraction < 1.0) {
+      logWarning("Not all examples will be used if numIterations * miniBatchFraction < 1.0: " +
+        s"numIterations=$numIterations and miniBatchFraction=$miniBatchFraction")
+    }
+
     val stochasticLossHistory = new ArrayBuffer[Double](numIterations)
     // Record previous weight and current one to calculate solution vector difference
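The arithmetic behind the new warning is easy to check by hand. Below is a minimal sketch in plain Scala (no Spark dependencies) using the numbers from the PR description; `maxFractionUsed` is a hypothetical helper for illustration, not part of `GradientDescent`:

```scala
// Sketch of the coverage bound behind the warning. maxFractionUsed is a
// hypothetical helper, not a Spark API.
object MiniBatchCoverage {
  // Upper bound on the fraction of the data set that can ever be sampled:
  // each of the numIterations mini-batches draws ~miniBatchFraction of the data,
  // so even with no overlap at most numIterations * miniBatchFraction is seen.
  def maxFractionUsed(numIterations: Int, miniBatchFraction: Double): Double =
    math.min(1.0, numIterations * miniBatchFraction)

  def main(args: Array[String]): Unit = {
    val numExamples = 100
    val numIterations = 3
    val miniBatchFraction = 0.2

    val bound = maxFractionUsed(numIterations, miniBatchFraction)
    // Each iteration samples ~20 examples; at most 60 of 100 are ever used.
    println(s"At most ${(bound * numExamples).toInt} of $numExamples examples are used")

    // The condition the patch now warns about.
    if (numIterations * miniBatchFraction < 1.0) {
      println("numIterations * miniBatchFraction < 1.0: not all examples will be used")
    }
  }
}
```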
spark git commit: Log warnings for numIterations * miniBatchFraction < 1.0
Repository: spark
Updated Branches:
  refs/heads/master 9c297df3d -> 589cce93c


Log warnings for numIterations * miniBatchFraction < 1.0

## What changes were proposed in this pull request?

Add a warning log for the case that `numIterations * miniBatchFraction < 1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and `numIterations = 3`. Then 3 iterations will occur, each sampling approximately 20 examples. Even in the best case, where every sampled example is distinct, at most 60 of the 100 examples are ever used. This may be counter-intuitive to most users and led to an issue during the development of another Spark ML model: https://github.com/zhengruifeng/spark-libFM/issues/11. If a user actually does not require the full training data set, it would be easier and more intuitive to downsample it explicitly with `RDD.sample`.

## How was this patch tested?

`build/mvn -DskipTests clean package` — the build succeeds.

Author: Gio Borje

Closes #13265 from Hydrotoast/master.

Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/589cce93
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/589cce93
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/589cce93

Branch: refs/heads/master
Commit: 589cce93c821ac28e9090a478f6e7465398b7c30
Parents: 9c297df
Author: Gio Borje
Authored: Wed May 25 16:52:31 2016 -0500
Committer: Sean Owen
Committed: Wed May 25 16:52:31 2016 -0500

----------------------------------------------------------------------
 .../org/apache/spark/mllib/optimization/GradientDescent.scala | 5 +++++
 1 file changed, 5 insertions(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/spark/blob/589cce93/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
----------------------------------------------------------------------
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala b/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
index a67ea83..735e780 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
@@ -197,6 +197,11 @@ object GradientDescent extends Logging {
         "< 1.0 can be unstable because of the stochasticity in sampling.")
     }
 
+    if (numIterations * miniBatchFraction < 1.0) {
+      logWarning("Not all examples will be used if numIterations * miniBatchFraction < 1.0: " +
+        s"numIterations=$numIterations and miniBatchFraction=$miniBatchFraction")
+    }
+
     val stochasticLossHistory = new ArrayBuffer[Double](numIterations)
     // Record previous weight and current one to calculate solution vector difference
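For the `RDD.sample` alternative mentioned in the description, a hedged sketch follows. The toy data set and the use of `LinearRegressionWithSGD` are illustrative assumptions only; the patch itself adds the warning and does not change training behavior:

```scala
// Sketch of downsampling explicitly with RDD.sample instead of relying on a
// small miniBatchFraction over few iterations. The data and model choice here
// are assumptions for illustration, not part of the patch.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

object ExplicitSampleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("explicit-sample").setMaster("local[*]"))

    // 100 toy examples, mirroring numExamples = 100 in the description.
    val data = sc.parallelize(Seq.tabulate(100) { i =>
      LabeledPoint(2.0 * i / 100.0, Vectors.dense(i / 100.0))
    })

    // Take the 20% subset once, explicitly, so it is obvious what is trained on.
    val sampled = data.sample(withReplacement = false, fraction = 0.2, seed = 42L).cache()

    // Train with miniBatchFraction = 1.0 so every iteration uses the whole sample,
    // which avoids the numIterations * miniBatchFraction < 1.0 situation entirely.
    val model = LinearRegressionWithSGD.train(sampled, 100, 0.1, 1.0)
    println(s"weights: ${model.weights}")

    sc.stop()
  }
}
```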