spark git commit: Log warnings for numIterations * miniBatchFraction < 1.0

2016-05-25 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/branch-2.0 f7158c482 -> 0064a4dcb


Log warnings for numIterations * miniBatchFraction < 1.0

## What changes were proposed in this pull request?

Add a warning log for the case that `numIterations * miniBatchFraction < 1.0` 
during gradient descent. If the product of those two numbers is less than 
`1.0`, then not all training examples will be used during optimization. 
Concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and 
`numIterations = 3`. Then 3 iterations will occur, each sampling approximately 
20 examples. Even in the best case, where every sampled example is unique, 
only 60/100 examples are used.
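For illustration (not part of the patch), a minimal self-contained sketch of 
that best-case arithmetic; `CoverageSketch` and its method name are 
hypothetical, not Spark API:

```scala
// Hypothetical sketch: in the best case each iteration samples a disjoint
// miniBatchFraction of the data, so at most numIterations * miniBatchFraction
// of the training set is ever seen.
object CoverageSketch {
  def bestCaseCoverage(numIterations: Int, miniBatchFraction: Double): Double =
    math.min(1.0, numIterations * miniBatchFraction)

  def main(args: Array[String]): Unit = {
    val numExamples = 100
    val coverage = bestCaseCoverage(numIterations = 3, miniBatchFraction = 0.2)
    // Prints "At most 60 of 100 examples are used"
    println(s"At most ${(coverage * numExamples).toInt} of $numExamples examples are used")
  }
}
```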

This may be counter-intuitive to most users, and it led to an issue during the 
development of another Spark ML model: 
https://github.com/zhengruifeng/spark-libFM/issues/11. If a user actually does 
not require the full training data set, it would be easier and more intuitive 
to subsample it explicitly with `RDD.sample`, as sketched below.
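A minimal sketch of that alternative, assuming a pre-existing `trainingData` 
RDD (the variable name and parameter values are illustrative; 
`RDD.sample(withReplacement, fraction, seed)` is the actual Spark API):

```scala
// Subsample the training set once, up front, instead of letting
// miniBatchFraction silently discard data across iterations.
val subsample = trainingData.sample(withReplacement = false, fraction = 0.6, seed = 42L)
```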

## How was this patch tested?

Verified that `build/mvn -DskipTests clean package` succeeds.

Author: Gio Borje 

Closes #13265 from Hydrotoast/master.

(cherry picked from commit 589cce93c821ac28e9090a478f6e7465398b7c30)
Signed-off-by: Sean Owen 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0064a4dc
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0064a4dc
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0064a4dc

Branch: refs/heads/branch-2.0
Commit: 0064a4dcbed1d91732a29c2cede464b8d148aeca
Parents: f7158c4
Author: Gio Borje 
Authored: Wed May 25 16:52:31 2016 -0500
Committer: Sean Owen 
Committed: Wed May 25 16:52:48 2016 -0500

--
 .../org/apache/spark/mllib/optimization/GradientDescent.scala   | 5 +
 1 file changed, 5 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/0064a4dc/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala b/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
index a67ea83..735e780 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
@@ -197,6 +197,11 @@ object GradientDescent extends Logging {
         "< 1.0 can be unstable because of the stochasticity in sampling.")
     }
 
+    if (numIterations * miniBatchFraction < 1.0) {
+      logWarning("Not all examples will be used if numIterations * miniBatchFraction < 1.0: " +
+        s"numIterations=$numIterations and miniBatchFraction=$miniBatchFraction")
+    }
+
     val stochasticLossHistory = new ArrayBuffer[Double](numIterations)
     // Record previous weight and current one to calculate solution vector difference
 





spark git commit: Log warnings for numIterations * miniBatchFraction < 1.0

2016-05-25 Thread srowen
Repository: spark
Updated Branches:
  refs/heads/master 9c297df3d -> 589cce93c


Log warnings for numIterations * miniBatchFraction < 1.0

## What changes were proposed in this pull request?

Add a warning log for the case that `numIterations * miniBatchFraction < 1.0` 
during gradient descent. If the product of those two numbers is less than 
`1.0`, then not all training examples will be used during optimization. 
Concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and 
`numIterations = 3`. Then 3 iterations will occur, each sampling approximately 
20 examples. Even in the best case, where every sampled example is unique, 
only 60/100 examples are used.

This may be counter-intuitive to most users, and it led to an issue during the 
development of another Spark ML model: 
https://github.com/zhengruifeng/spark-libFM/issues/11. If a user actually does 
not require the full training data set, it would be easier and more intuitive 
to subsample it explicitly with `RDD.sample`.

## How was this patch tested?

Verified that `build/mvn -DskipTests clean package` succeeds.

Author: Gio Borje 

Closes #13265 from Hydrotoast/master.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/589cce93
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/589cce93
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/589cce93

Branch: refs/heads/master
Commit: 589cce93c821ac28e9090a478f6e7465398b7c30
Parents: 9c297df
Author: Gio Borje 
Authored: Wed May 25 16:52:31 2016 -0500
Committer: Sean Owen 
Committed: Wed May 25 16:52:31 2016 -0500

--
 .../org/apache/spark/mllib/optimization/GradientDescent.scala   | 5 +
 1 file changed, 5 insertions(+)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/589cce93/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
--
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala b/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
index a67ea83..735e780 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/optimization/GradientDescent.scala
@@ -197,6 +197,11 @@ object GradientDescent extends Logging {
         "< 1.0 can be unstable because of the stochasticity in sampling.")
     }
 
+    if (numIterations * miniBatchFraction < 1.0) {
+      logWarning("Not all examples will be used if numIterations * miniBatchFraction < 1.0: " +
+        s"numIterations=$numIterations and miniBatchFraction=$miniBatchFraction")
+    }
+
     val stochasticLossHistory = new ArrayBuffer[Double](numIterations)
     // Record previous weight and current one to calculate solution vector difference
 

