[GitHub] spark pull request #16495: SPARK-16920: Add a stress test for evaluateEachIt...
Github user mhmoudr closed the pull request at: https://github.com/apache/spark/pull/16495 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16495: SPARK-16920: Add a stress test for evaluateEachIteration...
Github user mhmoudr commented on the issue: https://github.com/apache/spark/pull/16495 * To traing a (one-off) model a keep it in resources, then load it to run the test. * investigate the added comlexity issue. * To move the test away from unit tests * Provide instructions on how to run the stress test --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #16495: SPARK-16920: Add a stress test for evaluateEachIteration...
Github user mhmoudr commented on the issue: https://github.com/apache/spark/pull/16495 Sure .. I am, but just a bit busy nowadays, in general I am planing to investigate the additional complexity issue, and as for the test I agree with the fact that the test should not be added to unit test as far as there is a specific agreed manner for adding stress tests, also to add, in term of the test itself, I would say it will be a wast of time to train the model every time we need to test this area, so I was thinking of train a model once and save it to resources, then once I need to test the evaluateEachIteration it will be much faster to lead the pre saved model from resources and run the test directly, thoughts? for the time being I will close the pull request and once I have the change ready I will create a new one. Mahmoud --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #16495: SPARK-16920: Add a stress test for evaluateEachIt...
GitHub user mhmoudr opened a pull request: https://github.com/apache/spark/pull/16495 SPARK-16920: Add a stress test for evaluateEachIteration for 2000 trees ## What changes were proposed in this pull request? Just adding a test to prove error by tree is working for 2000 trees, the fix of SPARK-15858 before it was failing to do the calculation after long time ## How was this patch tested? Just run the test You can merge this pull request into a Git repository by running: $ git pull https://github.com/mhmoudr/spark SPARK-16920 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/16495.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16495 commit d3e2dadaa767bd3fec10ba329625ceaa5ccabcbb Author: Mahmoud Rawas <mhmo...@gmail.com> Date: 2017-01-07T02:35:46Z SPARK-16920: Add a stress test for calculating error by tree (evaluateEachIteration) for 2000 trees --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13624: [SPARK-15858][ML]: Fix calculating error by tree ...
Github user mhmoudr commented on a diff in the pull request: https://github.com/apache/spark/pull/13624#discussion_r68669564 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala --- @@ -205,31 +205,28 @@ private[spark] object GradientBoostedTrees extends Logging { case _ => data } -val numIterations = trees.length -val evaluationArray = Array.fill(numIterations)(0.0) -val localTreeWeights = treeWeights - -var predictionAndError = computeInitialPredictionAndError( - remappedData, localTreeWeights(0), trees(0), loss) - -evaluationArray(0) = predictionAndError.values.mean() - val broadcastTrees = sc.broadcast(trees) -(1 until numIterations).foreach { nTree => - predictionAndError = remappedData.zip(predictionAndError).mapPartitions { iter => -val currentTree = broadcastTrees.value(nTree) -val currentTreeWeight = localTreeWeights(nTree) -iter.map { case (point, (pred, error)) => - val newPred = updatePrediction(point.features, pred, currentTree, currentTreeWeight) - val newError = loss.computeError(newPred, point.label) - (newPred, newError) -} +val localTreeWeights = treeWeights +val treesIndices = trees.indices + +val dataCount = remappedData.count() +val evaluation = remappedData.map { point => + treesIndices.map { idx => +val prediction = broadcastTrees.value(idx) + .rootNode + .predictImpl(point.features) + .prediction +prediction * localTreeWeights(idx) } - evaluationArray(nTree) = predictionAndError.values.mean() + .scanLeft(0.0)(_ + _).drop(1) + .map(prediction => loss.computeError(prediction, point.label)) } - -broadcastTrees.unpersist() -evaluationArray + .aggregate(treesIndices.map(_ => 0.0))( --- End diff -- Done. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13624: [SPARK-15858][ML]: Fix calculating error by tree ...
Github user mhmoudr commented on a diff in the pull request: https://github.com/apache/spark/pull/13624#discussion_r68513737 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/tree/model/treeEnsembleModels.scala --- @@ -151,31 +151,23 @@ class GradientBoostedTreesModel @Since("1.2.0") ( case _ => data } -val numIterations = trees.length -val evaluationArray = Array.fill(numIterations)(0.0) -val localTreeWeights = treeWeights - -var predictionAndError = GradientBoostedTreesModel.computeInitialPredictionAndError( - remappedData, localTreeWeights(0), trees(0), loss) - -evaluationArray(0) = predictionAndError.values.mean() - val broadcastTrees = sc.broadcast(trees) -(1 until numIterations).foreach { nTree => - predictionAndError = remappedData.zip(predictionAndError).mapPartitions { iter => -val currentTree = broadcastTrees.value(nTree) -val currentTreeWeight = localTreeWeights(nTree) -iter.map { case (point, (pred, error)) => - val newPred = pred + currentTree.predict(point.features) * currentTreeWeight - val newError = loss.computeError(newPred, point.label) - (newPred, newError) -} - } - evaluationArray(nTree) = predictionAndError.values.mean() +val localTreeWeights = treeWeights +val treesIndices = trees.indices + +val dataCount = remappedData.count() +val evaluation = remappedData.map { point => + treesIndices +.map(idx => broadcastTrees.value(idx).predict(point.features) * localTreeWeights(idx)) +.scanLeft(0.0)(_ + _).drop(1) +.map(prediction => loss.computeError(prediction, point.label)) } - -broadcastTrees.unpersist() -evaluationArray + .aggregate(treesIndices.map(_ => 0.0))( +(aggregated, row) => treesIndices.map(idx => aggregated(idx) + row(idx)), +(a, b) => treesIndices.map(idx => a(idx) + b(idx))) + .map(d => d / dataCount) --- End diff -- my bad .. fixed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13624: [SPARK-15858][ML]: Fix calculating error by tree ...
Github user mhmoudr commented on a diff in the pull request: https://github.com/apache/spark/pull/13624#discussion_r68513735 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala --- @@ -205,31 +205,27 @@ private[spark] object GradientBoostedTrees extends Logging { case _ => data } -val numIterations = trees.length -val evaluationArray = Array.fill(numIterations)(0.0) -val localTreeWeights = treeWeights - -var predictionAndError = computeInitialPredictionAndError( - remappedData, localTreeWeights(0), trees(0), loss) - -evaluationArray(0) = predictionAndError.values.mean() - val broadcastTrees = sc.broadcast(trees) -(1 until numIterations).foreach { nTree => - predictionAndError = remappedData.zip(predictionAndError).mapPartitions { iter => -val currentTree = broadcastTrees.value(nTree) -val currentTreeWeight = localTreeWeights(nTree) -iter.map { case (point, (pred, error)) => - val newPred = updatePrediction(point.features, pred, currentTree, currentTreeWeight) - val newError = loss.computeError(newPred, point.label) - (newPred, newError) -} - } - evaluationArray(nTree) = predictionAndError.values.mean() +val localTreeWeights = treeWeights +val treesIndices = trees.indices + +val dataCount = remappedData.count() +val evaluation = remappedData.map { point => + treesIndices.map { idx => +val prediction = broadcastTrees.value(idx) + .rootNode + .predictImpl(point.features) + .prediction +prediction * localTreeWeights(idx) + }.scanLeft(0.0)(_ + _).drop(1) --- End diff -- I don't fully agree with this, as it will confuse the reader with a new line if I align it with the `{` above it, while `.scanLeft` still belongs to the line / instructions started at `treesIndices.map { idx =>` Anyway if that's is the rule for this repository I will apply the change. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13624: [SPARK-15858][ML]: Fix calculating error by tree ...
Github user mhmoudr commented on a diff in the pull request: https://github.com/apache/spark/pull/13624#discussion_r68497920 --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/GradientBoostedTrees.scala --- @@ -205,31 +205,29 @@ private[spark] object GradientBoostedTrees extends Logging { case _ => data } -val numIterations = trees.length -val evaluationArray = Array.fill(numIterations)(0.0) -val localTreeWeights = treeWeights - -var predictionAndError = computeInitialPredictionAndError( - remappedData, localTreeWeights(0), trees(0), loss) - -evaluationArray(0) = predictionAndError.values.mean() - val broadcastTrees = sc.broadcast(trees) -(1 until numIterations).foreach { nTree => - predictionAndError = remappedData.zip(predictionAndError).mapPartitions { iter => -val currentTree = broadcastTrees.value(nTree) -val currentTreeWeight = localTreeWeights(nTree) -iter.map { case (point, (pred, error)) => - val newPred = updatePrediction(point.features, pred, currentTree, currentTreeWeight) - val newError = loss.computeError(newPred, point.label) - (newPred, newError) -} +val localTreeWeights = treeWeights +val treesIndices = trees.indices + +val dataCount = remappedData.count() +val evaluation = remappedData.map { point => + treesIndices.map { idx => { +val prediction = broadcastTrees.value(idx) + .rootNode + .predictImpl(point.features) + .prediction +prediction * localTreeWeights(idx) } - evaluationArray(nTree) = predictionAndError.values.mean() + } +.scanLeft(0.0)(_ + _).drop(1) --- End diff -- I was just relying on Intellij to adjust all the indents, the only issue is that if I join it with the previous line the second line will look even worse. is there any scala style rules to be applied automatically on build time so we avoid going into this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13624: [SPARK-15858][ML]: Fix calculating error by tree stack o...
Github user mhmoudr commented on the issue: https://github.com/apache/spark/pull/13624 Applied the required changes on styles. (not changes on the logic) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13590: SPARK-15858: Fix calculating error by tree stack ...
Github user mhmoudr closed the pull request at: https://github.com/apache/spark/pull/13590 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13590: SPARK-15858: Fix calculating error by tree stack over fl...
Github user mhmoudr commented on the issue: https://github.com/apache/spark/pull/13590 PR #13624 has been created for master branch. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13588: SPARK-15858: Fix calculating error by tree stack over fl...
Github user mhmoudr commented on the issue: https://github.com/apache/spark/pull/13588 PR #13624 has been created into master branch for the same fix. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13624: [SPARK-15858][ML]: Fix calculating error by tree ...
GitHub user mhmoudr opened a pull request: https://github.com/apache/spark/pull/13624 [SPARK-15858][ML]: Fix calculating error by tree stack over flow prob⦠## What changes were proposed in this pull request? What changes were proposed in this pull request? Improving evaluateEachIteration function in mllib as it fails when trying to calculate error by tree for a model that has more than 500 trees ## How was this patch tested? the batch tested on productions data set (2K rows x 2K features) training a gradient boosted model without validation with 1000 maxIteration settings, then trying to produce the error by tree, the new patch was able to perform the calculation within 30 seconds, while previously it was take hours then fail. **PS**: It would be better if this PR can be cherry picked into release branches 1.6.1 and 2.0 You can merge this pull request into a Git repository by running: $ git pull https://github.com/mhmoudr/spark SPARK-15858.master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13624.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13624 commit 7e9e383bbf8246fff292f506bda3aa636b6bd263 Author: Mahmoud Rawas <mhmo...@gmail.com> Date: 2016-06-12T11:52:01Z [SPARK-15858][ML]: Fix calculating error by tree stack over flow problem and over memory allocation issue for a model that have 2000+ trees. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13588: SPARK-15858: Fix calculating error by tree stack over fl...
Github user mhmoudr commented on the issue: https://github.com/apache/spark/pull/13588 Thanks for the reply.. I can still see the same code that cause the problem in master, right now I am working on a PR for master that have the fix. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13588: SPARK-15858: Fix calculating error by tree stack ...
Github user mhmoudr closed the pull request at: https://github.com/apache/spark/pull/13588 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13588: SPARK-15858: Fix calculating error by tree stack over fl...
Github user mhmoudr commented on the issue: https://github.com/apache/spark/pull/13588 I am going to close this PR, but my question is what is the process of getting a fix into one of the release branches? as this does not introduce any new functionality but just fixes an existing issue. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #13588: SPARK-15858: Fix calculating error by tree stack over fl...
Github user mhmoudr commented on the issue: https://github.com/apache/spark/pull/13588 This PR contains exactly the same fix but targeting version 1.6 as if there is a plan to release 1.6.2 in the future, if that was not the case let me know to close it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13590: SPARK-15858: Fix calculating error by tree stack ...
GitHub user mhmoudr opened a pull request: https://github.com/apache/spark/pull/13590 SPARK-15858: Fix calculating error by tree stack over flow problem an⦠## What changes were proposed in this pull request? Improving evaluateEachIteration function in mllib as it fails when trying to calculate error by tree for a model that has more than 500 trees ## How was this patch tested? the batch tested on productions data set (2K rows x 2K features) training a gradient boosted model without validation with 1000 maxIteration settings, then trying to produce the error by tree, the new patch was able to perform the calculation within 30 seconds, while previously it was take hours then fail. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mhmoudr/spark SPARK-15858 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13590.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13590 commit a72dbdd0d60eec346d66bb9ab728643c9d4bad0f Author: Mahmoud Rawas <mhmo...@gmail.com> Date: 2016-06-10T01:27:21Z SPARK-15858: Fix calculating error by tree stack over flow problem and over memory allocation issue for a model that have 2000+ trees. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #13588: SPARK-15858: Fix calculating error by tree stack ...
GitHub user mhmoudr opened a pull request: https://github.com/apache/spark/pull/13588 SPARK-15858: Fix calculating error by tree stack over flow problem an⦠## What changes were proposed in this pull request? Improving evaluateEachIteration function in mllib as it fails when trying to calculate error by tree for a model that has more than 500 trees ## How was this patch tested? the batch tested on productions data set (2K rows x 2K features) training a gradient boosted model without validation with 1000 maxIteration settings, then trying to produce the error by tree, the new patch was able to perform the calculation within 30 seconds, while previously it was take hours then fail. You can merge this pull request into a Git repository by running: $ git pull https://github.com/mhmoudr/spark SPARK-15858.1.6 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/13588.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #13588 commit 4726937bacd6ee43dd12b27e1746bc708e99c6da Author: Mahmoud Rawas <mhmo...@gmail.com> Date: 2016-06-10T01:27:21Z SPARK-15858: Fix calculating error by tree stack over flow problem and over memory allocation issue for a model that have 2000+ trees. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org