[jira] [Commented] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images
[ https://issues.apache.org/jira/browse/SPARK-23205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338360#comment-16338360 ]

Siddharth Murching commented on SPARK-23205:
--------------------------------------------

Working on a PR to address this issue

> ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images
>
> Key: SPARK-23205
> URL: https://issues.apache.org/jira/browse/SPARK-23205
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.3.0
> Reporter: Siddharth Murching
> Priority: Critical
>
> When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)] that sets alpha = 255, even for four-channel images.
>
> See the offending line here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172
>
> A fix is to simply update the line to:
> val color = new Color(img.getRGB(w, h), nChannels == 4)
> instead of:
> val color = new Color(img.getRGB(w, h))

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images
[ https://issues.apache.org/jira/browse/SPARK-23205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338360#comment-16338360 ]

Siddharth Murching edited comment on SPARK-23205 at 1/24/18 10:40 PM:
----------------------------------------------------------------------

I'm working on a PR to address this issue if that's alright :)

was (Author: siddharth murching): Working on a PR to address this issue
[jira] [Created] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images
Siddharth Murching created SPARK-23205:
---------------------------------------

Summary: ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images
Key: SPARK-23205
URL: https://issues.apache.org/jira/browse/SPARK-23205
Project: Spark
Issue Type: Bug
Components: ML, MLlib
Affects Versions: 2.3.0
Reporter: Siddharth Murching

When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)] that sets alpha = 255, even for four-channel images.

See the offending line here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172

A fix is to simply update the line to:

val color = new Color(img.getRGB(w, h), nChannels == 4)

instead of:

val color = new Color(img.getRGB(w, h))
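For reference, the behavior the fix relies on can be checked in plain Java: the linked single-int java.awt.Color constructor always reports alpha = 255, while the (int, boolean) constructor honors the high byte when hasalpha is true — which is what passing nChannels == 4 achieves. A quick standalone check:

```java
import java.awt.Color;

public class AlphaCheck {
    public static void main(String[] args) {
        // ARGB pixel: alpha = 0x00 (fully transparent), red = 0xFF.
        int argb = 0x00FF0000;

        // Color(int) ignores bits 24-31 and forces alpha to 255.
        Color withoutAlpha = new Color(argb);
        // Color(int, true) keeps bits 24-31 as the alpha channel,
        // mirroring the nChannels == 4 case in the suggested fix.
        Color withAlpha = new Color(argb, true);

        System.out.println(withoutAlpha.getAlpha()); // 255
        System.out.println(withAlpha.getAlpha());    // 0
    }
}
```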
[jira] [Comment Edited] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192201#comment-16192201 ]

Siddharth Murching edited comment on SPARK-3162 at 10/4/17 11:35 PM:
---------------------------------------------------------------------

Commenting here to note that I'd like to resume work on this issue; I've made a new PR^

was (Author: siddharth murching): Commenting here to note that I'm resuming work on this issue; I've made a new PR^

> Train DecisionTree locally when possible
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Priority: Critical
>
> Improvement: communication
>
> Currently, every level of a DecisionTree is trained in a distributed manner. However, at deeper levels in the tree, it is possible that only a small set of training data will be matched with any given node. If the node's training data can fit in one machine's memory, it may be more efficient to shuffle the data and do local training for the rest of the subtree rooted at that node.
>
> Note: It is possible that local training would become possible at different levels in different branches of the tree. There are multiple options for handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained locally. This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with distributed training of the other branches.
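The switch-over point in option (1) above boils down to a per-node memory check on the current training frontier. A minimal illustrative sketch — the method names and the flat bytes-per-row cost model are assumptions for illustration, not Spark's actual implementation:

```java
public class LocalTrainingCheck {
    // Option (1) from the description: move from distributed to local training
    // only once *every* node on the current frontier fits in one worker's memory.
    static boolean canFinishLocally(long[] frontierRowCounts, long bytesPerRow,
                                    long workerMemoryBytes) {
        for (long rows : frontierRowCounts) {
            if (rows * bytesPerRow > workerMemoryBytes) {
                return false; // at least one node still needs distributed training
            }
        }
        return true;
    }

    public static void main(String[] args) {
        long bytesPerRow = 100L;
        long workerMemory = 1L << 20; // 1 MiB per-worker budget (illustrative)
        // Frontier with one large node: keep training level-by-level, distributed.
        System.out.println(canFinishLocally(new long[]{50_000, 200}, bytesPerRow, workerMemory));
        // Deeper frontier where every node is small: shuffle and finish locally.
        System.out.println(canFinishLocally(new long[]{4_000, 200}, bytesPerRow, workerMemory));
    }
}
```

Option (2) would replace the all-or-nothing check with a per-branch one, at the cost of interleaving local and distributed work.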
[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192201#comment-16192201 ]

Siddharth Murching commented on SPARK-3162:
--------------------------------------------

Commenting here to note that I'm resuming work on this issue; I've made a new PR^
[jira] [Commented] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16161634#comment-16161634 ]

Siddharth Murching commented on SPARK-21972:
--------------------------------------------

Link to old PR containing work on this: https://github.com/apache/spark/pull/17014

> Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 2.2.0
> Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call {{cache()}} on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents. These issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
>
> This ticket proposes adding a boolean {{handlePersistence}} param (org.apache.spark.ml.param) so that users can specify whether an ML algorithm should try to cache un-cached input data. {{handlePersistence}} will be {{true}} by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting {{handlePersistence}} to {{false}}.
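The caching decision the proposed param would control can be sketched in a few lines. This is a hedged, hypothetical illustration of the ticket's intent, not Spark's implementation (the param was only proposed here; the enum and method names below are invented for the example):

```java
// Hypothetical sketch of the fit()-time caching decision under the proposal:
// cache only when the user opted in AND the input is not already persisted,
// avoiding the double-caching described in SPARK-21799.
public class HandlePersistenceSketch {
    enum StorageLevel { NONE, MEMORY_AND_DISK } // stand-in for Spark's StorageLevel

    static boolean shouldCache(boolean handlePersistence, StorageLevel inputLevel) {
        return handlePersistence && inputLevel == StorageLevel.NONE;
    }

    public static void main(String[] args) {
        // Default (handlePersistence = true): uncached input gets persisted.
        System.out.println(shouldCache(true, StorageLevel.NONE));
        // User opts out: persistence is left entirely to the caller.
        System.out.println(shouldCache(false, StorageLevel.NONE));
        // Already-cached input is never cached a second time.
        System.out.println(shouldCache(true, StorageLevel.MEMORY_AND_DISK));
    }
}
```

Note the sketch checks only the input's own storage level; part (b) of the ticket points out that a full fix would also need to consider the persistence of the input's parent plans.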
[jira] [Issue Comment Deleted] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Murching updated SPARK-21972:
---------------------------------------

Comment: was deleted

(was: This issue was originally being worked on in this PR: [https://github.com/apache/spark/pull/17014|https://github.com/apache/spark/pull/17014])
[jira] [Comment Edited] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160624#comment-16160624 ]

Siddharth Murching edited comment on SPARK-21972 at 9/11/17 5:22 PM:
---------------------------------------------------------------------

This issue was originally being worked on in this PR: [https://github.com/apache/spark/pull/17014|https://github.com/apache/spark/pull/17014]

was (Author: siddharth murching): This issue is being worked on in this PR: [https://github.com/apache/spark/pull/17014|https://github.com/apache/spark/pull/17014]
[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Murching updated SPARK-21972:
---------------------------------------

Description:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call {{cache()}} on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents. These issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). This ticket proposes adding a boolean {{handlePersistence}} param (org.apache.spark.ml.param) so that users can specify whether an ML algorithm should try to cache un-cached input data. {{handlePersistence}} will be {{true}} by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting {{handlePersistence}} to {{false}}.

was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call {{cache()}} on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents. These issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). This ticket proposes adding a boolean {{handlePersistence}} param (org.apache.spark.ml.param) so that users can specify whether an ML algorithm should try to cache un-cached input data. {{handlePersistence}} will be {{true}} by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting {{handlePersistence}} to {{false}}.
[jira] [Commented] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160624#comment-16160624 ]

Siddharth Murching commented on SPARK-21972:
--------------------------------------------

Work has already begun on this in this PR: [https://github.com/apache/spark/pull/17014|https://github.com/apache/spark/pull/17014]
[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Murching updated SPARK-21972:
---------------------------------------

Description:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call {{cache()}} on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents. These issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). This ticket proposes adding a boolean {{handlePersistence}} param (org.apache.spark.ml.param) so that users can specify whether an ML algorithm should try to cache un-cached input data. {{handlePersistence}} will be {{true}} by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting {{handlePersistence}} to {{false}}.

was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call {{cache()}} on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents. These issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). This ticket proposes adding a boolean {{handlePersistence}} param (org.apache.spark.ml.param) so that users can specify whether an ML algorithm should try to cache un-cached input data. {{handlePersistence}} will be {{true}} by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting {{handlePersistence}} to {{false}} (algorithms will not try to persist uncached input).
[jira] [Comment Edited] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160624#comment-16160624 ]

Siddharth Murching edited comment on SPARK-21972 at 9/11/17 3:46 AM:
---------------------------------------------------------------------

This issue is being worked on in this PR: [https://github.com/apache/spark/pull/17014|https://github.com/apache/spark/pull/17014]

was (Author: siddharth murching): Work has already begun on this in this PR: [https://github.com/apache/spark/pull/17014|https://github.com/apache/spark/pull/17014]
[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Murching updated SPARK-21972:
---------------------------------------

Description:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call {{cache()}} on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents. These issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). This ticket proposes adding a boolean {{handlePersistence}} param (org.apache.spark.ml.param) so that users can specify whether an ML algorithm should try to cache un-cached input data. {{handlePersistence}} will be {{true}} by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting {{handlePersistence}} to {{false}} (algorithms will not try to persist uncached input).

was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call {{cache()}} on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents; these issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). This ticket proposes adding a boolean {{handlePersistence}} param (org.apache.spark.ml.param) so that users can specify whether an ML algorithm should try to cache un-cached input data. {{handlePersistence}} will be {{true}} by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting {{handlePersistence}} to {{false}}.
[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Murching updated SPARK-21972:
---------------------------------------

Description:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call {{cache()}} on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents; these issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). This ticket proposes adding a boolean {{handlePersistence}} param (org.apache.spark.ml.param) so that users can specify whether an ML algorithm should try to cache un-cached input data. {{handlePersistence}} will be {{true}} by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting {{handlePersistence}} to {{false}}.

was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call `cache()` on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents; these issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). This ticket proposes adding a boolean `handlePersistence` param (org.apache.spark.ml.param) to the abovementioned estimators so that users can specify whether an ML algorithm should try to cache un-cached input data. `handlePersistence` will be `true` by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting `handlePersistence` to `false`.
[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Murching updated SPARK-21972: --- Description: Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call `cache()` on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents; these issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). This ticket proposes adding a boolean `handlePersistence` param (org.apache.spark.ml.param) to the above-mentioned estimators so that users can specify whether an ML algorithm should try to cache un-cached input data. `handlePersistence` will be `true` by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting `handlePersistence` to `false`. was: Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call `cache()` on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (as described in [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents; both of these issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). This ticket proposes adding a boolean `handlePersistence` param (org.apache.spark.ml.param) to the above-mentioned estimators so that users can specify whether an ML algorithm should try to cache un-cached input data. 
`handlePersistence` will be `true` by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting `handlePersistence` to `false`. > Allow users to control input data persistence in ML Estimators via a > handlePersistence ml.Param > --- > > Key: SPARK-21972 > URL: https://issues.apache.org/jira/browse/SPARK-21972 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Siddharth Murching > > Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, > etc) call `cache()` on uncached input datasets to improve performance. > Unfortunately, these algorithms a) check input persistence inaccurately (see > [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check > the persistence level of the input dataset but not any of its parents; these > issues can result in unwanted double-caching of input data & degraded > performance (see > [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). > This ticket proposes adding a boolean `handlePersistence` param > (org.apache.spark.ml.param) to the above-mentioned estimators so that users > can specify whether an ML algorithm should try to cache un-cached input data. > `handlePersistence` will be `true` by default, corresponding to existing > behavior (always persisting uncached input), but users can achieve > finer-grained control over input persistence by setting `handlePersistence` > to `false`.
[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
[ https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Murching updated SPARK-21972: --- Description: Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call `cache()` on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents; these issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). This ticket proposes adding a boolean `handlePersistence` param (org.apache.spark.ml.param) to the above-mentioned estimators so that users can specify whether an ML algorithm should try to cache un-cached input data. `handlePersistence` will be `true` by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting `handlePersistence` to `false`. was: Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call `cache()` on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (see [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents; these issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). This ticket proposes adding a boolean `handlePersistence` param (org.apache.spark.ml.param) to the above-mentioned estimators so that users can specify whether an ML algorithm should try to cache un-cached input data. 
`handlePersistence` will be `true` by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting `handlePersistence` to `false`. > Allow users to control input data persistence in ML Estimators via a > handlePersistence ml.Param > --- > > Key: SPARK-21972 > URL: https://issues.apache.org/jira/browse/SPARK-21972 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Siddharth Murching > > Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, > etc) call `cache()` on uncached input datasets to improve performance. > Unfortunately, these algorithms a) check input persistence inaccurately (see > [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check > the persistence level of the input dataset but not any of its parents; these > issues can result in unwanted double-caching of input data & degraded > performance (see > [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). > This ticket proposes adding a boolean `handlePersistence` param > (org.apache.spark.ml.param) to the above-mentioned estimators so that users > can specify whether an ML algorithm should try to cache un-cached input data. > `handlePersistence` will be `true` by default, corresponding to existing > behavior (always persisting uncached input), but users can achieve > finer-grained control over input persistence by setting `handlePersistence` > to `false`.
[jira] [Created] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param
Siddharth Murching created SPARK-21972: -- Summary: Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param Key: SPARK-21972 URL: https://issues.apache.org/jira/browse/SPARK-21972 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.2.0 Reporter: Siddharth Murching Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) call `cache()` on uncached input datasets to improve performance. Unfortunately, these algorithms a) check input persistence inaccurately (as described in [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check the persistence level of the input dataset but not any of its parents; both of these issues can result in unwanted double-caching of input data & degraded performance (see [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]). This ticket proposes adding a boolean `handlePersistence` param (org.apache.spark.ml.param) to the above-mentioned estimators so that users can specify whether an ML algorithm should try to cache un-cached input data. `handlePersistence` will be `true` by default, corresponding to existing behavior (always persisting uncached input), but users can achieve finer-grained control over input persistence by setting `handlePersistence` to `false`.
[jira] [Updated] (SPARK-21799) KMeans performance regression (5-6x slowdown) in Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Murching updated SPARK-21799: --- Description: I've been running KMeans performance tests using [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have noticed a regression (slowdowns of 5-6x) when running tests on large datasets in Spark 2.2 vs 2.1. The test params are: * Cluster: 510 GB RAM, 16 workers * Data: 100 examples, 1 features After talking to [~josephkb], the issue seems related to the changes in [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in [this PR|https://github.com/apache/spark/pull/16295]. It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so `handlePersistence` is true even when KMeans is run on a cached DataFrame. This unnecessarily causes another copy of the input dataset to be persisted. As of Spark 2.1 ([JIRA link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.storageLevel` returns the correct result after calling `df.cache()`, so I'd suggest replacing instances of `df.rdd.getStorageLevel` with `df.storageLevel` in MLlib algorithms (the same pattern shows up in LogisticRegression, LinearRegression, and others). I've verified this behavior in [this notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html] was: I've been running KMeans performance tests using [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have noticed a regression (slowdowns of 5-6x) when running tests on large datasets in Spark 2.2 vs 2.1. The test params are: * Cluster: 510 GB RAM, 16 workers * Data: 100 examples, 1 features After talking to [~josephkb], the issue seems related to the changes in [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in [this PR|https://github.com/apache/spark/pull/16295]. 
It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so `handlePersistence` is true even when KMeans is run on a cached DataFrame. This unnecessarily causes another copy of the input dataset to be persisted. As of Spark 2.1 ([JIRA link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.storageLevel` returns the correct result after calling `df.cache()`, so I'd suggest replacing instances of `df.rdd.storageLevel` with `df.storageLevel` in MLlib algorithms (the same pattern shows up in LogisticRegression, LinearRegression, and others). I've verified this behavior in [this notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html] > KMeans performance regression (5-6x slowdown) in Spark 2.2 > -- > > Key: SPARK-21799 > URL: https://issues.apache.org/jira/browse/SPARK-21799 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Siddharth Murching > > I've been running KMeans performance tests using > [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have > noticed a regression (slowdowns of 5-6x) when running tests on large datasets > in Spark 2.2 vs 2.1. > The test params are: > * Cluster: 510 GB RAM, 16 workers > * Data: 100 examples, 1 features > After talking to [~josephkb], the issue seems related to the changes in > [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in > [this PR|https://github.com/apache/spark/pull/16295]. > It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so > `handlePersistence` is true even when KMeans is run on a cached DataFrame. > This unnecessarily causes another copy of the input dataset to be persisted. 
> As of Spark 2.1 ([JIRA > link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.storageLevel` > returns the correct result after calling `df.cache()`, so I'd suggest > replacing instances of `df.rdd.getStorageLevel` with `df.storageLevel` in > MLlib algorithms (the same pattern shows up in LogisticRegression, > LinearRegression, and others). I've verified this behavior in [this > notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html]
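The mismatch described above can be sketched with mock objects (Python; `MockRDD` and `MockDataFrame` are hypothetical stand-ins for the Spark classes, with attribute behavior modeled on the report):

```python
# df.cache() marks the DataFrame itself as persisted, but df.rdd yields a
# freshly-derived RDD whose storage level is still NONE -- so a check
# against df.rdd always concludes the input is uncached and persists a
# second, redundant copy.

class MockRDD:
    def getStorageLevel(self):
        return "NONE"  # a freshly-derived df.rdd is never marked cached

class MockDataFrame:
    def __init__(self):
        self.storageLevel = "NONE"

    def cache(self):
        self.storageLevel = "MEMORY_AND_DISK"

    @property
    def rdd(self):
        return MockRDD()  # a new object on every access, like df.rdd

df = MockDataFrame()
df.cache()

# Spark 2.2-style check: misfires, triggering the double-caching slowdown.
handle_persistence_old = df.rdd.getStorageLevel() == "NONE"
# Suggested check: consult the DataFrame's own storage level (SPARK-16063).
handle_persistence_new = df.storageLevel == "NONE"
```

Here the old check still reports the cached DataFrame as uncached, while the suggested check does not.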
[jira] [Updated] (SPARK-21799) KMeans performance regression (5-6x slowdown) in Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Murching updated SPARK-21799: --- Description: I've been running KMeans performance tests using [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have noticed a regression (slowdowns of 5-6x) when running tests on large datasets in Spark 2.2 vs 2.1. The test params are: * Cluster: 510 GB RAM, 16 workers * Data: 100 examples, 1 features After talking to [~josephkb], the issue seems related to the changes in [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in [this PR|https://github.com/apache/spark/pull/16295]. It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so `handlePersistence` is true even when KMeans is run on a cached DataFrame. This unnecessarily causes another copy of the input dataset to be persisted. As of Spark 2.1 ([JIRA link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.storageLevel` returns the correct result after calling `df.cache()`, so I'd suggest replacing instances of `df.rdd.storageLevel` with `df.storageLevel` in MLlib algorithms (the same pattern shows up in LogisticRegression, LinearRegression, and others). I've verified this behavior in [this notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html] was: I've been running KMeans performance tests using [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have noticed a regression (slowdowns of 5-6x) when running tests on large datasets in Spark 2.2 vs 2.1. The test params are: * Cluster: 510 GB RAM, 16 workers * Data: 100 examples, 1 features After talking to [~josephkb], the issue seems related to the changes in [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in [this PR|https://github.com/apache/spark/pull/16295]. 
It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so `handlePersistence` is true even when KMeans is run on a cached DataFrame. This unnecessarily causes another copy of the input dataset to be persisted. As of Spark 2.1 ([JIRA link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.cache()` does set the public `df.storageLevel` member properly, so I'd suggest replacing instances of `df.rdd.storageLevel` with `df.storageLevel` in MLlib algorithms (the same pattern shows up in LogisticRegression, LinearRegression, and others). > KMeans performance regression (5-6x slowdown) in Spark 2.2 > -- > > Key: SPARK-21799 > URL: https://issues.apache.org/jira/browse/SPARK-21799 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Siddharth Murching > > I've been running KMeans performance tests using > [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have > noticed a regression (slowdowns of 5-6x) when running tests on large datasets > in Spark 2.2 vs 2.1. > The test params are: > * Cluster: 510 GB RAM, 16 workers > * Data: 100 examples, 1 features > After talking to [~josephkb], the issue seems related to the changes in > [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in > [this PR|https://github.com/apache/spark/pull/16295]. > It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so > `handlePersistence` is true even when KMeans is run on a cached DataFrame. > This unnecessarily causes another copy of the input dataset to be persisted. > As of Spark 2.1 ([JIRA > link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.storageLevel` > returns the correct result after calling `df.cache()`, so I'd suggest > replacing instances of `df.rdd.storageLevel` with `df.storageLevel` in MLlib > algorithms (the same pattern shows up in LogisticRegression, > LinearRegression, and others). 
I've verified this behavior in [this > notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html]
[jira] [Updated] (SPARK-21799) KMeans performance regression (5-6x slowdown) in Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Murching updated SPARK-21799: --- Summary: KMeans performance regression (5-6x slowdown) in Spark 2.2 (was: KMeans Performance Regression (5-6x slowdown) in Spark 2.2) > KMeans performance regression (5-6x slowdown) in Spark 2.2 > -- > > Key: SPARK-21799 > URL: https://issues.apache.org/jira/browse/SPARK-21799 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Siddharth Murching > > I've been running KMeans performance tests using > [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have > noticed a regression (slowdowns of 5-6x) when running tests on large datasets > in Spark 2.2 vs 2.1. > The test params are: > * Cluster: 510 GB RAM, 16 workers > * Data: 100 examples, 1 features > After talking to [~josephkb], the issue seems related to the changes in > [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in > [this PR|https://github.com/apache/spark/pull/16295]. > It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so > `handlePersistence` is true even when KMeans is run on a cached DataFrame. > This unnecessarily causes another copy of the input dataset to be persisted. > As of Spark 2.1 ([JIRA > link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.cache()` does > set the public `df.storageLevel` member properly, so I'd suggest replacing > instances of `df.rdd.storageLevel` with `df.storageLevel` in MLlib algorithms > (the same pattern shows up in LogisticRegression, LinearRegression, and > others).
[jira] [Updated] (SPARK-21799) KMeans Performance Regression (5-6x slowdown) in Spark 2.2
[ https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Murching updated SPARK-21799: --- Description: I've been running KMeans performance tests using [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have noticed a regression (slowdowns of 5-6x) when running tests on large datasets in Spark 2.2 vs 2.1. The test params are: * Cluster: 510 GB RAM, 16 workers * Data: 100 examples, 1 features After talking to [~josephkb], the issue seems related to the changes in [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in [this PR|https://github.com/apache/spark/pull/16295]. It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so `handlePersistence` is true even when KMeans is run on a cached DataFrame. This unnecessarily causes another copy of the input dataset to be persisted. As of Spark 2.1 ([JIRA link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.cache()` does set the public `df.storageLevel` member properly, so I'd suggest replacing instances of `df.rdd.storageLevel` with `df.storageLevel` in MLlib algorithms (the same pattern shows up in LogisticRegression, LinearRegression, and others). was: I've been running KMeans performance tests using [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have noticed a regression (slowdowns of 5-6x) when running tests on large datasets in Spark 2.2 vs 2.1. The test params are: * Cluster: 510 GB RAM, 16 workers * Data: 100 examples, 1 features After talking to [~josephkb], the issue seems related to the changes in [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in [this PR|https://github.com/apache/spark/pull/16295]. `df.cache()` doesn't set the storageLevel of `df.rdd`, so `handlePersistence` is true even when KMeans is run on a cached DataFrame. This unnecessarily causes another copy of the input dataset to be persisted. 
As of Spark 2.1 ([JIRA link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.cache()` does set the public `df.storageLevel` member properly, so I'd suggest replacing instances of `df.rdd.storageLevel` with `df.storageLevel` in MLlib algorithms (the same pattern shows up in LogisticRegression, LinearRegression, and others). > KMeans Performance Regression (5-6x slowdown) in Spark 2.2 > -- > > Key: SPARK-21799 > URL: https://issues.apache.org/jira/browse/SPARK-21799 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.2.0 >Reporter: Siddharth Murching > > I've been running KMeans performance tests using > [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have > noticed a regression (slowdowns of 5-6x) when running tests on large datasets > in Spark 2.2 vs 2.1. > The test params are: > * Cluster: 510 GB RAM, 16 workers > * Data: 100 examples, 1 features > After talking to [~josephkb], the issue seems related to the changes in > [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in > [this PR|https://github.com/apache/spark/pull/16295]. > It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so > `handlePersistence` is true even when KMeans is run on a cached DataFrame. > This unnecessarily causes another copy of the input dataset to be persisted. > As of Spark 2.1 ([JIRA > link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.cache()` does > set the public `df.storageLevel` member properly, so I'd suggest replacing > instances of `df.rdd.storageLevel` with `df.storageLevel` in MLlib algorithms > (the same pattern shows up in LogisticRegression, LinearRegression, and > others).
[jira] [Created] (SPARK-21799) KMeans Performance Regression (5-6x slowdown) in Spark 2.2
Siddharth Murching created SPARK-21799: -- Summary: KMeans Performance Regression (5-6x slowdown) in Spark 2.2 Key: SPARK-21799 URL: https://issues.apache.org/jira/browse/SPARK-21799 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.2.0 Reporter: Siddharth Murching I've been running KMeans performance tests using [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have noticed a regression (slowdowns of 5-6x) when running tests on large datasets in Spark 2.2 vs 2.1. The test params are: * Cluster: 510 GB RAM, 16 workers * Data: 100 examples, 1 features After talking to [~josephkb], the issue seems related to the changes in [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in [this PR|https://github.com/apache/spark/pull/16295]. `df.cache()` doesn't set the storageLevel of `df.rdd`, so `handlePersistence` is true even when KMeans is run on a cached DataFrame. This unnecessarily causes another copy of the input dataset to be persisted. As of Spark 2.1 ([JIRA link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.cache()` does set the public `df.storageLevel` member properly, so I'd suggest replacing instances of `df.rdd.storageLevel` with `df.storageLevel` in MLlib algorithms (the same pattern shows up in LogisticRegression, LinearRegression, and others).
[jira] [Comment Edited] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions
[ https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131868#comment-16131868 ] Siddharth Murching edited comment on SPARK-21770 at 8/18/17 7:48 AM: - Good question: * Predictions on all-zero input don't change (they remain 0 for RandomForestClassifier and DecisionTreeClassifier, which are the only models that call normalizeToProbabilitiesInPlace()) * This proposal seeks to make predicted probabilities more interpretable when raw model output is all-zero * Regardless, it currently seems impossible for normalizeToProbabilitiesInPlace to ever be called on all-zero input, since that'd mean a DecisionTree leaf node had a class count array (raw output) of all zeros. More detail: both DecisionTreeClassifier and RandomForestClassifier inherit Classifier's [implementation of raw2prediction()|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala#L221], which just takes an argmax ([preferring earlier maximal entries|https://github.com/apache/spark/blob/master/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala#L176]) over the model's output vector. A raw model output of all-equal entries would result in a prediction of 0 either way. was (Author: siddharth murching): Good question: * Predictions on all-zero input don't change (they remain 0 for RandomForestClassifier and DecisionTreeClassifier, which are the only models that call normalizeToProbabilitiesInPlace()) * This proposal seeks to make predicted probabilities more interpretable when raw model output is all-zero * Regardless, it currently seems impossible for normalizeToProbabilitiesInPlace to ever be called on all-zero input, since that'd mean a DecisionTree leaf node had a class count array (raw output) of all zeros. 
Specifically, both DecisionTreeClassifier and RandomForestClassifier inherit Classifier's [implementation of raw2prediction()|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala#L221], which just takes an argmax ([preferring earlier maximal entries|https://github.com/apache/spark/blob/master/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala#L176]) over the model's output vector. A raw model output of all-equal entries would result in a prediction of 0 either way. > ProbabilisticClassificationModel: Improve normalization of all-zero raw > predictions > --- > > Key: SPARK-21770 > URL: https://issues.apache.org/jira/browse/SPARK-21770 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Siddharth Murching >Priority: Minor > > Given an n-element raw prediction vector of all-zeros, > ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output > a probability vector of all-equal 1/n entries
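The tie-breaking behavior can be mirrored in a few lines (a Python sketch of the argmax rule described above, not the Spark implementation itself):

```python
def argmax_prefer_first(values):
    """Return the index of the maximum, preferring the earliest maximal
    entry -- the tie-breaking described for Vectors.argmax above."""
    best_idx = 0
    for i, v in enumerate(values):
        if v > values[best_idx]:  # strict '>' keeps the earliest max on ties
            best_idx = i
    return best_idx
```

So an all-zero (or any all-equal) raw output predicts class 0 whether or not the probability vector is normalized to uniform, which is why the prediction itself is unaffected by this proposal.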
[jira] [Commented] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions
[ https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131868#comment-16131868 ] Siddharth Murching commented on SPARK-21770: Good question: * Predictions on all-zero input don't change (they remain 0 for RandomForestClassifier and DecisionTreeClassifier, which are the only models that call normalizeToProbabilitiesInPlace()) * This proposal seeks to make predicted probabilities more interpretable when raw model output is all-zero * Regardless, it currently seems impossible for normalizeToProbabilitiesInPlace to ever be called on all-zero input, since that'd mean a DecisionTree leaf node had a class count array (raw output) of all zeros. Specifically, both DecisionTreeClassifier and RandomForestClassifier inherit Classifier's [implementation of raw2prediction()|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala#L221], which just takes an argmax ([preferring earlier maximal entries|https://github.com/apache/spark/blob/master/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala#L176]) over the model's output vector. A raw model output of all-equal entries would result in a prediction of 0 either way. > ProbabilisticClassificationModel: Improve normalization of all-zero raw > predictions > --- > > Key: SPARK-21770 > URL: https://issues.apache.org/jira/browse/SPARK-21770 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Siddharth Murching >Priority: Minor > > Given an n-element raw prediction vector of all-zeros, > ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output > a probability vector of all-equal 1/n entries
[jira] [Updated] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions
[ https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Murching updated SPARK-21770: --- Description: Given an n-element raw prediction vector of all-zeros, ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output a probability vector of all-equal 1/n entries (was: Given a raw prediction vector of all-zeros, ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output a probability vector that predicts each class with equal probability (1 / numClasses).) > ProbabilisticClassificationModel: Improve normalization of all-zero raw > predictions > --- > > Key: SPARK-21770 > URL: https://issues.apache.org/jira/browse/SPARK-21770 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Siddharth Murching >Priority: Minor > > Given an n-element raw prediction vector of all-zeros, > ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output > a probability vector of all-equal 1/n entries
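The proposed behavior amounts to the following (a Python sketch; the real method mutates a vector in place on ProbabilisticClassificationModel, and the function name here is illustrative):

```python
def normalize_to_probabilities(raw):
    """Map a raw prediction vector to probabilities; an all-zero input
    yields the uniform distribution 1/n, as proposed above."""
    n = len(raw)
    total = sum(raw)
    if total == 0:
        return [1.0 / n] * n  # proposed: all-equal 1/n entries
    return [v / total for v in raw]
```

The non-zero branch is the existing sum-normalization; only the all-zero case changes under this proposal.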
[jira] [Created] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions
Siddharth Murching created SPARK-21770: -- Summary: ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions Key: SPARK-21770 URL: https://issues.apache.org/jira/browse/SPARK-21770 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.3.0 Reporter: Siddharth Murching Priority: Minor Given a raw prediction vector of all-zeros, ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output a probability vector that predicts each class with equal probability (1 / numClasses). -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434154#comment-15434154 ] Siddharth Murching commented on SPARK-3162: --- Here's a design doc with proposed changes - any comments/feedback are much appreciated :) https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/ > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Siddharth Murching updated SPARK-3162: -- Comment: was deleted (was: Here's a design doc with proposed changes - any comments/feedback are much appreciated :) https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing ) > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434031#comment-15434031 ] Siddharth Murching edited comment on SPARK-3162 at 8/24/16 1:37 AM: Here's a design doc with proposed changes - any comments/feedback are much appreciated :) Design doc link: [Link|https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing] was (Author: siddharth murching): Here's a design doc with proposed changes - any comments/feedback are much appreciated :) [Link|https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing] > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434031#comment-15434031 ] Siddharth Murching commented on SPARK-3162: --- Here's a design doc with proposed changes - any comments/feedback are much appreciated :) [Link](https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing) > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible
[ https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15414484#comment-15414484 ] Siddharth Murching commented on SPARK-3162: --- [~josephkb] I'd like to work on this if possible - thanks! > Train DecisionTree locally when possible > > > Key: SPARK-3162 > URL: https://issues.apache.org/jira/browse/SPARK-3162 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Critical > > Improvement: communication > Currently, every level of a DecisionTree is trained in a distributed manner. > However, at deeper levels in the tree, it is possible that a small set of > training data will be matched with any given node. If the node’s training > data can fit on one machine’s memory, it may be more efficient to shuffle the > data and do local training for the rest of the subtree rooted at that node. > Note: It is possible that local training would become possible at different > levels in different branches of the tree. There are multiple options for > handling this case: > (1) Train in a distributed fashion until all remaining nodes can be trained > locally. This would entail training multiple levels at once (locally). > (2) Train branches locally when possible, and interleave this with > distributed training of the other branches. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
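The core decision described in the issue (switch a subtree to local training once its matched rows fit in one machine's memory) can be sketched as follows. This is a hypothetical helper with assumed names and a simple bytes-based size estimate, not Spark's implementation, illustrating how nodes would be partitioned between local and distributed training (option 2 in the description).

```python
def plan_training(node_row_counts, bytes_per_row, local_memory_bytes):
    """Partition tree nodes into local vs. distributed training sets.

    A node qualifies for local training when its matched training rows
    are estimated to fit in a single machine's memory; otherwise its
    subtree keeps training in a distributed fashion.
    """
    local, distributed = [], []
    for node, rows in node_row_counts.items():
        if rows * bytes_per_row <= local_memory_bytes:
            local.append(node)
        else:
            distributed.append(node)
    return local, distributed


# Deeper nodes match fewer rows, so they tend to qualify for local training.
local, distributed = plan_training(
    {"root_child": 1_000_000, "deep_leaf": 500},
    bytes_per_row=100,
    local_memory_bytes=1_000_000,
)
print(local)        # ['deep_leaf']
print(distributed)  # ['root_child']
```

Because different branches cross this threshold at different depths, the two options in the description differ in when this check runs: option (1) waits until every remaining node qualifies, while option (2) applies it per branch and interleaves local and distributed work.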