[jira] [Commented] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images

2018-01-24 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338360#comment-16338360
 ] 

Siddharth Murching commented on SPARK-23205:


Working on a PR to address this issue

> ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel 
> images
> 
>
> Key: SPARK-23205
> URL: https://issues.apache.org/jira/browse/SPARK-23205
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Critical
>
> When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color 
> constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)]
>  that sets alpha = 255, even for four-channel images.
> See the offending line here: 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172
> A fix is to simply update the line to: 
> val color = new Color(img.getRGB(w, h), nChannels == 4)
> instead of
> val color = new Color(img.getRGB(w, h))
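
For context, below is a minimal Scala sketch of the decode loop with the
proposed change applied. The method shape, the `decodePixels` name, and the
BGR(A) channel ordering are illustrative assumptions (modeled on ImageSchema's
OpenCV-compatible layout); only the two-argument Color constructor call comes
from the ticket:

{code:scala}
import java.awt.Color
import java.awt.image.BufferedImage

def decodePixels(img: BufferedImage, nChannels: Int): Array[Byte] = {
  val height = img.getHeight
  val width = img.getWidth
  val data = new Array[Byte](height * width * nChannels)
  var offset = 0
  for (h <- 0 until height; w <- 0 until width) {
    // hasalpha = (nChannels == 4) makes java.awt.Color keep the image's real
    // alpha byte instead of defaulting it to 255.
    val color = new Color(img.getRGB(w, h), nChannels == 4)
    data(offset) = color.getBlue.toByte // OpenCV-style BGR(A) ordering
    data(offset + 1) = color.getGreen.toByte
    data(offset + 2) = color.getRed.toByte
    if (nChannels == 4) data(offset + 3) = color.getAlpha.toByte
    offset += nChannels
  }
  data
}
{code}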






[jira] [Comment Edited] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images

2018-01-24 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16338360#comment-16338360
 ] 

Siddharth Murching edited comment on SPARK-23205 at 1/24/18 10:40 PM:
--

I'm working on a PR to address this issue if that's alright :)


was (Author: siddharth murching):
Working on a PR to address this issue

> ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel 
> images
> 
>
> Key: SPARK-23205
> URL: https://issues.apache.org/jira/browse/SPARK-23205
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Critical
>
> When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color 
> constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)]
>  that sets alpha = 255, even for four-channel images.
> See the offending line here: 
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172
> A fix is to simply update the line to: 
> val color = new Color(img.getRGB(w, h), nChannels == 4)
> instead of
> val color = new Color(img.getRGB(w, h))






[jira] [Created] (SPARK-23205) ImageSchema.readImages incorrectly sets alpha channel to 255 for four-channel images

2018-01-24 Thread Siddharth Murching (JIRA)
Siddharth Murching created SPARK-23205:
--

 Summary: ImageSchema.readImages incorrectly sets alpha channel to 
255 for four-channel images
 Key: SPARK-23205
 URL: https://issues.apache.org/jira/browse/SPARK-23205
 Project: Spark
  Issue Type: Bug
  Components: ML, MLlib
Affects Versions: 2.3.0
Reporter: Siddharth Murching


When parsing raw image data in ImageSchema.decode(), we use a [java.awt.Color 
constructor|https://docs.oracle.com/javase/7/docs/api/java/awt/Color.html#Color(int)]
 that sets alpha = 255, even for four-channel images.

See the offending line here: 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/image/ImageSchema.scala#L172

A fix is to simply update the line to: 

val color = new Color(img.getRGB(w, h), nChannels == 4)


instead of

val color = new Color(img.getRGB(w, h))






[jira] [Comment Edited] (SPARK-3162) Train DecisionTree locally when possible

2017-10-04 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192201#comment-16192201
 ] 

Siddharth Murching edited comment on SPARK-3162 at 10/4/17 11:35 PM:
-

Commenting here to note that I'd like to resume work on this issue; I've made a 
new PR^


was (Author: siddharth murching):
Commenting here to note that I'm resuming work on this issue; I've made a new 
PR^

> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, only a small subset of the training 
> data may reach any given node.  If a node’s training data can fit in one 
> machine’s memory, it may be more efficient to shuffle that data and train the 
> rest of the subtree rooted at that node locally.
> Note: local training may become feasible at different levels in different 
> branches of the tree.  There are multiple options for handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.
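
To make option (1) concrete, here is a hedged sketch of the control flow.
`NodeData`, `estimateNodeDataBytes`, `trainLocally`, and
`trainOneDistributedLevel` are illustrative placeholders, not Spark APIs:

{code:scala}
case class NodeData(nodeId: Int, numRows: Long, numFeatures: Int)

// Rough size estimate, assuming 8 bytes per double-valued feature.
def estimateNodeDataBytes(node: NodeData): Long =
  node.numRows * node.numFeatures * 8L

def trainLocally(node: NodeData): Unit = {
  // Collect this node's rows onto one machine and grow the whole subtree there.
}

def trainOneDistributedLevel(nodes: Seq[NodeData]): Seq[NodeData] = {
  // One distributed split-finding pass; returns the resulting child nodes.
  nodes
}

def trainSubtrees(remaining: Seq[NodeData], maxLocalBytes: Long): Unit = {
  val (local, distributed) =
    remaining.partition(n => estimateNodeDataBytes(n) <= maxLocalBytes)
  local.foreach(trainLocally) // node's data fits in memory: finish it locally
  if (distributed.nonEmpty) {
    trainSubtrees(trainOneDistributedLevel(distributed), maxLocalBytes)
  }
}
{code}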






[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible

2017-10-04 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16192201#comment-16192201
 ] 

Siddharth Murching commented on SPARK-3162:
---

Commenting here to note that I'm resuming work on this issue; I've made a new 
PR^

> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, only a small subset of the training 
> data may reach any given node.  If a node’s training data can fit in one 
> machine’s memory, it may be more efficient to shuffle that data and train the 
> rest of the subtree rooted at that node locally.
> Note: local training may become feasible at different levels in different 
> branches of the tree.  There are multiple options for handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.






[jira] [Commented] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-11 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16161634#comment-16161634
 ] 

Siddharth Murching commented on SPARK-21972:


Link to old PR containing work on this: 
https://github.com/apache/spark/pull/17014

> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately 
> ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) 
> check the persistence level of the input dataset but not any of its parents. 
> These issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.
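
As a rough illustration of the proposal, the param could be wired up as in the
sketch below. The `HasHandlePersistence` trait name and the fit()-time check
are assumptions in the style of existing org.apache.spark.ml.param usage, not
a committed API:

{code:scala}
import org.apache.spark.ml.param.{BooleanParam, Params}

trait HasHandlePersistence extends Params {
  final val handlePersistence: BooleanParam = new BooleanParam(this,
    "handlePersistence",
    "whether the algorithm should cache un-cached input data (default: true)")

  setDefault(handlePersistence -> true)

  final def getHandlePersistence: Boolean = $(handlePersistence)
}

// Inside an Estimator's fit(), the param would gate the existing cache() call,
// e.g. (hypothetical):
//   if ($(handlePersistence) && dataset.storageLevel == StorageLevel.NONE) {
//     instances.persist(StorageLevel.MEMORY_AND_DISK)
//   }
{code}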






[jira] [Issue Comment Deleted] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-11 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21972:
---
Comment: was deleted

(was: This issue was originally being worked on in this PR: 
[https://github.com/apache/spark/pull/17014|https://github.com/apache/spark/pull/17014])

> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately 
> ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) 
> check the persistence level of the input dataset but not any of its parents. 
> These issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.






[jira] [Comment Edited] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-11 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160624#comment-16160624
 ] 

Siddharth Murching edited comment on SPARK-21972 at 9/11/17 5:22 PM:
-

This issue was originally being worked on in this PR: 
[https://github.com/apache/spark/pull/17014|https://github.com/apache/spark/pull/17014]


was (Author: siddharth murching):
This issue is being worked on in this PR: 
[https://github.com/apache/spark/pull/17014|https://github.com/apache/spark/pull/17014]

> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately 
> ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) 
> check the persistence level of the input dataset but not any of its parents. 
> These issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.






[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21972:
---
Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call {{cache()}} on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately 
([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents. These 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param 
(org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
should try to cache un-cached input data. {{handlePersistence}} will be 
{{true}} by default, corresponding to existing behavior (always persisting 
uncached input), but users can achieve finer-grained control over input 
persistence by setting {{handlePersistence}} to {{false}}.

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call {{cache()}} on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents. These 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param 
(org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
should try to cache un-cached input data. {{handlePersistence}} will be 
{{true}} by default, corresponding to existing behavior (always persisting 
uncached input), but users can achieve finer-grained control over input 
persistence by setting {{handlePersistence}} to {{false}}.


> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately 
> ([SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) 
> check the persistence level of the input dataset but not any of its parents. 
> These issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.






[jira] [Commented] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160624#comment-16160624
 ] 

Siddharth Murching commented on SPARK-21972:


Work has already begun on this in this PR: 
[https://github.com/apache/spark/pull/17014|https://github.com/apache/spark/pull/17014]

> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately (see 
> [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
> the persistence level of the input dataset but not any of its parents. These 
> issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.






[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21972:
---
Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call {{cache()}} on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents. These 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param 
(org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
should try to cache un-cached input data. {{handlePersistence}} will be 
{{true}} by default, corresponding to existing behavior (always persisting 
uncached input), but users can achieve finer-grained control over input 
persistence by setting {{handlePersistence}} to {{false}}.

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call {{cache()}} on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents. These 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param 
(org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
should try to cache un-cached input data. {{handlePersistence}} will be 
{{true}} by default, corresponding to existing behavior (always persisting 
uncached input), but users can achieve finer-grained control over input 
persistence by setting {{handlePersistence}} to {{false}} (algorithms will not 
try to persist uncached input).


> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately (see 
> [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
> the persistence level of the input dataset but not any of its parents. These 
> issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.






[jira] [Comment Edited] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16160624#comment-16160624
 ] 

Siddharth Murching edited comment on SPARK-21972 at 9/11/17 3:46 AM:
-

This issue is being worked on in this PR: 
[https://github.com/apache/spark/pull/17014|https://github.com/apache/spark/pull/17014]


was (Author: siddharth murching):
Work has already begun on this in this PR: 
[https://github.com/apache/spark/pull/17014|https://github.com/apache/spark/pull/17014]

> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately (see 
> [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
> the persistence level of the input dataset but not any of its parents. These 
> issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.






[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21972:
---
Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call {{cache()}} on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents. These 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param 
(org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
should try to cache un-cached input data. {{handlePersistence}} will be 
{{true}} by default, corresponding to existing behavior (always persisting 
uncached input), but users can achieve finer-grained control over input 
persistence by setting {{handlePersistence}} to {{false}} (algorithms will not 
try to persist uncached input).

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call {{cache()}} on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents; these 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param 
(org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
should try to cache un-cached input data. {{handlePersistence}} will be 
{{true}} by default, corresponding to existing behavior (always persisting 
uncached input), but users can achieve finer-grained control over input 
persistence by setting {{handlePersistence}} to {{false}}.


> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately (see 
> [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
> the persistence level of the input dataset but not any of its parents. These 
> issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}} (algorithms will 
> not try to persist uncached input).






[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21972:
---
Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call {{cache()}} on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents; these 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean {{handlePersistence}} param 
(org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
should try to cache un-cached input data. {{handlePersistence}} will be 
{{true}} by default, corresponding to existing behavior (always persisting 
uncached input), but users can achieve finer-grained control over input 
persistence by setting {{handlePersistence}} to {{false}}.

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call `cache()` on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents; these 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean `handlePersistence` param 
(org.apache.spark.ml.param) to the abovementioned estimators so that users can 
specify whether an ML algorithm should try to cache un-cached input data. 
`handlePersistence` will be `true` by default, corresponding to existing 
behavior (always persisting uncached input), but users can achieve 
finer-grained control over input persistence by setting `handlePersistence` to 
`false`.


> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call {{cache()}} on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately (see 
> [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
> the persistence level of the input dataset but not any of its parents; these 
> issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean {{handlePersistence}} param 
> (org.apache.spark.ml.param) so that users can specify whether an ML algorithm 
> should try to cache un-cached input data. {{handlePersistence}} will be 
> {{true}} by default, corresponding to existing behavior (always persisting 
> uncached input), but users can achieve finer-grained control over input 
> persistence by setting {{handlePersistence}} to {{false}}.






[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21972:
---
Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call `cache()` on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents; these 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799].

This ticket proposes adding a boolean `handlePersistence` param 
(org.apache.spark.ml.param) to the abovementioned estimators so that users can 
specify whether an ML algorithm should try to cache un-cached input data. 
`handlePersistence` will be `true` by default, corresponding to existing 
behavior (always persisting uncached input), but users can achieve 
finer-grained control over input persistence by setting `handlePersistence` to 
`false`.

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call `cache()` on uncached input datasets to improve performance. 
Unfortunately, these algorithms a) check input persistence inaccurately (as 
described in [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) 
and b) check the persistence level of the input dataset but not any of its 
parents; both of these issues can result in unwanted double-caching of input 
data & degraded performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799].

This ticket proposes adding a boolean `handlePersistence` param 
(org.apache.spark.ml.param) to the abovementioned estimators so that users can 
specify whether an ML algorithm should try to cache un-cached input data. 
`handlePersistence` will be `true` by default, corresponding to existing 
behavior (always persisting uncached input), but users can achieve 
finer-grained control over input persistence by setting `handlePersistence` to 
`false`.


> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call `cache()` on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately (see 
> [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
> the persistence level of the input dataset but not any of its parents; these 
> issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799].
> This ticket proposes adding a boolean `handlePersistence` param 
> (org.apache.spark.ml.param) to the abovementioned estimators so that users 
> can specify whether an ML algorithm should try to cache un-cached input data. 
> `handlePersistence` will be `true` by default, corresponding to existing 
> behavior (always persisting uncached input), but users can achieve 
> finer-grained control over input persistence by setting `handlePersistence` 
> to `false`.






[jira] [Updated] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21972:
---
Description: 
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call `cache()` on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents; these 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).

This ticket proposes adding a boolean `handlePersistence` param 
(org.apache.spark.ml.param) to the abovementioned estimators so that users can 
specify whether an ML algorithm should try to cache un-cached input data. 
`handlePersistence` will be `true` by default, corresponding to existing 
behavior (always persisting uncached input), but users can achieve 
finer-grained control over input persistence by setting `handlePersistence` to 
`false`.

  was:
Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call `cache()` on uncached input datasets to improve performance.

Unfortunately, these algorithms a) check input persistence inaccurately (see 
[SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
the persistence level of the input dataset but not any of its parents; these 
issues can result in unwanted double-caching of input data & degraded 
performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799].

This ticket proposes adding a boolean `handlePersistence` param 
(org.apache.spark.ml.param) to the abovementioned estimators so that users can 
specify whether an ML algorithm should try to cache un-cached input data. 
`handlePersistence` will be `true` by default, corresponding to existing 
behavior (always persisting uncached input), but users can achieve 
finer-grained control over input persistence by setting `handlePersistence` to 
`false`.


> Allow users to control input data persistence in ML Estimators via a 
> handlePersistence ml.Param
> ---
>
> Key: SPARK-21972
> URL: https://issues.apache.org/jira/browse/SPARK-21972
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, 
> etc) call `cache()` on uncached input datasets to improve performance.
> Unfortunately, these algorithms a) check input persistence inaccurately (see 
> [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) and b) check 
> the persistence level of the input dataset but not any of its parents; these 
> issues can result in unwanted double-caching of input data & degraded 
> performance (see 
> [SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799]).
> This ticket proposes adding a boolean `handlePersistence` param 
> (org.apache.spark.ml.param) to the abovementioned estimators so that users 
> can specify whether an ML algorithm should try to cache un-cached input data. 
> `handlePersistence` will be `true` by default, corresponding to existing 
> behavior (always persisting uncached input), but users can achieve 
> finer-grained control over input persistence by setting `handlePersistence` 
> to `false`.






[jira] [Created] (SPARK-21972) Allow users to control input data persistence in ML Estimators via a handlePersistence ml.Param

2017-09-10 Thread Siddharth Murching (JIRA)
Siddharth Murching created SPARK-21972:
--

 Summary: Allow users to control input data persistence in ML 
Estimators via a handlePersistence ml.Param
 Key: SPARK-21972
 URL: https://issues.apache.org/jira/browse/SPARK-21972
 Project: Spark
  Issue Type: Improvement
  Components: ML, MLlib
Affects Versions: 2.2.0
Reporter: Siddharth Murching


Several Spark ML algorithms (LogisticRegression, LinearRegression, KMeans, etc) 
call `cache()` on uncached input datasets to improve performance. 
Unfortunately, these algorithms a) check input persistence inaccurately (as 
described in [SPARK-18608|https://issues.apache.org/jira/browse/SPARK-18608]) 
and b) check the persistence level of the input dataset but not any of its 
parents; both of these issues can result in unwanted double-caching of input 
data & degraded performance (see 
[SPARK-21799|https://issues.apache.org/jira/browse/SPARK-21799].

This ticket proposes adding a boolean `handlePersistence` param 
(org.apache.spark.ml.param) to the abovementioned estimators so that users can 
specify whether an ML algorithm should try to cache un-cached input data. 
`handlePersistence` will be `true` by default, corresponding to existing 
behavior (always persisting uncached input), but users can achieve 
finer-grained control over input persistence by setting `handlePersistence` to 
`false`.






[jira] [Updated] (SPARK-21799) KMeans performance regression (5-6x slowdown) in Spark 2.2

2017-08-21 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21799:
---
Description: 
I've been running KMeans performance tests using 
[spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have noticed 
a regression (slowdowns of 5-6x) when running tests on large datasets in Spark 
2.2 vs 2.1.

The test params are:
* Cluster: 510 GB RAM, 16 workers
* Data: 100 examples, 1 features

After talking to [~josephkb], the issue seems related to the changes in 
[SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
[this PR|https://github.com/apache/spark/pull/16295].

It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so 
`handlePersistence` is true even when KMeans is run on a cached DataFrame. This 
unnecessarily causes another copy of the input dataset to be persisted.

As of Spark 2.1 ([JIRA link|https://issues.apache.org/jira/browse/SPARK-16063]) 
`df.storageLevel` returns the correct result after calling `df.cache()`, so I'd 
suggest replacing instances of `df.rdd.getStorageLevel` with `df.storageLevel` 
in MLlib algorithms (the same pattern shows up in LogisticRegression, 
LinearRegression, and others). I've verified this behavior in [this 
notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html]



  was:
I've been running KMeans performance tests using 
[spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have noticed 
a regression (slowdowns of 5-6x) when running tests on large datasets in Spark 
2.2 vs 2.1.

The test params are:
* Cluster: 510 GB RAM, 16 workers
* Data: 100 examples, 1 features

After talking to [~josephkb], the issue seems related to the changes in 
[SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
[this PR|https://github.com/apache/spark/pull/16295].

It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so 
`handlePersistence` is true even when KMeans is run on a cached DataFrame. This 
unnecessarily causes another copy of the input dataset to be persisted.

As of Spark 2.1 ([JIRA link|https://issues.apache.org/jira/browse/SPARK-16063]) 
`df.storageLevel` returns the correct result after calling `df.cache()`, so I'd 
suggest replacing instances of `df.rdd.storageLevel` with `df.storageLevel` in 
MLlib algorithms (the same pattern shows up in LogisticRegression, 
LinearRegression, and others). I've verified this behavior in [this 
notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html]




> KMeans performance regression (5-6x slowdown) in Spark 2.2
> --
>
> Key: SPARK-21799
> URL: https://issues.apache.org/jira/browse/SPARK-21799
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> I've been running KMeans performance tests using 
> [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have 
> noticed a regression (slowdowns of 5-6x) when running tests on large datasets 
> in Spark 2.2 vs 2.1.
> The test params are:
> * Cluster: 510 GB RAM, 16 workers
> * Data: 100 examples, 1 features
> After talking to [~josephkb], the issue seems related to the changes in 
> [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
> [this PR|https://github.com/apache/spark/pull/16295].
> It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so 
> `handlePersistence` is true even when KMeans is run on a cached DataFrame. 
> This unnecessarily causes another copy of the input dataset to be persisted.
> As of Spark 2.1 ([JIRA 
> link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.storageLevel` 
> returns the correct result after calling `df.cache()`, so I'd suggest 
> replacing instances of `df.rdd.getStorageLevel` with `df.storageLevel` in 
> MLlib algorithms (the same pattern shows up in LogisticRegression, 
> LinearRegression, and others). I've verified this behavior in [this 
> notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html]
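
The suggested change boils down to a one-line replacement; a hedged sketch
(the `needsCaching` wrapper is just for illustration):

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel

// df.cache() does not update the storage level of the RDD returned by df.rdd
// (each df.rdd call derives a fresh, uncached RDD), so the old check reports
// cached DataFrames as uncached. df.storageLevel is accurate as of Spark 2.1.
def needsCaching(df: DataFrame): Boolean = {
  // Before: df.rdd.getStorageLevel == StorageLevel.NONE
  df.storageLevel == StorageLevel.NONE
}
{code}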






[jira] [Updated] (SPARK-21799) KMeans performance regression (5-6x slowdown) in Spark 2.2

2017-08-21 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21799:
---
Description: 
I've been running KMeans performance tests using 
[spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have noticed 
a regression (slowdowns of 5-6x) when running tests on large datasets in Spark 
2.2 vs 2.1.

The test params are:
* Cluster: 510 GB RAM, 16 workers
* Data: 100 examples, 1 features

After talking to [~josephkb], the issue seems related to the changes in 
[SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
[this PR|https://github.com/apache/spark/pull/16295].

It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so 
`handlePersistence` is true even when KMeans is run on a cached DataFrame. This 
unnecessarily causes another copy of the input dataset to be persisted.

As of Spark 2.1 ([JIRA link|https://issues.apache.org/jira/browse/SPARK-16063]) 
`df.storageLevel` returns the correct result after calling `df.cache()`, so I'd 
suggest replacing instances of `df.rdd.storageLevel` with `df.storageLevel` in 
MLlib algorithms (the same pattern shows up in LogisticRegression, 
LinearRegression, and others). I've verified this behavior in [this 
notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html]



  was:
I've been running KMeans performance tests using 
[spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have noticed 
a regression (slowdowns of 5-6x) when running tests on large datasets in Spark 
2.2 vs 2.1.

The test params are:
* Cluster: 510 GB RAM, 16 workers
* Data: 100 examples, 1 features

After talking to [~josephkb], the issue seems related to the changes in 
[SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
[this PR|https://github.com/apache/spark/pull/16295].

It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so 
`handlePersistence` is true even when KMeans is run on a cached DataFrame. This 
unnecessarily causes another copy of the input dataset to be persisted.

As of Spark 2.1 ([JIRA link|https://issues.apache.org/jira/browse/SPARK-16063]) 
`df.cache()` does set the public `df.storageLevel` member properly, so I'd 
suggest replacing instances of `df.rdd.storageLevel` with `df.storageLevel` in 
MLlib algorithms (the same pattern shows up in LogisticRegression, 
LinearRegression, and others).




> KMeans performance regression (5-6x slowdown) in Spark 2.2
> --
>
> Key: SPARK-21799
> URL: https://issues.apache.org/jira/browse/SPARK-21799
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> I've been running KMeans performance tests using 
> [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have 
> noticed a regression (slowdowns of 5-6x) when running tests on large datasets 
> in Spark 2.2 vs 2.1.
> The test params are:
> * Cluster: 510 GB RAM, 16 workers
> * Data: 100 examples, 1 features
> After talking to [~josephkb], the issue seems related to the changes in 
> [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
> [this PR|https://github.com/apache/spark/pull/16295].
> It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so 
> `handlePersistence` is true even when KMeans is run on a cached DataFrame. 
> This unnecessarily causes another copy of the input dataset to be persisted.
> As of Spark 2.1 ([JIRA 
> link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.storageLevel` 
> returns the correct result after calling `df.cache()`, so I'd suggest 
> replacing instances of `df.rdd.storageLevel` with `df.storageLevel` in MLlib 
> algorithms (the same pattern shows up in LogisticRegression, 
> LinearRegression, and others). I've verified this behavior in [this 
> notebook|https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5211178207246023/950505630032626/7788830288800223/latest.html]






[jira] [Updated] (SPARK-21799) KMeans performance regression (5-6x slowdown) in Spark 2.2

2017-08-21 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21799:
---
Summary: KMeans performance regression (5-6x slowdown) in Spark 2.2  (was: 
KMeans Performance Regression (5-6x slowdown) in Spark 2.2)

> KMeans performance regression (5-6x slowdown) in Spark 2.2
> --
>
> Key: SPARK-21799
> URL: https://issues.apache.org/jira/browse/SPARK-21799
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> I've been running KMeans performance tests using 
> [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have 
> noticed a regression (slowdowns of 5-6x) when running tests on large datasets 
> in Spark 2.2 vs 2.1.
> The test params are:
> * Cluster: 510 GB RAM, 16 workers
> * Data: 100 examples, 1 features
> After talking to [~josephkb], the issue seems related to the changes in 
> [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
> [this PR|https://github.com/apache/spark/pull/16295].
> It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so 
> `handlePersistence` is true even when KMeans is run on a cached DataFrame. 
> This unnecessarily causes another copy of the input dataset to be persisted.
> As of Spark 2.1 ([JIRA 
> link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.cache()` does 
> set the public `df.storageLevel` member properly, so I'd suggest replacing 
> instances of `df.rdd.storageLevel` with `df.storageLevel` in MLlib algorithms 
> (the same pattern shows up in LogisticRegression, LinearRegression, and 
> others).






[jira] [Updated] (SPARK-21799) KMeans Performance Regression (5-6x slowdown) in Spark 2.2

2017-08-21 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21799:
---
Description: 
I've been running KMeans performance tests using 
[spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have noticed 
a regression (slowdowns of 5-6x) when running tests on large datasets in Spark 
2.2 vs 2.1.

The test params are:
* Cluster: 510 GB RAM, 16 workers
* Data: 100 examples, 1 features

After talking to [~josephkb], the issue seems related to the changes in 
[SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
[this PR|https://github.com/apache/spark/pull/16295].

It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so 
`handlePersistence` is true even when KMeans is run on a cached DataFrame. This 
unnecessarily causes another copy of the input dataset to be persisted.

As of Spark 2.1 ([JIRA link|https://issues.apache.org/jira/browse/SPARK-16063]) 
`df.cache()` does set the public `df.storageLevel` member properly, so I'd 
suggest replacing instances of `df.rdd.storageLevel` with `df.storageLevel` in 
MLlib algorithms (the same pattern shows up in LogisticRegression, 
LinearRegression, and others).



  was:
I've been running KMeans performance tests using 
[spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have noticed 
a regression (slowdowns of 5-6x) when running tests on large datasets in Spark 
2.2 vs 2.1.

The test params are:
* Cluster: 510 GB RAM, 16 workers
* Data: 100 examples, 1 features

After talking to [~josephkb], the issue seems related to the changes in 
[SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
[this PR|https://github.com/apache/spark/pull/16295].

`df.cache()` doesn't set the storageLevel of `df.rdd`, so `handlePersistence` 
is true even when KMeans is run on a cached DataFrame. This unnecessarily 
causes another copy of the input dataset to be persisted.

As of Spark 2.1 ([JIRA link|https://issues.apache.org/jira/browse/SPARK-16063]) 
`df.cache()` does set the public `df.storageLevel` member properly, so I'd 
suggest replacing instances of `df.rdd.storageLevel` with `df.storageLevel` in 
MLlib algorithms (the same pattern shows up in LogisticRegression, 
LinearRegression, and others).


> KMeans Performance Regression (5-6x slowdown) in Spark 2.2
> --
>
> Key: SPARK-21799
> URL: https://issues.apache.org/jira/browse/SPARK-21799
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Siddharth Murching
>
> I've been running KMeans performance tests using 
> [spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have 
> noticed a regression (slowdowns of 5-6x) when running tests on large datasets 
> in Spark 2.2 vs 2.1.
> The test params are:
> * Cluster: 510 GB RAM, 16 workers
> * Data: 100 examples, 1 features
> After talking to [~josephkb], the issue seems related to the changes in 
> [SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
> [this PR|https://github.com/apache/spark/pull/16295].
> It seems `df.cache()` doesn't set the storageLevel of `df.rdd`, so 
> `handlePersistence` is true even when KMeans is run on a cached DataFrame. 
> This unnecessarily causes another copy of the input dataset to be persisted.
> As of Spark 2.1 ([JIRA 
> link|https://issues.apache.org/jira/browse/SPARK-16063]) `df.cache()` does 
> set the public `df.storageLevel` member properly, so I'd suggest replacing 
> instances of `df.rdd.storageLevel` with `df.storageLevel` in MLlib algorithms 
> (the same pattern shows up in LogisticRegression, LinearRegression, and 
> others).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21799) KMeans Performance Regression (5-6x slowdown) in Spark 2.2

2017-08-21 Thread Siddharth Murching (JIRA)
Siddharth Murching created SPARK-21799:
--

 Summary: KMeans Performance Regression (5-6x slowdown) in Spark 2.2
 Key: SPARK-21799
 URL: https://issues.apache.org/jira/browse/SPARK-21799
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.2.0
Reporter: Siddharth Murching


I've been running KMeans performance tests using 
[spark-sql-perf|https://github.com/databricks/spark-sql-perf/] and have noticed 
a regression (slowdowns of 5-6x) when running tests on large datasets in Spark 
2.2 vs 2.1.

The test params are:
* Cluster: 510 GB RAM, 16 workers
* Data: 100 examples, 1 features

After talking to [~josephkb], the issue seems related to the changes in 
[SPARK-18356|https://issues.apache.org/jira/browse/SPARK-18356] introduced in 
[this PR|https://github.com/apache/spark/pull/16295].

`df.cache()` doesn't set the storageLevel of `df.rdd`, so `handlePersistence` 
is true even when KMeans is run on a cached DataFrame. This unnecessarily 
causes another copy of the input dataset to be persisted.

As of Spark 2.1 ([JIRA link|https://issues.apache.org/jira/browse/SPARK-16063]) 
`df.cache()` does set the public `df.storageLevel` member properly, so I'd 
suggest replacing instances of `df.rdd.storageLevel` with `df.storageLevel` in 
MLlib algorithms (the same pattern shows up in LogisticRegression, 
LinearRegression, and others).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-08-18 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131868#comment-16131868
 ] 

Siddharth Murching edited comment on SPARK-21770 at 8/18/17 7:48 AM:
-

Good question:

* Predictions on all-zero input don't change (they remain 0 for 
RandomForestClassifier and DecisionTreeClassifier, which are the only models 
that call normalizeToProbabilitiesInPlace())
* This proposal seeks to make predicted probabilities more interpretable when 
raw model output is all-zero
* Regardless, it currently seems impossible for normalizeToProbabilitiesInPlace 
to ever be called on all-zero input, since that'd mean a DecisionTree leaf node 
had a class count array (raw output) of all zeros.

More detail: both DecisionTreeClassifier and RandomForestClassifier inherit 
Classifier's [implementation of 
raw2prediction()|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala#L221],
 which just takes an argmax ([preferring earlier maximal 
entries|https://github.com/apache/spark/blob/master/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala#L176])
 over the model's output vector. A raw model output of all-equal entries would 
result in a prediction of 0 either way.
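
As a sketch of the proposed fix (the signature is assumed from the method name 
discussed here, not taken from the actual MLlib implementation), the all-zero 
case would fall back to a uniform distribution:

import org.apache.spark.ml.linalg.DenseVector

// If every raw score is zero, emit 1/n for each of the n classes;
// otherwise normalize entries by their sum as before.
def normalizeToProbabilitiesInPlace(v: DenseVector): Unit = {
  val sum = v.values.sum
  if (sum == 0.0) {
    java.util.Arrays.fill(v.values, 1.0 / v.size)
  } else {
    var i = 0
    while (i < v.size) { v.values(i) /= sum; i += 1 }
  }
}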



was (Author: siddharth murching):
Good question:

* Predictions on all-zero input don't change (they remain 0 for 
RandomForestClassifier and DecisionTreeClassifier, which are the only models 
that call normalizeToProbabilitiesInPlace())
* This proposal seeks to make predicted probabilities more interpretable when 
raw model output is all-zero
* Regardless, it currently seems impossible for normalizeToProbabilitiesInPlace 
to ever be called on all-zero input, since that'd mean a DecisionTree leaf node 
had a class count array (raw output) of all zeros.

Specifically, both DecisionTreeClassifier and RandomForestClassifier inherit 
Classifier's [implementation of 
raw2prediction()|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala#L221],
 which just takes an argmax ([preferring earlier maximal 
entries|https://github.com/apache/spark/blob/master/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala#L176])
 over the model's output vector. A raw model output of all-equal entries would 
result in a prediction of 0 either way.


> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> ---
>
> Key: SPARK-21770
> URL: https://issues.apache.org/jira/browse/SPARK-21770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Minor
>
> Given an n-element raw prediction vector of all-zeros, 
> ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output 
> a probability vector of all-equal 1/n entries



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-08-18 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16131868#comment-16131868
 ] 

Siddharth Murching commented on SPARK-21770:


Good question:

* Predictions on all-zero input don't change (they remain 0 for 
RandomForestClassifier and DecisionTreeClassifier, which are the only models 
that call normalizeToProbabilitiesInPlace())
* This proposal seeks to make predicted probabilities more interpretable when 
raw model output is all-zero
* Regardless, it currently seems impossible for normalizeToProbabilitiesInPlace 
to ever be called on all-zero input, since that'd mean a DecisionTree leaf node 
had a class count array (raw output) of all zeros.

Specifically, both DecisionTreeClassifier and RandomForestClassifier inherit 
Classifier's [implementation of 
raw2prediction()|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala#L221],
 which just takes an argmax ([preferring earlier maximal 
entries|https://github.com/apache/spark/blob/master/mllib-local/src/main/scala/org/apache/spark/ml/linalg/Vectors.scala#L176])
 over the model's output vector. A raw model output of all-equal entries would 
result in a prediction of 0 either way.
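
For reference, the tie-breaking behavior described above amounts to the 
following (a minimal standalone sketch, not the linked Vectors code itself):

// Scans left to right; the strict '>' keeps the earliest maximal entry,
// so an all-equal (e.g. all-zero) output vector yields prediction 0.
def argmax(values: Array[Double]): Int = {
  var maxIdx = 0
  var i = 1
  while (i < values.length) {
    if (values(i) > values(maxIdx)) maxIdx = i
    i += 1
  }
  maxIdx
}
// argmax(Array(0.0, 0.0, 0.0)) == 0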


> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> ---
>
> Key: SPARK-21770
> URL: https://issues.apache.org/jira/browse/SPARK-21770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Minor
>
> Given an n-element raw prediction vector of all-zeros, 
> ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output 
> a probability vector of all-equal 1/n entries



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-08-17 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-21770:
---
Description: Given an n-element raw prediction vector of all-zeros, 
ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output a 
probability vector of all-equal 1/n entries  (was: Given a raw prediction 
vector of all-zeros, 
ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output a 
probability vector that predicts each class with equal probability (1 / 
numClasses).)

> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> ---
>
> Key: SPARK-21770
> URL: https://issues.apache.org/jira/browse/SPARK-21770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Minor
>
> Given an n-element raw prediction vector of all-zeros, 
> ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output 
> a probability vector of all-equal 1/n entries



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-08-17 Thread Siddharth Murching (JIRA)
Siddharth Murching created SPARK-21770:
--

 Summary: ProbabilisticClassificationModel: Improve normalization 
of all-zero raw predictions
 Key: SPARK-21770
 URL: https://issues.apache.org/jira/browse/SPARK-21770
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Siddharth Murching
Priority: Minor


Given a raw prediction vector of all-zeros, 
ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output a 
probability vector that predicts each class with equal probability (1 / 
numClasses).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible

2016-08-23 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434154#comment-15434154
 ] 

Siddharth Murching commented on SPARK-3162:
---

Here's a design doc with proposed changes - any comments/feedback are much 
appreciated :)

https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/

> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit in one machine’s memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.
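
As a rough illustration of option (1) above (hypothetical names and threshold; 
the concrete design lives in the doc linked earlier), each pass would split the 
training frontier into nodes that can finish locally and nodes that still need 
a distributed pass:

// Nodes whose row counts fit within a single worker's budget can have
// their rows shuffled to one executor and their subtrees grown there;
// the rest take another cluster-wide split-finding pass.
case class NodeStats(nodeId: Int, numRows: Long)

def planNextPass(frontier: Seq[NodeStats],
                 maxLocalRows: Long): (Seq[NodeStats], Seq[NodeStats]) =
  frontier.partition(_.numRows <= maxLocalRows)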



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-3162) Train DecisionTree locally when possible

2016-08-23 Thread Siddharth Murching (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Murching updated SPARK-3162:
--
Comment: was deleted

(was: Here's a design doc with proposed changes - any comments/feedback are 
much appreciated :)
https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing
)

> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit in one machine’s memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3162) Train DecisionTree locally when possible

2016-08-23 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434031#comment-15434031
 ] 

Siddharth Murching edited comment on SPARK-3162 at 8/24/16 1:37 AM:


Here's a design doc with proposed changes - any comments/feedback are much 
appreciated :)
Design doc link: 
[Link|https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing]



was (Author: siddharth murching):
Here's a design doc with proposed changes - any comments/feedback are much 
appreciated :)
[Link|https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing]


> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit in one machine’s memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3162) Train DecisionTree locally when possible

2016-08-23 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434031#comment-15434031
 ] 

Siddharth Murching edited comment on SPARK-3162 at 8/24/16 1:37 AM:


Here's a design doc with proposed changes - any comments/feedback are much 
appreciated :)
[Link|https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing]



was (Author: siddharth murching):
Here's a design doc with proposed changes - any comments/feedback are much 
appreciated :)
[Link](https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing)


> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit in one machine’s memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3162) Train DecisionTree locally when possible

2016-08-23 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434031#comment-15434031
 ] 

Siddharth Murching edited comment on SPARK-3162 at 8/24/16 1:37 AM:


Here's a design doc with proposed changes - any comments/feedback are much 
appreciated :)
https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing



was (Author: siddharth murching):
Here's a design doc with proposed changes - any comments/feedback are much 
appreciated :)
Design doc link: 
[Link|https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing]


> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit in one machine’s memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible

2016-08-23 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15434031#comment-15434031
 ] 

Siddharth Murching commented on SPARK-3162:
---

Here's a design doc with proposed changes - any comments/feedback are much 
appreciated :)
[Link](https://docs.google.com/document/d/1baU5KeorrmLpC4EZoqLuG-E8sUJqmdELLbr8o6wdbVM/edit?usp=sharing)


> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit in one machine’s memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3162) Train DecisionTree locally when possible

2016-08-09 Thread Siddharth Murching (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15414484#comment-15414484
 ] 

Siddharth Murching commented on SPARK-3162:
---

[~josephkb] I'd like to work on this if possible - thanks!

> Train DecisionTree locally when possible
> 
>
> Key: SPARK-3162
> URL: https://issues.apache.org/jira/browse/SPARK-3162
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Improvement: communication
> Currently, every level of a DecisionTree is trained in a distributed manner.  
> However, at deeper levels in the tree, it is possible that a small set of 
> training data will be matched with any given node.  If the node’s training 
> data can fit in one machine’s memory, it may be more efficient to shuffle the 
> data and do local training for the rest of the subtree rooted at that node.
> Note: It is possible that local training would become possible at different 
> levels in different branches of the tree.  There are multiple options for 
> handling this case:
> (1) Train in a distributed fashion until all remaining nodes can be trained 
> locally.  This would entail training multiple levels at once (locally).
> (2) Train branches locally when possible, and interleave this with 
> distributed training of the other branches.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org