[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature

2019-05-20 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-3717:

Labels: bulk-closed  (was: )

> DecisionTree, RandomForest: Partition by feature
> 
>
> Key: SPARK-3717
> URL: https://issues.apache.org/jira/browse/SPARK-3717
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Major
>  Labels: bulk-closed
>
> h1. Summary
> Currently, data are partitioned by row/instance for DecisionTree and 
> RandomForest.  This JIRA argues for partitioning by feature for training deep 
> trees.  This is especially relevant for random forests, which are often 
> trained to be deeper than single decision trees.
> h1. Details
> Dataset dimensions and the depth of the tree to be trained are the main 
> problem parameters determining whether it is better to partition features or 
> instances.  For random forests (training many deep trees), partitioning 
> features could be much better.
> Notation:
> * P = # workers
> * N = # instances
> * M = # features
> * D = depth of tree
> h2. Partitioning Features
> Algorithm sketch (see the code sketch after this list):
> * Each worker stores:
> ** a subset of columns (i.e., a subset of features).  If a worker stores 
> feature j, then the worker stores the feature value for all instances (i.e., 
> the whole column).
> ** all labels
> * Train one level at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node in current level
> * On each iteration:
> ** Each worker: For each node in the level, compute (best feature to split on, info gain).
> ** Reduce the (P x M) values to M values to find the best split for each node.
> ** Workers that own the features used in the best splits communicate the left/right
> assignment for the relevant instances.  Gather a total of N bits to the master, then broadcast.
> * Total communication:
> ** Depth D iterations
> ** On each iteration, reduce to M values (~8 bytes each), broadcast N values 
> (1 bit each).
> ** Estimate: D * (M * 8 + N)
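> A minimal single-process sketch of the per-level worker computation and the reduce step above (plain Scala stand-ins with a variance-reduction gain; the names and types here are illustrative assumptions, not MLlib APIs):
> {code:scala}
> object FeaturePartitionSketch {
>   // One candidate split: owning feature, threshold, and its gain.
>   final case class Split(feature: Int, threshold: Double, gain: Double)
>
>   // A worker stores whole columns for its feature subset, plus all labels.
>   final case class Worker(featureIds: Array[Int],        // global indices of owned features
>                           columns: Array[Array[Double]], // columns(i)(row) = value of feature featureIds(i)
>                           labels: Array[Double])
>
>   private def variance(xs: Seq[Double]): Double =
>     if (xs.isEmpty) 0.0
>     else { val m = xs.sum / xs.size; xs.map(x => (x - m) * (x - m)).sum / xs.size }
>
>   // Per-worker step: best split per node, searched only over the columns this worker owns.
>   def localBestSplits(w: Worker, nodeOf: Array[Int]): Map[Int, Split] = {
>     val rowsByNode = nodeOf.indices.groupBy(i => nodeOf(i))
>     rowsByNode.map { case (node, rows) =>
>       val candidates = for {
>         (col, i) <- w.columns.zipWithIndex.toSeq
>         t        <- rows.map(r => col(r)).distinct.sorted.dropRight(1)
>       } yield {
>         val (l, r) = rows.partition(row => col(row) <= t)
>         val parent = variance(rows.map(row => w.labels(row)))
>         val child  = (l.size * variance(l.map(row => w.labels(row))) +
>                       r.size * variance(r.map(row => w.labels(row)))) / rows.size
>         Split(w.featureIds(i), t, parent - child)
>       }
>       node -> (if (candidates.isEmpty) Split(-1, 0.0, 0.0) else candidates.maxBy(_.gain))
>     }
>   }
>
>   // Reduce step: (P x M) proposals down to the single best split per node.
>   def reduceBest(perWorker: Seq[Map[Int, Split]]): Map[Int, Split] =
>     perWorker.flatten.groupBy(_._1).map { case (node, kvs) =>
>       node -> kvs.map(_._2).maxBy(_.gain)
>     }
> }
> {code}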
> h2. Partitioning Instances
> Algorithm sketch (see the sketch after this list):
> * Train one group of nodes at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node
> * On each iteration:
> ** Each worker: For each instance, add it to the aggregate statistics.
> ** The aggregate is of size (# nodes in group) x M x (# bins) x (# classes)
> *** (“# classes” is for classification; the last dimension is 3 for regression)
> ** Reduce aggregate.
> ** Master chooses best split for each node in group and broadcasts.
> * Local training: Once all instances for a node fit on one machine, it can be 
> best to shuffle the data and train subtrees locally.  This can mean shuffling 
> the entire dataset for each tree trained.
> * Summing over all iterations, the total reduced is:
> ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each)
> ** Estimate: 2^D * M * B * C * 8
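> A compact sketch of the per-iteration aggregate a worker builds under this scheme, assuming a flat (# nodes in group) x M x (# bins) x (# classes) array of counts; the layout and names are illustrative, not the MLlib implementation:
> {code:scala}
> object InstancePartitionSketch {
>   // Flat statistics array of size nodes * features * bins * classes.
>   final case class Agg(nodes: Int, features: Int, bins: Int, classes: Int) {
>     val stats = new Array[Double](nodes * features * bins * classes)
>
>     private def idx(n: Int, f: Int, b: Int, c: Int): Int =
>       ((n * features + f) * bins + b) * classes + c
>
>     // A worker calls this once per instance it holds:
>     // binned(f) is the bin of feature f for the instance, label its class.
>     def add(node: Int, binned: Array[Int], label: Int): Unit =
>       for (f <- 0 until features) stats(idx(node, f, binned(f), label)) += 1.0
>
>     // Element-wise merge, i.e. the reduce across workers.
>     def merge(other: Agg): Agg = {
>       for (i <- stats.indices) stats(i) += other.stats(i)
>       this
>     }
>   }
> }
> {code}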
> h2. Comparing Partitioning Methods
> Partitioning features cost < partitioning instances cost when (see the helper below):
> * D * (M * 8 + N) < 2^D * M * B * C * 8
> * D * N < 2^D * M * B * C * 8  (assuming D * M * 8 is small compared to the 
> right hand side)
> * N < [ 2^D * M * B * C * 8 ] / D
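> As a one-line helper, the bound above gives the largest N for which partitioning by feature is still estimated to be cheaper (REPL-style sketch of the arithmetic only; the function name is made up):
> {code:scala}
> // Break-even instance count: partition by feature wins while N < 2^D * M * B * C * 8 / D
> def breakEvenInstances(d: Int, m: Long, bins: Long, classes: Long): Double =
>   math.pow(2, d) * m * bins * classes * 8 / d
> {code}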
> Example: many instances (evaluated in the snippet below):
> * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 
> 5)
> * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7
> * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8
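> As a quick sanity check of the example, the two estimates evaluated directly (a throwaway snippet, not part of any proposed API):
> {code:scala}
> object CostExample extends App {
>   val levels  = 6       // depth 5 => 6 levels trained
>   val n       = 2.0e6   // instances
>   val m       = 3500.0  // features
>   val bins    = 100.0
>   val classes = 5.0
>
>   val byFeature  = levels * (m * 8 + n)                    // ~1.2e7
>   val byInstance = math.pow(2, 5) * m * bins * classes * 8 // ~4.5e8
>   println(f"partition by feature:  $byFeature%.2e")
>   println(f"partition by instance: $byInstance%.2e")
> }
> {code}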






[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature

2016-07-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3717:
-
Assignee: (was: Joseph K. Bradley)

> DecisionTree, RandomForest: Partition by feature
> 
>
> Key: SPARK-3717
> URL: https://issues.apache.org/jira/browse/SPARK-3717
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>
> h1. Summary
> Currently, data are partitioned by row/instance for DecisionTree and 
> RandomForest.  This JIRA argues for partitioning by feature for training deep 
> trees.  This is especially relevant for random forests, which are often 
> trained to be deeper than single decision trees.
> h1. Details
> Dataset dimensions and the depth of the tree to be trained are the main 
> problem parameters determining whether it is better to partition features or 
> instances.  For random forests (training many deep trees), partitioning 
> features could be much better.
> Notation:
> * P = # workers
> * N = # instances
> * M = # features
> * D = depth of tree
> h2. Partitioning Features
> Algorithm sketch:
> * Each worker stores:
> ** a subset of columns (i.e., a subset of features).  If a worker stores 
> feature j, then the worker stores the feature value for all instances (i.e., 
> the whole column).
> ** all labels
> * Train one level at a time.
> * Invariants:
> ** Each worker stores a mapping: instance → node in current level
> * On each iteration:
> ** Each worker: For each node in level, compute (best feature to split, info 
> gain).
> ** Reduce (P x M) values to M values to find best split for each node.
> ** Workers who have features used in best splits communicate left/right for 
> relevant instances.  Gather total of N bits to master, then broadcast.
> * Total communication:
> ** Depth D iterations
> ** On each iteration, reduce to M values (~8 bytes each), broadcast N values 
> (1 bit each).
> ** Estimate: D * (M * 8 + N)
> h2. Partitioning Instances
> Algorithm sketch:
> * Train one group of nodes at a time.
> * Invariants:
>  * Each worker stores a mapping: instance → node
> * On each iteration:
> ** Each worker: For each instance, add to aggregate statistics.
> ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes)
> *** (“# classes” is for classification.  3 for regression)
> ** Reduce aggregate.
> ** Master chooses best split for each node in group and broadcasts.
> * Local training: Once all instances for a node fit on one machine, it can be 
> best to shuffle data and training subtrees locally.  This can mean shuffling 
> the entire dataset for each tree trained.
> * Summing over all iterations, reduce to total of:
> ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each)
> ** Estimate: 2^D * M * B * C * 8
> h2. Comparing Partitioning Methods
> Partitioning features cost < partitioning instances cost when:
> * D * (M * 8 + N) < 2^D * M * B * C * 8
> * D * N < 2^D * M * B * C * 8  (assuming D * M * 8 is small compared to the 
> right hand side)
> * N < [ 2^D * M * B * C * 8 ] / D
> Example: many instances:
> * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 
> 5)
> * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7
> * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature

2015-11-04 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-3717:
-
Target Version/s:   (was: 1.6.0)




[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature

2015-07-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3717:
-
Target Version/s: 1.6.0




[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature

2015-01-26 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3717:
-
Target Version/s:   (was: 1.3.0)




[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3717:
-
Assignee: Joseph K. Bradley




[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature

2014-11-24 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3717:
-
Target Version/s: 1.3.0




[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature

2014-09-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3717:
-
Description: 

[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature

2014-09-29 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3717:
-
Description: 