[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15392775#comment-15392775 ]
Joseph K. Bradley commented on SPARK-3717: ------------------------------------------ Note: Initial code for this is available here: [https://spark-packages.org/package/fabuzaid21/yggdrasil] > DecisionTree, RandomForest: Partition by feature > ------------------------------------------------ > > Key: SPARK-3717 > URL: https://issues.apache.org/jira/browse/SPARK-3717 > Project: Spark > Issue Type: Improvement > Components: MLlib > Reporter: Joseph K. Bradley > Assignee: Joseph K. Bradley > > h1. Summary > Currently, data are partitioned by row/instance for DecisionTree and > RandomForest. This JIRA argues for partitioning by feature for training deep > trees. This is especially relevant for random forests, which are often > trained to be deeper than single decision trees. > h1. Details > Dataset dimensions and the depth of the tree to be trained are the main > problem parameters determining whether it is better to partition features or > instances. For random forests (training many deep trees), partitioning > features could be much better. > Notation: > * P = # workers > * N = # instances > * M = # features > * D = depth of tree > h2. Partitioning Features > Algorithm sketch: > * Each worker stores: > ** a subset of columns (i.e., a subset of features). If a worker stores > feature j, then the worker stores the feature value for all instances (i.e., > the whole column). > ** all labels > * Train one level at a time. > * Invariants: > ** Each worker stores a mapping: instance → node in current level > * On each iteration: > ** Each worker: For each node in level, compute (best feature to split, info > gain). > ** Reduce (P x M) values to M values to find best split for each node. > ** Workers who have features used in best splits communicate left/right for > relevant instances. Gather total of N bits to master, then broadcast. > * Total communication: > ** Depth D iterations > ** On each iteration, reduce to M values (~8 bytes each), broadcast N values > (1 bit each). > ** Estimate: D * (M * 8 + N) > h2. Partitioning Instances > Algorithm sketch: > * Train one group of nodes at a time. > * Invariants: > * Each worker stores a mapping: instance → node > * On each iteration: > ** Each worker: For each instance, add to aggregate statistics. > ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) > *** (“# classes” is for classification. 3 for regression) > ** Reduce aggregate. > ** Master chooses best split for each node in group and broadcasts. > * Local training: Once all instances for a node fit on one machine, it can be > best to shuffle data and training subtrees locally. This can mean shuffling > the entire dataset for each tree trained. > * Summing over all iterations, reduce to total of: > ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) > ** Estimate: 2^D * M * B * C * 8 > h2. Comparing Partitioning Methods > Partitioning features cost < partitioning instances cost when: > * D * (M * 8 + N) < 2^D * M * B * C * 8 > * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the > right hand side) > * N < [ 2^D * M * B * C * 8 ] / D > Example: many instances: > * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = > 5) > * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 > * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org