Re: [gsoc] random forests
Here is a draft of my proposal

**

Title/Summary: [Apache Mahout] Implement parallel Random/Regression Forests
Student: AbdelHakim Deneche
Student e-mail: ...
Student Major: PhD in Computer Science
Student Degree: Master in Computer Science
Student Graduation: Spring 2011
Organization: The Apache Software Foundation
Assigned Mentor:

Abstract:

My goal is to add the power of random/regression forests to Mahout. At the end of this summer, one should be able to build random/regression forests for large, possibly distributed, datasets, store the forest, and reuse it to classify new data. In addition, a demo on EC2 is planned.

Detailed Description:

This project is all about random/regression forests. The core component is the tree building algorithm, which builds a tree from a random bootstrap of the whole dataset. I already wrote a detailed description on the Mahout Wiki [RandomForests]. Depending on the size of the dataset, two distributed implementations are possible:

1. The most straightforward one deals with relatively small datasets. By small, I mean a dataset that can be replicated on every node of the cluster. Basically, each mapper has access to the whole dataset, so if the forest contains N trees and we have M mappers, each mapper runs the core building algorithm N/M times. This implementation is relatively easy because each mapper runs the basic building algorithm as it is. It is also of great interest if the user wants to try different parameters when building the forest. An out-of-core implementation is also possible to deal with datasets that cannot fit into the node's memory.

2. The second implementation, which is the most difficult, is concerned with very large datasets that cannot fit on every machine of the cluster. In this case the mappers work differently: each mapper has access to a subset of the dataset, so all the mappers collaborate to build each tree of the forest. The core building algorithm must therefore be rewritten in a map-reduce form. This implementation can deal with datasets of any size, as long as they fit on the cluster.

Although the first implementation is easier to implement, the CPU and IO overhead of the out-of-core variant is still unknown. A reference, non-parallel, implementation should thus be built to better understand the effects of the out-of-core approach, especially for large datasets. This reference implementation is also useful to assess the correctness of the distributed implementation.

Working Plan and list of deliverables

Must-Have:

1. Reference implementation of the Random/Regression Forest building algorithm:
. Build a forest of trees; the basic algorithm (described in the wiki) takes a subset of the dataset as a training set and builds a decision tree. This algorithm is repeated for each tree of the forest.
. The forest is stored in a file, so it can be re-used, at any time, to classify new cases.
. At this step, the necessary changes to Mahout's Classifier interface are made to extend its use beyond Text datasets.

2. Study the effects of large datasets on the reference implementation:
. This step should guide our choice of the proper parallel implementation.

3. Parallel implementation, choose one of the following (a rough sketch of implementation A's mapper is appended at the end of this message):

3a. Parallel implementation A:
. When the dataset can be replicated to all computing nodes.
. Each mapper has access to the whole dataset; if the forest contains N trees and we have M mappers, each mapper runs the basic building algorithm N/M times. The mapper is also responsible for computing the out-of-bag (oob) error estimation.
. The reducer stores the trees in the RF file and merges the oob error estimations.

3b. Parallel implementation B:
. When the dataset is so big that it can no longer fit on every computing node, it must be distributed over the cluster.
. Each mapper has access to a subset of the dataset, so all the mappers collaborate to build each tree of the forest.
. In this case, the basic algorithm must be rewritten to fit the map-reduce paradigm.

Should-Have:

4. Run the Random Forest with a real dataset on EC2:
. This step is important, because running the RF on a local dual core machine is different from running it on a real cluster with a real dataset.
. This can make a good demo for Mahout.
. Amazon has made some interesting datasets available to play with [PublicDatasets]. The US Census dataset comes in various sizes ranging from 2 GB to 200 GB, and should make a very good example.
. At this stage it may be useful to implement [MAHOUT-71] (Dataset to Matrix Reader).

Wanna-Have:

5. If there is still time, implement one or two other important features of RFs, such as variable importance and proximity estimation.

Additional Information:

I am a PhD student at the University Mentouri of Constantine. My primary research goal is a framework to help build Intelligent Adaptive Systems. For the purpose of my Master, I worked on
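PS: here is a very rough sketch of what implementation A's mapper could look like with the old Hadoop mapred API, only to make the N/M trees-per-mapper idea concrete. Dataset, Data, Tree, TreeBuilder and the mahout.rf.* properties are placeholders (nothing like this exists in Mahout yet), and the job is assumed to feed each mapper a single dummy input record.

    import java.io.IOException;
    import java.util.Random;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Dataset, Data, Tree and TreeBuilder stand in for whatever the reference
    // implementation (step 1) ends up providing.
    public class ForestMapperA extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {

      private Dataset dataset;     // the whole dataset, replicated on every node
      private int treesPerMapper;  // N trees / M mappers
      private int partition;       // index of this mapper
      private long seed;

      @Override
      public void configure(JobConf job) {
        treesPerMapper = job.getInt("mahout.rf.trees.per.mapper", 10);
        partition = job.getInt("mapred.task.partition", 0);
        // different seed per mapper so the mappers grow different trees
        seed = job.getLong("mahout.rf.seed", 1L) + partition;
        dataset = Dataset.load(job); // e.g. read the files pushed via the DistributedCache
      }

      @Override
      public void map(LongWritable key, Text value,
                      OutputCollector<IntWritable, Text> output, Reporter reporter)
          throws IOException {
        Random rng = new Random(seed);
        for (int t = 0; t < treesPerMapper; t++) {
          Data bootstrap = dataset.bootstrap(rng);        // sample with replacement
          Tree tree = TreeBuilder.build(bootstrap, rng);  // the core algorithm, as is
          // instances left out of the bootstrap give the out-of-bag error estimate
          double oobError = tree.errorOn(dataset.outOfBag(bootstrap));
          int treeId = partition * treesPerMapper + t;
          output.collect(new IntWritable(treeId), new Text(tree + "\t" + oobError));
          reporter.progress();
        }
      }
    }

The single reducer would then write the trees to the RF file and average the oob error estimations.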
Re: [gsoc] random forests
Deneche, I don't see your application on the GSoC web site, nor on the Apache wiki. Time is running out and I would hate to not see you in the program. Is it just that I can't see the application yet?

On Tue, Mar 31, 2009 at 1:05 PM, deneche abdelhakim a_dene...@yahoo.fr wrote:
Here is a draft of my proposal
Re: [gsoc] random forests
Thank you for your answer, it just made me aware of many hidden, possible future problems with my implementation.

> The first is that for any given application, the odds that the data will not fit in a single machine are small, especially if you have an out-of-core tree builder. Really, really big datasets are increasingly common, but are still a small minority of all datasets.

By out-of-core you mean the builder can fetch the data directly from a file instead of working in memory only?

> One question I have about your plan is whether your step (1) involves building trees or forests only from data held in memory or whether it can be adapted to stream through the data (possibly several times). If a streaming implementation is viable, then it may well be that performance is still quite good for small datasets due to buffering.

I was planning to distribute the dataset files to all workers using Hadoop's DistributedCache. I think that a streaming implementation is feasible: the basic tree building algorithm (described here: http://cwiki.apache.org/MAHOUT/random-forests.html) would have to stream through the data (either in memory or from a file) for each node of the tree. During this pass, it computes the information gain (IG) for the selected variables. This algorithm could be improved to compute the IGs for a list of nodes, thus reducing the total number of passes through the data. When building the forest, the list of nodes comes from all the trees built by the mapper.

> Another way to put this is that the key question is how single node computation scales with input size. If the scaling is relatively linear with data size, then your approach (3) will work no matter the data size. If scaling shows an evil memory size effect, then your approach (2) would be required for large data sets.

I'll have to run some tests before answering this question, but I think that the memory usage of the improved algorithm (described above) will mainly be needed to store the IG computations (variable probabilities, ...). One way to limit the memory usage is to limit the number of tree nodes computed at each data pass. Increasing this limit should reduce the number of data passes but increase the memory usage, and vice versa.

There is still one case that this approach, even out-of-core, cannot handle: very large datasets that cannot fit on the node's hard drive, and thus must be distributed across the cluster.

abdelHakim
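PS: a very rough sketch of the "limit the number of tree nodes computed at each data pass" idea, just to make it concrete. Instance, TreeNode and its split statistics are placeholders, and the choice of which nodes (from which trees) go into a pass is left out.

    import java.util.List;

    class BatchedNodeExpansion {

      /**
       * Expands the open (leaf) nodes of the trees being built, at most
       * maxNodesPerPass of them per streaming pass over the data. A larger
       * batch means fewer passes but more memory for the per-node IG statistics.
       */
      static void expand(Iterable<Instance> data, List<TreeNode> openNodes,
                         int maxNodesPerPass) {
        int from = 0;
        while (from < openNodes.size()) {
          int to = Math.min(from + maxNodesPerPass, openNodes.size());
          List<TreeNode> batch = openNodes.subList(from, to);

          // one pass over the data (in memory or from a file): route each
          // instance to the open node it falls into and update that node's
          // information gain statistics for the selected variables
          for (Instance instance : data) {
            for (TreeNode node : batch) {
              if (node.contains(instance)) {
                node.updateStats(instance);
                break;
              }
            }
          }

          // after the pass, choose the best split (highest IG) for each node
          for (TreeNode node : batch) {
            node.splitOnBestCandidate();
          }

          from = to;
        }
      }
    }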
Re: [gsoc] random forests
Indeed. And those datasets exist.

It is also plausible that this full data scan approach will fail when you want the forest building to take less time.

It is also plausible that a full data scan approach fails to improve enough on a non-parallel implementation. This would happen if a significantly large fraction of the entire forest could be built on a single node. That would happen if the CPU requirements for forest building are overshadowed by the I/O cost of scanning the data set. This would imply that there is a small limit to the amount of parallelism that would help.

You will know much more about this after you finish the non-parallel implementation than either of us knows now.

On Mon, Mar 30, 2009 at 7:24 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:
There is still one case that this approach, even out-of-core, cannot handle: very large datasets that cannot fit on the node's hard drive, and thus must be distributed across the cluster.

--
Ted Dunning, CTO
DeepDyve
Re: [gsoc] random forests
I suggest that we all learn from the experience you are about to have on the reference implementation.

And, yes, I did mean the reference implementation when I said non-parallel. Thanks for clarifying.

On Mon, Mar 30, 2009 at 10:45 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:
What do you suggest? And just to make sure, by 'non-parallel implementation' you mean the reference implementation, right?

--
Ted Dunning, CTO
DeepDyve
Re: [gsoc] random forests
I have two answers for you.

The first is that for any given application, the odds that the data will not fit in a single machine are small, especially if you have an out-of-core tree builder. Really, really big datasets are increasingly common, but are still a small minority of all datasets.

The second answer is that the odds that SOME mahout application will be too large for a single node are quite high.

These aren't contradictory. They just describe the long-tail nature of problem sizes.

One question I have about your plan is whether your step (1) involves building trees or forests only from data held in memory or whether it can be adapted to stream through the data (possibly several times). If a streaming implementation is viable, then it may well be that performance is still quite good for small datasets due to buffering. If streaming works, then a single node will be able to handle very large datasets but will just be kind of slow. As you point out, that can be remedied trivially.

Another way to put this is that the key question is how single node computation scales with input size. If the scaling is relatively linear with data size, then your approach (3) will work no matter the data size. If scaling shows an evil memory size effect, then your approach (2) would be required for large data sets.

On Sat, Mar 28, 2009 at 8:14 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:
My question is: when Mahout.RF will be used in a real application, what are the odds that the dataset will be so large that it can't fit on every machine of the cluster? The answer to this question should help me decide which implementation I'll choose.

--
Ted Dunning, CTO
DeepDyve
111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)
Re: [gsoc] random forests
In 2a, you should read:

. This implementation is, relatively, easy given...

--- On Sat, 28 Mar 2009, deneche abdelhakim a_dene...@yahoo.fr wrote:

From: deneche abdelhakim a_dene...@yahoo.fr
Subject: Re: [gsoc] random forests
To: mahout-dev@lucene.apache.org
Date: Saturday 28 March 2009, 16:14

I'm actually writing my working plan, and it looks like this:

*

1. Reference implementation of the Random/Regression Forest building algorithm:
. Build a forest of trees; the basic algorithm (described in the wiki) takes a subset of the dataset as a training set and builds a decision tree. This basic algorithm is repeated for each tree of the forest.
. The forest is stored in a file, so it can be used later to classify new cases.

2a. Distributed implementation A:
. When the dataset can be replicated to all computing nodes.
. Each mapper has access to the whole dataset; if the forest contains N trees and we have M mappers, each mapper runs the basic building algorithm N/M times.
. This implementation is, relatively, given that the reference implementation is available, because each mapper runs the basic building algorithm as it is.

2b. Distributed implementation B:
. When the dataset is so big that it can no longer fit on every computing node, it must be distributed over the cluster.
. Each mapper has access to a subset of the dataset, so all the mappers collaborate to build each tree of the forest.
. In this case, the basic algorithm must be rewritten to fit the map-reduce paradigm.

3. Run the Random Forest with a real dataset on EC2:
. This step is important, because running the RF on a local dual core machine is way different from running it on a real cluster with a real dataset.
. This can make for a good demo for Mahout.

4. If there is still time, implement one or two other important features of RFs, such as variable importance and proximity estimation.

*

It is clear from the plan that I won't be able to do all those steps, and in some way I must choose only one implementation (2a or 2b) to do. The first implementation should take less time to implement than 2b, and I'm quite sure I can go up to the 4th step, adding other features to the RF. BUT the second implementation is the only one capable of dealing with very large distributed datasets.

My question is: when Mahout.RF will be used in a real application, what are the odds that the dataset will be so large that it can't fit on every machine of the cluster? The answer to this question should help me decide which implementation I'll choose.

--- On Sun, 22 Mar 2009, Ted Dunning ted.dunn...@gmail.com wrote:

From: Ted Dunning ted.dunn...@gmail.com
Subject: Re: [gsoc] random forests
To: mahout-dev@lucene.apache.org
Date: Sunday 22 March 2009, 0:36

Great expression! You may be right about the nose-bleed tendency between the two methods.

On Sat, Mar 21, 2009 at 4:46 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:
I can't find a no-nose-bleeding algorithm

--
Ted Dunning, CTO
DeepDyve
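PS: for step 1's "forest stored in a file", the simplest thing I can think of is plain Java serialization, roughly like the sketch below. DecisionTree is a placeholder for whatever tree class the reference implementation ends up with; a Hadoop Writable-based format may fit better in the end.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.List;

    class ForestFile {

      /** Writes the whole forest to a single file so it can be reloaded later. */
      static void save(List<? extends Serializable> forest, String path) throws IOException {
        ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(path));
        try {
          out.writeObject(forest);
        } finally {
          out.close();
        }
      }

      /** Reads the forest back for classifying new cases. */
      @SuppressWarnings("unchecked")
      static List<DecisionTree> load(String path) throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new FileInputStream(path));
        try {
          return (List<DecisionTree>) in.readObject();
        } finally {
          in.close();
        }
      }
    }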
Re: [gsoc] random forests
Yeah, Breiman states that at each node, m variables are selected at random out of the M. I modified the wiki page: in LearnUnprunedTree(X,Y), which builds the tree iteratively, one node at a time, I added this line before the search for the best split:

select m variables at random out of the M variables
For j = 1 .. m
...

--- On Mon, 16 Mar 2009, Ted Dunning ted.dunn...@gmail.com wrote:

From: Ted Dunning ted.dunn...@gmail.com
Subject: Re: [gsoc] random forests
To: mahout-dev@lucene.apache.org
Date: Monday 16 March 2009, 7:26

Nice writeup.

One thing that I was confused about for a long time is whether the choice of variables to use for splits is chosen once per tree or again at each split. I think that the latter interpretation is actually the correct one. You should check my thought.

On Sun, Mar 15, 2009 at 1:53 AM, deneche abdelhakim a_dene...@yahoo.fr wrote:
I added a page to the wiki that describes how to build a random forest and how to use it to classify new cases.

--
Ted Dunning, CTO
DeepDyve
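PS: just to make that line concrete, here is one simple way the per-node selection could be coded (only a sketch, not tied to any existing Mahout class):

    import java.util.Random;

    class VariableSelection {

      /**
       * Picks m distinct variable indices at random out of the M available,
       * as done at each node before searching for the best split
       * (partial Fisher-Yates shuffle, so only the first m slots are randomized).
       */
      static int[] selectVariables(int M, int m, Random rng) {
        int[] all = new int[M];
        for (int i = 0; i < M; i++) {
          all[i] = i;
        }
        for (int i = 0; i < m; i++) {
          int j = i + rng.nextInt(M - i);
          int tmp = all[i];
          all[i] = all[j];
          all[j] = tmp;
        }
        int[] selected = new int[m];
        System.arraycopy(all, 0, selected, 0, m);
        return selected;
      }
    }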