Here's my thought on what might work well; comments very welcome. We construct k decision trees (where k might be 20-100), each of depth roughly 10-30. Rather than constructing each tree in a distributed fashion, construct each tree locally using multiple passes over the bagged sample for that tree (building the tree breadth-first, assuming it fits in memory). Do this k times in parallel. I know this is perhaps not the "mapreduce" way, but I'd be interested to hear alternatives; it seems a bad idea to shuffle the data around on every BFS iteration per tree.
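To make the per-tree construction concrete, here's a rough sketch in plain Python (not Mahout code; the names, the restriction to binary 0/1 features, and the Gini criterion are all just illustrative assumptions). Each level of the tree costs one sequential pass over the streamed bagged sample; only the tree and per-frontier-node statistics live in memory, never the data:

```python
import random
from collections import defaultdict

class Node:
    def __init__(self):
        self.feature = None   # split feature index (set once the node splits)
        self.left = None
        self.right = None
        self.label = None     # majority class, used when the node is a leaf

def route(root, row):
    """Follow existing splits down to a not-yet-split node."""
    node = root
    while node.feature is not None:
        node = node.right if row[node.feature] else node.left
    return node

def gini(counts):
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

def side_counts(split_counts, value):
    return {y: c for (v, y), c in split_counts.items() if v == value}

def weighted_gini(split_counts):
    """Impurity of sending feature value 0 left and 1 right."""
    left, right = side_counts(split_counts, 0), side_counts(split_counts, 1)
    nl, nr = sum(left.values()), sum(right.values())
    n = nl + nr
    return (nl * gini(left) + nr * gini(right)) / n if n else 0.0

def grow_tree(stream_rows, n_features, max_depth, mtry):
    """Grow one tree breadth-first; one pass over the data per level."""
    root = Node()
    frontier = [root]
    for _ in range(max_depth):
        # Per frontier node: a random feature subset and (value, class) counts.
        feats = {id(n): random.sample(range(n_features), mtry) for n in frontier}
        stats = {id(n): defaultdict(lambda: defaultdict(int)) for n in frontier}
        totals = {id(n): defaultdict(int) for n in frontier}
        for x, y in stream_rows():        # one sequential pass over the sample
            node = route(root, x)
            if id(node) not in stats:     # landed in an already-finalised leaf
                continue
            totals[id(node)][y] += 1
            for f in feats[id(node)]:
                stats[id(node)][f][(x[f], y)] += 1
        next_frontier = []
        for node in frontier:
            total = totals[id(node)]
            if total:
                node.label = max(total, key=total.get)
            if len(total) <= 1:           # pure or empty: stays a leaf
                continue
            best = min(feats[id(node)],
                       key=lambda f: weighted_gini(stats[id(node)][f]))
            node.feature = best
            node.left, node.right = Node(), Node()
            lc = side_counts(stats[id(node)][best], 0)
            rc = side_counts(stats[id(node)][best], 1)
            node.left.label = max(lc, key=lc.get) if lc else node.label
            node.right.label = max(rc, key=rc.get) if rc else node.label
            next_frontier += [node.left, node.right]
        frontier = next_frontier
        if not frontier:
            break
    return root

def predict(root, x):
    return route(root, x).label
```

A real version would stream rows from HDFS rather than from a Python generator, and each of the k trees would run this loop independently, in parallel, over its own bagged sample.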
I'm not planning to work on this right now, but would like to figure out whether Mahout is a good platform to work with if I want such an algorithm.

Andy

On 26 January 2013 12:39, Andy Twigg <[email protected]> wrote:
> Hi,
>
> I'd like to grow some decision trees efficiently where the dataset is
> too large for memory. The decision tree will almost certainly fit in
> memory. I'm a bit surprised that Mahout doesn't have an algorithm to
> construct decision trees when the data doesn't fit in memory, given
> its focus on large scale!
>
> Can anyone recommend an algorithm that grows random forests 1) when
> the data is out-of-core, 2) with multi-pass streaming access to the
> data, 3) distributed, [optionally: 4) incremental]? I am happy to
> help contribute such an algorithm to Mahout, but would prefer it if
> one already exists. Does anyone have familiarity with the algorithm
> used by BigML.com?
>
> Andy
>
> On 26 January 2013 00:29, Marty Kube <[email protected]> wrote:
>> Hey Andy,
>>
>> What is the use case that is driving your question?
>> Are you looking at the training phase? I didn't realise that one
>> needed to keep the data in memory.
>> I have a use case where even keeping the trees, much less the data,
>> in memory during classification is an issue.
>>
>> Ted, do you have some references on the algorithms below? I've done
>> some work on using mmap'd trees during classification. I'd be happy
>> to contribute along those lines.
>>
>> On 01/25/2013 04:52 PM, Ted Dunning wrote:
>>> Hey Andy,
>>>
>>> There are no plans for this. You are correct that multiple passes
>>> aren't too difficult, but they do go against the standard map-reduce
>>> paradigm a bit if you want to avoid iterative map-reduce.
>>>
>>> It would definitely be nice to have a really competitive random
>>> forest implementation that uses the global accumulator style plus
>>> long-lived mappers. The basic idea would be to use the same sort of
>>> tricks that Vowpal Wabbit or Giraph use to get a bunch of long-lived
>>> mappers and then have them asynchronously talk to a tree repository.
>>>
>>> On Fri, Jan 25, 2013 at 6:58 PM, Andy Twigg <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> I'm new to this list, so I apologise if this is covered elsewhere
>>>> (but I couldn't find it).
>>>>
>>>> I'm looking at the Random Forests implementations, both the
>>>> mapreduce ("partial") one and the non-distributed one. Both appear
>>>> to require the data to be loaded into memory. Random forests should
>>>> be straightforward to construct with multiple passes through the
>>>> data without storing the data in memory. Is there such an
>>>> implementation in Mahout? If not, is there a ticket/plan?
>>>>
>>>> Thanks,
>>>> Andy

--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538
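P.S. To make the "global accumulator plus long-lived mappers" idea quoted above a bit more concrete, here's a toy sketch in plain Python. All names are invented, and threads plus a Queue stand in for long-lived mappers and whatever transport a Vowpal Wabbit/Giraph-style system would actually use; the local statistics are reduced to bare class counts for brevity:

```python
import threading
from collections import Counter
from queue import Queue

# Each "worker" streams over its own data partition, computes local
# statistics, and pushes them to a shared accumulator; a coordinator
# merges them and would then decide the next splits against the shared
# tree repository. Threads here simulate long-lived mappers.

def worker(partition, out: Queue):
    # One pass over this worker's partition: local class counts.
    counts = Counter(label for _, label in partition)
    out.put(counts)

def accumulate(partitions):
    out = Queue()
    threads = [threading.Thread(target=worker, args=(p, out))
               for p in partitions]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    merged = Counter()
    for _ in partitions:
        merged += out.get()
    return merged  # the coordinator picks splits from these merged stats
```

The point is that the workers stay alive across BFS levels, so the data never moves; only the small per-node statistics do.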
