Re: Out-of-core random forest implementation

Andy Twigg Mon, 28 Jan 2013 08:57:44 -0800

Ted,

Sorry, I don't understand. Are you suggesting that a single decision
tree can be built efficiently in a distributed fashion?


The following 2 ways seem like the naive ways of doing this:
1) each machine constructs one node of the tree, by scanning all the
data, filtering those that don't get to its node (this is expensive),
computing the split and then writing the split out. This requires a
pass through the data for each node of the tree.
2) as 1), except that each machine writes out the filtered data after
its node in the tree. This requires less scanning (as much as the
local case I described earlier), but lots of data movement.

Do you have an alternative method?

Andy


On 28 January 2013 16:42, Ted Dunning <[email protected]> wrote:
> IF we have a step which permutes data (once) then I doubt that
> redistribution is necessary.  At that point the randomness consists of
> building trees based on different variable subsets and data subsets.  The
> original random forests only split on variable subsets.  How much this
> matters is an open question.
>
> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <[email protected]> wrote:
>
>> It sounds OK to me. Yes, I don't think you want to try to parallelize
>> the building of each tree just to build it off all the data. It will
>> be too slow.
>>
>> I imagine the game is to parallelize such that each worker has 1/Nth
>> of the data, where N is as low as you can manage. Each worker is
>> building one or more trees from its slice. Then iterate, and split up
>> the data differently. Each worker can now cross-validate previous
>> trees based on the new slice, or even build new ones, and iterate
>> again.
>>
>> I'm guessing this process will tend to need to build more trees than
>> usual. I think it's also possible that it will want to try to discard
>> trees with high out-of-bag error.
>>
>> Just riffing. I think this is some potential green-field territory to
>> really think about how you build trees when it doesn't nearly fit in
>> memory, and what shape of computation gives a really nice tradeoff
>> between speed and accuracy. I don't think the conventional approach is
>> that point but it's more like the above.
>>
>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <[email protected]> wrote:
>> > Here's my thought for what might work well. Comments very welcome.
>> >
>> > We construct k decision trees (where k might be 20-100). Each tree has
>> > depth roughly 10-30. Rather than construct each tree in a distributed
>> > fashion, construct each tree locally by using multiple passes over the
>> > bagged sample for that tree (construct in BFS manner, assuming the
>> > tree can fit in memory). Do this in parallel k times. I know this is
>> > perhaps not the "mapreduce" way, but I'd be interested to hear
>> > alternatives. It seems a bad idea to shuffle the data around on every
>> > BFS iteration per tree.
>> >
>> > I'm not planning to work on this right now, but would like to figure
>> > out if Mahout is a good platform to work with if I want such an
>> > algorithm.
>> >
>> > Andy
>>



-- 
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538

Re: Out-of-core random forest implementation

Reply via email to