On 01/28/2013 02:33 PM, Ted Dunning wrote:
I think I was suggesting something weaker.

I was suggesting that trees get built against a portion of the data and
each node builds some number of trees against just the data it sees.  This
is in the spirit of random forests, but not the letter.
I'm looking at the Mahout partial implementation. It appears to me that some number of trees get built against the portion of the data each node sees based on the map reduce input split.
Am I wrong here?  If not, Mahout already has an out-of-core implementation.

BTW - where did the name Partial Implementation come from? Is this partially completed work :-)

Another option is that all the nodes start some trees based on their own
data and spit the set of all such trees to a central (possibly distributed
store).  They also grab some number of trees back down from the central
store and try elaborating those trees using their own data.  These will
also be sent back to the central store.  This is closer to random forests.

A third option is like option two, but the trees that get pulled down and
spit back up collect statistics until enough nodes report such stats at
which point a new branch is added and the stats are reset.  This is very
close to real random forests.  It should also be efficient because each
node is passing many trees across its own data in each pass.  If that data
fits in memory (even in file cache) then the pass will be very fast.  If it
doesn't fit in memory, then it will run as fast as the disk will let it
run.  Allowing a tree to commit with only 90% coverage should allow good
robustness against laggards while doing almost exactly the same thing as
the real random forest algorithm.

On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <[email protected]> wrote:

Ted,

Sorry, I don't understand. Are you suggesting that a single decision
tree can be built efficiently in a distributed fashion?

The following 2 ways seem like the naive ways of doing this:
1) each machine constructs one node of the tree, by scanning all the
data, filtering those that don't get to its node (this is expensive),
computing the split and then writing the split out. This requires a
pass through the data for each node of the tree.
2) as 1), except that each machine writes out the filtered data after
its node in the tree. This requires less scanning (as much as the
local case I described earlier), but lots of data movement.

Do you have an alternative method?

Andy


On 28 January 2013 16:42, Ted Dunning <[email protected]> wrote:
IF we have a step which permutes data (once) then I doubt that
redistribution is necessary.  At that point the randomness consists of
building trees based on different variable subsets and data subsets.  The
original random forests only split on variable subsets.  How much this
matters is an open question.

On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <[email protected]> wrote:

It sounds OK to me. Yes, I don't think you want to try to parallelize
the building of each tree just to build it off all the data. It will
be too slow.

I imagine the game is to parallelize such that each worker has 1/Nth
of the data, where N is as low as you can manage. Each worker is
building one or more trees from its slice. Then iterate, and split up
the data differently. Each worker can now cross-validate previous
trees based on the new slice, or even build new ones, and iterate
again.

I'm guessing this process will tend to need to build more trees than
usual. I think it's also possible that it will want to try to discard
trees with high out-of-bag error.

Just riffing. I think this is some potential green-field territory to
really think about how you build trees when it doesn't nearly fit in
memory, and what shape of computation gives a really nice tradeoff
between speed and accuracy. I don't think the conventional approach is
that point but it's more like the above.

On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <[email protected]>
wrote:
Here's my thought for what might work well. Comments very welcome.

We construct k decision trees (where k might be 20-100). Each tree has
depth roughly 10-30. Rather than construct each tree in a distributed
fashion, construct each tree locally by using multiple passes over the
bagged sample for that tree (construct in BFS manner, assuming the
tree can fit in memory). Do this in parallel k times. I know this is
perhaps not the "mapreduce" way, but I'd be interested to hear
alternatives. It seems a bad idea to shuffle the data around on every
BFS iteration per tree.

I'm not planning to work on this right now, but would like to figure
out if Mahout is a good platform to work with if I want such an
algorithm.

Andy


--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538


Reply via email to