On Sun, Dec 25, 2011 at 4:08 PM, Lingxiang Cheng
<[email protected]> wrote:

>
>    Thanks for the answer. I am having some difficulty understanding why
> running random forest on top of Hadoop "does not produce arbitrary
> scalability". Could you elaborate?
>

The problem is that random forest learning is difficult to decompose in a
way that gives linear scaling.  For instance, if you shard the data by
features, you want overlap between the features in different shards.  This
means that the total data processed during learning increases
super-linearly with the number of shards.
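
To make that concrete, here is a back-of-the-envelope sketch (not from the
original thread) that counts how many feature columns get read in total,
under the hypothetical assumption that each shard keeps its own slice of
the features plus a fixed 20% overlap with the rest.  The numbers and the
overlap model are illustrative only.

    # Total feature columns touched during training when sharding by
    # feature, assuming each shard reads its own slice plus a fixed
    # overlap fraction of the full feature set (hypothetical model).
    def total_columns_processed(num_features, num_shards, overlap_fraction):
        own = num_features / num_shards          # this shard's slice
        shared = overlap_fraction * num_features  # overlapping columns
        return num_shards * (own + shared)

    for shards in (1, 2, 4, 8, 16):
        total = total_columns_processed(num_features=1000,
                                        num_shards=shards,
                                        overlap_fraction=0.2)
        print(f"{shards:2d} shards -> {total:7.0f} columns processed")

The dataset stays at 1000 columns, but the columns actually processed grow
from 1200 to 4200 as the shard count goes from 1 to 16, so adding shards
does not buy a proportional speedup.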

On the other hand, sharding by training data records leaves you with the
problem of how to combine the resulting models, and it is not clear that
you get the improved training that you want.  Just taking the union of the
trees from each shard's ensemble probably isn't that effective (by analogy
with other types of learning).
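
For reference, here is a minimal sketch of that naive "union of trees"
combination (using scikit-learn purely for illustration; this is not the
Hadoop/MapReduce code being discussed).  It trains an independent forest
on each record shard and then votes over the pooled trees; whether that
pooled forest matches one trained on all the records is exactly the open
question above.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    shards = np.array_split(np.arange(len(X)), 4)   # 4 disjoint record shards

    # One forest per record shard.
    forests = [
        RandomForestClassifier(n_estimators=25, random_state=i).fit(X[idx], y[idx])
        for i, idx in enumerate(shards)
    ]

    # Union of all trees: majority vote across every tree from every shard.
    all_trees = [tree for f in forests for tree in f.estimators_]
    votes = np.stack([t.predict(X) for t in all_trees])   # (n_trees, n_samples)
    pooled_pred = (votes.mean(axis=0) > 0.5).astype(int)

    print("pooled accuracy:", (pooled_pred == y).mean())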



> Also, are you aware of any work that involved developing random forest
> using map-reduce?
>

Well, we have it.  There are fancier efforts as well.  Have you done a web
search?
