On Mon, Oct 4, 2010 at 11:44 AM, deneche abdelhakim <[email protected]> wrote:
> For Decision Forests, my goal for 0.5 is to add a 'full'
> implementation. Meaning, an implementation that can build random
> forests using the whole dataset, even if its split among many
> machines. I found the following paper to be very interesting:
> http://www.cba.ua.edu/~mhardin/rainforest.pdf
> although the described approach doesn't work as it is for numerical
> attributes.
>

Very cool. I would love it if DF became a first class Mahout classifier.
As well as scaling up, it would be very nice if there were a model
compression step to help with the deployment of DF models.

>
> The implementation should at least work for the following dataset:
>
> http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2304&categoryID=248
> it's 50 GB, and a small subset is available in UCI. It contains only
> categorical attributes, and it's big enough to be a good candidate.
>

Which UCI dataset is this? The income>50k$ one? Does the AWS dataset
have household income?
