You can read the following paper, which I think is useful: http://link.springer.com/chapter/10.1007%2F978-3-642-30217-6_12?LI=true
2013/3/6 Andy Twigg <[email protected]>

> Hi guys,
>
> I've created a JIRA issue for this now -
> https://issues.apache.org/jira/browse/MAHOUT-1153
>
> Right now we're working on my fork but will make an early patch once we
> can. Hopefully that will provide a basis for others who are interested
> to add to it.
>
> Andy
>
> On 22 February 2013 08:49, Andy Twigg <[email protected]> wrote:
> > Hi Marty,
> >
> > I would suggest doing each tree independently, one after the other.
> > Have each node select a random sample w/replacement of its data, and
> > let the master choose the random subset of features to split with. I'm
> > sure there are more efficient ways, but we can think of them later.
> >
> > -Andy
> >
> > On 22 February 2013 01:47, Marty Kube <[email protected]> wrote:
> >> On the algorithm choice, Parallel Boosted Regression Trees looks
> >> interesting as well.
> >>
> >> http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf
> >>
> >> After reading that paper I had a question about the Streaming Parallel
> >> Decision Trees. The SPDT paper is for building one decision tree. I
> >> was thinking about how to extend that to a decision forest. I could
> >> see how one can build many trees in the master and use a random choice
> >> of the splitting variables (as opposed to checking all variables and
> >> using the best variable). However, it seems like the idea of sampling
> >> the dataset with replacement for each tree gets lost. Is that okay?
> >>
> >> On 02/21/2013 03:19 AM, Ted Dunning wrote:
> >>>
> >>> For quantile estimation, consider also stream-lib at
> >>> https://github.com/clearspring/stream-lib
> >>>
> >>> The bigmlcom implementation looks more directly applicable, actually.
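[Editor's note: the per-tree recipe Andy sketches above (one bootstrap sample with replacement per tree, plus a master-chosen random feature subset per split) can be illustrated with a minimal Python sketch. The function and variable names here are hypothetical, for illustration only; the eventual implementation discussed in the thread would be Java/Hadoop.]

```python
import random

def bootstrap_indices(n_rows, rng):
    # One bootstrap per tree: sample n_rows row indices *with* replacement,
    # so some rows appear several times and others not at all.
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def feature_subset(n_features, k, rng):
    # The master picks a random subset of k distinct features to
    # consider when splitting a node (instead of checking all features).
    return rng.sample(range(n_features), k)

# Illustrative use for a single tree over 100 rows and 10 features.
rng = random.Random(42)
rows = bootstrap_indices(100, rng)
feats = feature_subset(10, 3, rng)
```

Doing the trees one after the other, as suggested, means each tree just repeats this pair of draws with fresh randomness.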
> >>> On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <[email protected]> wrote:
> >>>
> >>> Even better, there is already a good implementation of the histograms:
> >>> https://github.com/bigmlcom/histogram
> >>>
> >>> -Andy
> >>>
> >>> On 20 February 2013 22:50, Marty Kube <[email protected]> wrote:
> >>> > That's a winner...
> >>> > Out of all of the algorithms I've looked at, the Ben-Haim/SPDT
> >>> > looks most likely. In batch mode it uses one pass over the data
> >>> > set, it can be used in a streaming mode, and it has constant space
> >>> > and time requirements. That seems like the kind of scalable
> >>> > algorithm we're after.
> >>> > I'm in!
> >>> >
> >>> > On 02/20/2013 10:09 AM, Andy Twigg wrote:
> >>> >>
> >>> >> Alternatively, the algorithm described in [1] is more
> >>> >> straightforward, efficient, hadoop-compatible (using only mappers
> >>> >> communicating to a master) and satisfies all our requirements so
> >>> >> far. I would like to take a pass at implementing that, if anyone
> >>> >> else is interested?
> >>> >>
> >>> >> [1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
> >>> >>
> >>> >> On 20 February 2013 14:27, Andy Twigg <[email protected]> wrote:
> >>> >>>
> >>> >>> Why don't we start from
> >>> >>>
> >>> >>> https://github.com/ashenfad/hadooptree ?
> >>> >>>
> >>> >>> On 20 February 2013 13:25, Marty Kube <[email protected]> wrote:
> >>> >>>>
> >>> >>>> Hi Lorenz,
> >>> >>>>
> >>> >>>> Very interesting, that's what I was asking for when I mentioned
> >>> >>>> non-MR implementations :-)
> >>> >>>>
> >>> >>>> I have not looked at Spark before; interesting that it uses
> >>> >>>> Mesos for clustering. I'll check it out.
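[Editor's note: the one-pass, constant-space behaviour praised above comes from the fixed-size streaming histogram in the Ben-Haim/Yom-Tov paper [1]: each mapper keeps at most B (centroid, count) bins, inserting each point as a new bin and then merging the two closest bins; the master merges the mappers' histograms the same way. A minimal Python sketch of that idea (function names are hypothetical; bins are kept as sorted [centroid, count] pairs):]

```python
def update(bins, x, max_bins):
    # Insert x as a new unit-weight bin, then, if over capacity, merge the
    # adjacent pair with the smallest centroid gap into one weighted bin.
    bins.append([x, 1])
    bins.sort(key=lambda b: b[0])
    if len(bins) > max_bins:
        i = min(range(len(bins) - 1), key=lambda j: bins[j + 1][0] - bins[j][0])
        (p, m), (q, n) = bins[i], bins[i + 1]
        bins[i:i + 2] = [[(p * m + q * n) / (m + n), m + n]]
    return bins

def merge(h1, h2, max_bins):
    # Master-side merge of two mappers' histograms: union the bins, then
    # repeatedly merge the closest adjacent pair until back under capacity.
    bins = sorted(h1 + h2, key=lambda b: b[0])
    while len(bins) > max_bins:
        i = min(range(len(bins) - 1), key=lambda j: bins[j + 1][0] - bins[j][0])
        (p, m), (q, n) = bins[i], bins[i + 1]
        bins[i:i + 2] = [[(p * m + q * n) / (m + n), m + n]]
    return bins

# One pass over a stream, capped at 4 bins regardless of stream length.
h = []
for x in [23.1, 19.2, 30.0, 4.5, 10.1, 12.7, 8.8]:
    update(h, x, 4)
```

Because space is capped at B bins and each insert touches at most B entries, both space and per-point time stay constant, which is the property that makes the algorithm attractive here; the bigmlcom/histogram library linked above is a production implementation of the same structure.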
> >>> >>>>
> >>> >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
> >>> >>>>>
> >>> >>>>> Hi Marty,
> >>> >>>>>
> >>> >>>>> I am currently working on a PLANET-like implementation on top
> >>> >>>>> of Spark: http://spark-project.org
> >>> >>>>>
> >>> >>>>> I think this framework is a nice fit for the problem. If the
> >>> >>>>> input data fits into the total cluster memory, you benefit from
> >>> >>>>> the caching of the RDDs.
> >>> >>>>>
> >>> >>>>> regards,
> >>> >>>>>
> >>> >>>>> lorenz
> >>> >>>>>
> >>> >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube <[email protected]> wrote:
> >>> >>>>>
> >>> >>>>>> You had mentioned other "resource management" platforms like
> >>> >>>>>> Giraph or Mesos. I haven't looked at those yet. I guess I was
> >>> >>>>>> thinking of other parallelization frameworks.
> >>> >>>>>>
> >>> >>>>>> It's interesting that the PLANET folks thought it was really
> >>> >>>>>> worthwhile working on top of map reduce for all of the
> >>> >>>>>> resource management that is built in.
> >>> >>>>>>
> >>> >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
> >>> >>>>>>>
> >>> >>>>>>> If non-MR means a map-only job with communicating mappers and
> >>> >>>>>>> a state store, I am down with that.
> >>> >>>>>>>
> >>> >>>>>>> What did you mean?
> >>> >>>>>>>
> >>> >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <[email protected]> wrote:
> >>> >>>>>>>
> >>> >>>>>>>> Right now I'd lean towards the PLANET model, or maybe a
> >>> >>>>>>>> non-MR implementation. Anyone have a good idea for a non-MR
> >>> >>>>>>>> solution?
> >>> >>>>>>>>
> >>> >>>
> >>> >>> --
> >>> >>> Dr Andy Twigg
> >>> >>> Junior Research Fellow, St Johns College, Oxford
> >>> >>> Room 351, Department of Computer Science
> >>> >>> http://www.cs.ox.ac.uk/people/andy.twigg/
> >>> >>> [email protected] | +447799647538
