Hi guys, I've created a JIRA issue for this now - https://issues.apache.org/jira/browse/MAHOUT-1153
Right now we're working on my fork but will make an early patch once we can. Hopefully that will provide a basis for others who are interested to add to it. Andy On 22 February 2013 08:49, Andy Twigg <[email protected]> wrote: > Hi Marty, > > I would suggest doing each tree independently, one after the other. > Have each node select a random sample w/replacement of its data, and > let the master choose the random subset of features to split with. I'm > sure there are more efficient ways, but we can think of them later. > > -Andy > > > On 22 February 2013 01:47, Marty Kube <[email protected]> wrote: >> On the algorithm choice, Parallel Boosted Regression Trees looks interesting >> as well. >> >> http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf >> >> After reading that paper I had a question about the Streaming Parallel >> Decision Trees. The SPDT paper is for building one decision tree. I was >> thinking about how to extend that to a decision forest. I could see how one >> can build many trees in the master and use a random choice of the splitting >> variables (as opposed to checking all variables and using the best >> variable). However, it seems like the idea of sampling the dataset with >> replacement for each tree gets lost. Is that okay? >> >> >> On 02/21/2013 03:19 AM, Ted Dunning wrote: >>> >>> For quantile estimation, consider also streamlib at >>> https://github.com/clearspring/stream-lib >>> >>> The bigmlcom implementation looks more directly applicable, actually. >>> >>> On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <[email protected] >>> <mailto:[email protected]>> wrote: >>> >>> Even better, there is already a good implementation of the histograms: >>> https://github.com/bigmlcom/histogram >>> >>> -Andy >>> >>> >>> On 20 February 2013 22:50, Marty Kube <[email protected] >>> <mailto:[email protected]>> wrote: >>> > That's a winner... >>> > Out of all of the algorithms I've looked at the Ben-Haim/SPDT >>> looks most >>> > likely. In batch mode it uses one pass over the data set, it >>> can be used in >>> > a streaming mode, and has constant space and time requirements. >>> That seems >>> > like the kind of scalable algorithm we're after. >>> > I'm in! >>> > >>> > >>> > On 02/20/2013 10:09 AM, Andy Twigg wrote: >>> >> >>> >> Alternatively, the algorithm described in [1] is more >>> straightforward, >>> >> efficient, hadoop-compatible (using only mappers communicating to a >>> >> master) and satisfies all our requirements so far. I would like to >>> >> take a pass at implementing that, if anyone else is interested? >>> >> >>> >> [1] >>> http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf >>> >> >>> >> >>> >> On 20 February 2013 14:27, Andy Twigg <[email protected] >>> <mailto:[email protected]>> wrote: >>> >>> >>> >>> Why don't we start from >>> >>> >>> >>> https://github.com/ashenfad/hadooptree ? >>> >>> >>> >>> On 20 February 2013 13:25, Marty Kube >>> <[email protected] <mailto:[email protected]>> >>> >>> >>> wrote: >>> >>>> >>> >>>> Hi Lorenz, >>> >>>> >>> >>>> Very interesting, that's what I was asking for when I >>> mentioned non-MR >>> >>>> implementations :-) >>> >>>> >>> >>>> I have not looked at spark before, interesting that it uses >>> Mesos for >>> >>>> clustering. I'll check it out. >>> >>>> >>> >>>> >>> >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote: >>> >>>>> >>> >>>>> Hi Marty, >>> >>>>> >>> >>>>> i am currently working on a PLANET-like implementation on >>> top of spark: >>> >>>>> http://spark-project.org >>> >>>>> >>> >>>>> I think this framework is a nice fit for the problem. >>> >>>>> If the input data fits into the "total cluster memory" you >>> benefit from >>> >>>>> the caching of the RDD's. >>> >>>>> >>> >>>>> regards, >>> >>>>> >>> >>>>> lorenz >>> >>>>> >>> >>>>> >>> >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube >>> <[email protected] <mailto:[email protected]>> >>> >>> >>>>> wrote: >>> >>>>> >>> >>>>>> You had mentioned other "resource management" platforms >>> like Giraph or >>> >>>>>> Mesos. I haven't looked at those yet. I guess I was think >>> of other >>> >>>>>> parallelization frameworks. >>> >>>>>> >>> >>>>>> It's interesting that the planet folks thought it was really >>> >>>>>> worthwhile >>> >>>>>> working on top of map reduce for all of the resource >>> management that >>> >>>>>> is >>> >>>>>> built in. >>> >>>>>> >>> >>>>>> >>> >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote: >>> >>>>>>> >>> >>>>>>> If non-MR means map-only job with communicating mappers >>> and a state >>> >>>>>>> store, >>> >>>>>>> I am down with that. >>> >>>>>>> >>> >>>>>>> What did you mean? >>> >>>>>>> >>> >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube < >>> >>>>>>> [email protected] >>> <mailto:[email protected]>> wrote: >>> >>>>>>> >>> >>>>>>>> Right now I'd lean towards the planet model, or maybe a >>> non-MR >>> >>>>>>>> implementation. Anyone have a good idea for a non-MR >>> solution? >>> >>>>>>>> >>> >>> >>> >>> >>> >>> -- >>> >>> Dr Andy Twigg >>> >>> Junior Research Fellow, St Johns College, Oxford >>> >>> Room 351, Department of Computer Science >>> >>> http://www.cs.ox.ac.uk/people/andy.twigg/ >>> >>> [email protected] <mailto:[email protected]> | >>> +447799647538 <tel:%2B447799647538> >>> >>> >> >>> >> >>> >> >>> >> -- >>> >> Dr Andy Twigg >>> >> Junior Research Fellow, St Johns College, Oxford >>> >> Room 351, Department of Computer Science >>> >> http://www.cs.ox.ac.uk/people/andy.twigg/ >>> >> [email protected] <mailto:[email protected]> | >>> +447799647538 <tel:%2B447799647538> >>> >>> > >>> > >>> >>> >>> >>> -- >>> Dr Andy Twigg >>> Junior Research Fellow, St Johns College, Oxford >>> Room 351, Department of Computer Science >>> http://www.cs.ox.ac.uk/people/andy.twigg/ >>> [email protected] <mailto:[email protected]> | >>> +447799647538 <tel:%2B447799647538> >>> >>> >> > > > > -- > Dr Andy Twigg > Junior Research Fellow, St Johns College, Oxford > Room 351, Department of Computer Science > http://www.cs.ox.ac.uk/people/andy.twigg/ > [email protected] | +447799647538 -- Dr Andy Twigg Junior Research Fellow, St Johns College, Oxford Room 351, Department of Computer Science http://www.cs.ox.ac.uk/people/andy.twigg/ [email protected] | +447799647538
