You can read the following paper, which I think is useful: http://link.springer.com/chapter/10.1007%2F978-3-642-30217-6_12?LI=true
2013/3/6 Andy Twigg <[email protected]>

> Hi guys,
>
> I've created a JIRA issue for this now -
> https://issues.apache.org/jira/browse/MAHOUT-1153
>
> Right now we're working on my fork but will make an early patch once we
> can. Hopefully that will provide a basis for others who are interested
> to add to it.
>
> Andy
>
> On 22 February 2013 08:49, Andy Twigg <[email protected]> wrote:
> > Hi Marty,
> >
> > I would suggest doing each tree independently, one after the other.
> > Have each node select a random sample w/replacement of its data, and
> > let the master choose the random subset of features to split with. I'm
> > sure there are more efficient ways, but we can think of them later.
> >
> > -Andy
> >
> > On 22 February 2013 01:47, Marty Kube <[email protected]> wrote:
> >> On the algorithm choice, Parallel Boosted Regression Trees looks
> >> interesting as well.
> >>
> >> http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf
> >>
> >> After reading that paper I had a question about the Streaming Parallel
> >> Decision Trees. The SPDT paper is for building one decision tree. I
> >> was thinking about how to extend that to a decision forest. I could
> >> see how one can build many trees in the master and use a random choice
> >> of the splitting variables (as opposed to checking all variables and
> >> using the best variable). However, it seems like the idea of sampling
> >> the dataset with replacement for each tree gets lost. Is that okay?
> >>
> >> On 02/21/2013 03:19 AM, Ted Dunning wrote:
> >>>
> >>> For quantile estimation, consider also stream-lib at
> >>> https://github.com/clearspring/stream-lib
> >>>
> >>> The bigmlcom implementation looks more directly applicable, actually.
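[Editor's note: the per-tree recipe Andy sketches above (one bootstrap sample with replacement per tree, plus a master-chosen random feature subset per split) can be illustrated with a minimal Python sketch. The function and variable names here are hypothetical, for illustration only; the eventual implementation discussed in the thread would be Java/Hadoop.]

```python
import random

def bootstrap_indices(n_rows, rng):
    # One bootstrap per tree: sample n_rows row indices *with* replacement,
    # so some rows appear several times and others not at all.
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def feature_subset(n_features, k, rng):
    # The master picks a random subset of k distinct features to
    # consider when splitting a node (instead of checking all features).
    return rng.sample(range(n_features), k)

# Illustrative use for a single tree over 100 rows and 10 features.
rng = random.Random(42)
rows = bootstrap_indices(100, rng)
feats = feature_subset(10, 3, rng)
```

Doing the trees one after the other, as suggested, means each tree just repeats this pair of draws with fresh randomness.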
> >>> On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <[email protected]> wrote:
> >>>
> >>> Even better, there is already a good implementation of the histograms:
> >>> https://github.com/bigmlcom/histogram
> >>>
> >>> -Andy
> >>>
> >>> On 20 February 2013 22:50, Marty Kube <[email protected]> wrote:
> >>> > That's a winner...
> >>> > Out of all of the algorithms I've looked at, the Ben-Haim/SPDT
> >>> > looks most likely. In batch mode it uses one pass over the data
> >>> > set, it can be used in a streaming mode, and it has constant space
> >>> > and time requirements. That seems like the kind of scalable
> >>> > algorithm we're after.
> >>> > I'm in!
> >>> >
> >>> > On 02/20/2013 10:09 AM, Andy Twigg wrote:
> >>> >>
> >>> >> Alternatively, the algorithm described in [1] is more
> >>> >> straightforward, efficient, hadoop-compatible (using only mappers
> >>> >> communicating to a master) and satisfies all our requirements so
> >>> >> far. I would like to take a pass at implementing that, if anyone
> >>> >> else is interested?
> >>> >>
> >>> >> [1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
> >>> >>
> >>> >> On 20 February 2013 14:27, Andy Twigg <[email protected]> wrote:
> >>> >>>
> >>> >>> Why don't we start from
> >>> >>>
> >>> >>> https://github.com/ashenfad/hadooptree ?
> >>> >>>
> >>> >>> On 20 February 2013 13:25, Marty Kube <[email protected]> wrote:
> >>> >>>>
> >>> >>>> Hi Lorenz,
> >>> >>>>
> >>> >>>> Very interesting, that's what I was asking for when I mentioned
> >>> >>>> non-MR implementations :-)
> >>> >>>>
> >>> >>>> I have not looked at Spark before; interesting that it uses
> >>> >>>> Mesos for clustering. I'll check it out.
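[Editor's note: the one-pass, constant-space behaviour praised above comes from the fixed-size streaming histogram in the Ben-Haim/Yom-Tov paper [1]: each mapper keeps at most B (centroid, count) bins, inserting each point as a new bin and then merging the two closest bins; the master merges the mappers' histograms the same way. A minimal Python sketch of that idea (function names are hypothetical; bins are kept as sorted [centroid, count] pairs):]

```python
def update(bins, x, max_bins):
    # Insert x as a new unit-weight bin, then, if over capacity, merge the
    # adjacent pair with the smallest centroid gap into one weighted bin.
    bins.append([x, 1])
    bins.sort(key=lambda b: b[0])
    if len(bins) > max_bins:
        i = min(range(len(bins) - 1), key=lambda j: bins[j + 1][0] - bins[j][0])
        (p, m), (q, n) = bins[i], bins[i + 1]
        bins[i:i + 2] = [[(p * m + q * n) / (m + n), m + n]]
    return bins

def merge(h1, h2, max_bins):
    # Master-side merge of two mappers' histograms: union the bins, then
    # repeatedly merge the closest adjacent pair until back under capacity.
    bins = sorted(h1 + h2, key=lambda b: b[0])
    while len(bins) > max_bins:
        i = min(range(len(bins) - 1), key=lambda j: bins[j + 1][0] - bins[j][0])
        (p, m), (q, n) = bins[i], bins[i + 1]
        bins[i:i + 2] = [[(p * m + q * n) / (m + n), m + n]]
    return bins

# One pass over a stream, capped at 4 bins regardless of stream length.
h = []
for x in [23.1, 19.2, 30.0, 4.5, 10.1, 12.7, 8.8]:
    update(h, x, 4)
```

Because space is capped at B bins and each insert touches at most B entries, both space and per-point time stay constant, which is the property that makes the algorithm attractive here; the bigmlcom/histogram library linked above is a production implementation of the same structure.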
> >>> >>>>
> >>> >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
> >>> >>>>>
> >>> >>>>> Hi Marty,
> >>> >>>>>
> >>> >>>>> I am currently working on a PLANET-like implementation on top
> >>> >>>>> of Spark: http://spark-project.org
> >>> >>>>>
> >>> >>>>> I think this framework is a nice fit for the problem. If the
> >>> >>>>> input data fits into the total cluster memory, you benefit from
> >>> >>>>> the caching of the RDDs.
> >>> >>>>>
> >>> >>>>> regards,
> >>> >>>>>
> >>> >>>>> lorenz
> >>> >>>>>
> >>> >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube <[email protected]> wrote:
> >>> >>>>>
> >>> >>>>>> You had mentioned other "resource management" platforms like
> >>> >>>>>> Giraph or Mesos. I haven't looked at those yet. I guess I was
> >>> >>>>>> thinking of other parallelization frameworks.
> >>> >>>>>>
> >>> >>>>>> It's interesting that the PLANET folks thought it was really
> >>> >>>>>> worthwhile working on top of map reduce for all of the
> >>> >>>>>> resource management that is built in.
> >>> >>>>>>
> >>> >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
> >>> >>>>>>>
> >>> >>>>>>> If non-MR means a map-only job with communicating mappers and
> >>> >>>>>>> a state store, I am down with that.
> >>> >>>>>>>
> >>> >>>>>>> What did you mean?
> >>> >>>>>>>
> >>> >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <[email protected]> wrote:
> >>> >>>>>>>
> >>> >>>>>>>> Right now I'd lean towards the PLANET model, or maybe a
> >>> >>>>>>>> non-MR implementation. Anyone have a good idea for a non-MR
> >>> >>>>>>>> solution?
> >>> >>>>>>>>
> >>> >>>
> >>> >>> --
> >>> >>> Dr Andy Twigg
> >>> >>> Junior Research Fellow, St Johns College, Oxford
> >>> >>> Room 351, Department of Computer Science
> >>> >>> http://www.cs.ox.ac.uk/people/andy.twigg/
> >>> >>> [email protected] | +447799647538
