Thanks for bringing that to my attention. I don't have access to the
PDF right now, but I will take a look. What we have at the moment is a
single-machine procedure for scanning through the data, building a
set of histograms, merging those histograms, and then expanding the
tree. The next step is to decide the best way to distribute this. I'm
not an expert here, so any advice or help is welcome.
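For concreteness, the core structure looks something like this: a
rough Java sketch of a fixed-size streaming histogram in the spirit of
the Ben-Haim/Tom-Tov paper. The class and method names here are purely
illustrative (not the Mahout or bigml API):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: a bounded-size streaming histogram with the
// two operations the procedure needs, update (add a point) and merge
// (combine histograms built on different parts of the data).
class StreamHist {
    static class Bin {
        double p;   // bin centroid
        long m;     // bin count
        Bin(double p, long m) { this.p = p; this.m = m; }
    }

    final int maxBins;
    final List<Bin> bins = new ArrayList<>(); // kept sorted by centroid

    StreamHist(int maxBins) { this.maxBins = maxBins; }

    // Insert a point as a unit bin, then shrink back to maxBins.
    void update(double x) {
        int i = 0;
        while (i < bins.size() && bins.get(i).p < x) i++;
        bins.add(i, new Bin(x, 1));
        shrink();
    }

    // Union the two bin lists, then shrink back to maxBins.
    void merge(StreamHist other) {
        for (Bin b : other.bins) {
            int i = 0;
            while (i < bins.size() && bins.get(i).p < b.p) i++;
            bins.add(i, new Bin(b.p, b.m));
        }
        shrink();
    }

    // Repeatedly replace the two closest adjacent bins by their
    // count-weighted average until at most maxBins remain.
    private void shrink() {
        while (bins.size() > maxBins) {
            int best = 0;
            for (int i = 1; i + 1 < bins.size(); i++)
                if (bins.get(i + 1).p - bins.get(i).p
                        < bins.get(best + 1).p - bins.get(best).p)
                    best = i;
            Bin a = bins.get(best), b = bins.get(best + 1);
            long m = a.m + b.m;
            bins.set(best, new Bin((a.p * a.m + b.p * b.m) / m, m));
            bins.remove(best + 1);
        }
    }

    long total() {
        long t = 0;
        for (Bin b : bins) t += b.m;
        return t;
    }
}
```

The point is that both update() and merge() keep the bin count
bounded, which is what makes combining per-worker histograms at a
single point cheap.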

I think the easiest approach would be to use the mappers to construct
the set of histograms, and then send all histograms for a given leaf
to a reducer, which decides how to expand that leaf. The code I have
can almost be ported as-is to a mapper and a reducer in this way.
Would using the distributed cache to send the updated tree back to the
mappers be wise, or is there a better way?
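To sketch what I have in mind, here is a plain-Java simulation of that
data flow (deliberately not the real Hadoop API): map() stands in for
a mapper building per-leaf histograms over its shard, and reduce() for
a reducer that merges them and picks a split. The fixed feature range
and the median-as-split rule are simplifying assumptions, not the real
criterion we'd use:

```java
import java.util.*;

// Illustrative data-flow sketch only (no Hadoop API). Records are
// {leafId, featureValue}; the "shuffle" would group mapper outputs
// by leaf id before handing them to reduce().
class MapReduceSketch {
    static final int BINS = 16;
    static final double LO = 0.0, HI = 1.0; // assumed known feature range

    // "Mapper": one fixed-range histogram (bin counts) per leaf seen
    // in this shard of the data.
    static Map<Integer, long[]> map(List<double[]> shard) {
        Map<Integer, long[]> out = new HashMap<>();
        for (double[] rec : shard) {
            int leaf = (int) rec[0];
            double x = rec[1];
            long[] h = out.computeIfAbsent(leaf, k -> new long[BINS]);
            int b = (int) ((x - LO) / (HI - LO) * BINS);
            h[Math.min(Math.max(b, 0), BINS - 1)]++;
        }
        return out;
    }

    // "Reducer": merge all histograms for one leaf, then return an
    // approximate median as the split threshold (a stand-in for a
    // proper impurity-based split criterion).
    static double reduce(List<long[]> hists) {
        long[] merged = new long[BINS];
        long total = 0;
        for (long[] h : hists)
            for (int i = 0; i < BINS; i++) { merged[i] += h[i]; total += h[i]; }
        long seen = 0;
        for (int i = 0; i < BINS; i++) {
            seen += merged[i];
            if (2 * seen >= total)
                return LO + (i + 1) * (HI - LO) / BINS; // right bin edge
        }
        return HI;
    }
}
```

The shape of map() and reduce() is what I mean by "ported almost
as-is": the existing build-histograms and merge-and-split steps become
the two phases, with the leaf id as the shuffle key.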



The PhD thesis on streaming random forests [1] might also be of interest.
[1] http://www.collectionscanada.gc.ca/obj/s4/f2/dsk3/OKQ/TC-OKQ-1321.pdf

On 7 March 2013 02:31, Meng Qinghan <[email protected]> wrote:
> You can read the following paper, which I think is useful.
>
> http://link.springer.com/chapter/10.1007%2F978-3-642-30217-6_12?LI=true
>
> 2013/3/6 Andy Twigg <[email protected]>
>
>> Hi guys,
>>
>> I've created a JIRA issue for this now -
>> https://issues.apache.org/jira/browse/MAHOUT-1153
>>
>> Right now we're working on my fork but will make an early patch once
>> we can. Hopefully that will provide a basis for others who are
>> interested to add to it.
>>
>> Andy
>>
>>
>> On 22 February 2013 08:49, Andy Twigg <[email protected]> wrote:
>> > Hi Marty,
>> >
>> > I would suggest doing each tree independently, one after the other.
>> > Have each node select a random sample w/replacement of its data, and
>> > let the master choose the random subset of features to split with. I'm
>> > sure there are more efficient ways, but we can think of them later.
>> >
>> > -Andy
>> >
>> >
>> >> On 22 February 2013 01:47, Marty Kube <[email protected]> wrote:
>> >> On the algorithm choice, Parallel Boosted Regression Trees looks
>> >> interesting as well.
>> >>
>> >> http://research.engineering.wustl.edu/~tyrees/Publications_files/fr819-tyreeA.pdf
>> >>
>> >> After reading that paper I had a question about the Streaming Parallel
>> >> Decision Trees.  The SPDT paper is for building one decision tree.  I
>> >> was thinking about how to extend that to a decision forest.  I could
>> >> see how one can build many trees in the master and use a random choice
>> >> of the splitting variables (as opposed to checking all variables and
>> >> using the best variable).  However, it seems like the idea of sampling
>> >> the dataset with replacement for each tree gets lost.  Is that okay?
>> >>
>> >>
>> >> On 02/21/2013 03:19 AM, Ted Dunning wrote:
>> >>>
>> >>> For quantile estimation, consider also stream-lib at
>> >>> https://github.com/clearspring/stream-lib
>> >>>
>> >>> The bigmlcom implementation looks more directly applicable, actually.
>> >>>
>> >>> On Wed, Feb 20, 2013 at 5:01 PM, Andy Twigg <[email protected]> wrote:
>> >>>
>> >>>     Even better, there is already a good implementation of the
>> >>>     histograms: https://github.com/bigmlcom/histogram
>> >>>
>> >>>     -Andy
>> >>>
>> >>>
>> >>>     On 20 February 2013 22:50, Marty Kube <[email protected]> wrote:
>> >>>     > That's a winner...
>> >>>     > Out of all of the algorithms I've looked at the Ben-Haim/SPDT
>> >>>     > looks most likely.  In batch mode it uses one pass over the
>> >>>     > data set, it can be used in a streaming mode, and has constant
>> >>>     > space and time requirements.  That seems like the kind of
>> >>>     > scalable algorithm we're after.
>> >>>     > I'm in!
>> >>>     >
>> >>>     >
>> >>>     > On 02/20/2013 10:09 AM, Andy Twigg wrote:
>> >>>     >>
>> >>>     >> Alternatively, the algorithm described in [1] is more
>> >>>     >> straightforward, efficient, Hadoop-compatible (using only
>> >>>     >> mappers communicating to a master) and satisfies all our
>> >>>     >> requirements so far. I would like to take a pass at
>> >>>     >> implementing that, if anyone else is interested?
>> >>>     >>
>> >>>     >> [1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
>> >>>     >>
>> >>>     >>
>> >>>     >> On 20 February 2013 14:27, Andy Twigg <[email protected]> wrote:
>> >>>     >>>
>> >>>     >>> Why don't we start from
>> >>>     >>>
>> >>>     >>> https://github.com/ashenfad/hadooptree ?
>> >>>     >>>
>> >>>     >>> On 20 February 2013 13:25, Marty Kube
>> >>>     >>> <[email protected]> wrote:
>> >>>     >>>>
>> >>>     >>>> Hi Lorenz,
>> >>>     >>>>
>> >>>     >>>> Very interesting, that's what I was asking for when I
>> >>>     >>>> mentioned non-MR implementations :-)
>> >>>     >>>>
>> >>>     >>>> I have not looked at Spark before; interesting that it
>> >>>     >>>> uses Mesos for clustering.  I'll check it out.
>> >>>     >>>>
>> >>>     >>>>
>> >>>     >>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
>> >>>     >>>>>
>> >>>     >>>>> Hi Marty,
>> >>>     >>>>>
>> >>>     >>>>> I am currently working on a PLANET-like implementation on
>> >>>     >>>>> top of Spark: http://spark-project.org
>> >>>     >>>>>
>> >>>     >>>>> I think this framework is a nice fit for the problem.
>> >>>     >>>>> If the input data fits into the "total cluster memory",
>> >>>     >>>>> you benefit from the caching of the RDDs.
>> >>>     >>>>>
>> >>>     >>>>> regards,
>> >>>     >>>>>
>> >>>     >>>>> lorenz
>> >>>     >>>>>
>> >>>     >>>>>
>> >>>     >>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube
>> >>>     >>>>> <[email protected]> wrote:
>> >>>     >>>>>
>> >>>     >>>>>> You had mentioned other "resource management" platforms
>> >>>     >>>>>> like Giraph or Mesos.  I haven't looked at those yet.  I
>> >>>     >>>>>> guess I was thinking of other parallelization frameworks.
>> >>>     >>>>>>
>> >>>     >>>>>> It's interesting that the PLANET folks thought it was
>> >>>     >>>>>> really worthwhile working on top of MapReduce for all of
>> >>>     >>>>>> the resource management that is built in.
>> >>>     >>>>>>
>> >>>     >>>>>>
>> >>>     >>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>> >>>     >>>>>>>
>> >>>     >>>>>>> If non-MR means a map-only job with communicating
>> >>>     >>>>>>> mappers and a state store, I am down with that.
>> >>>     >>>>>>>
>> >>>     >>>>>>> What did you mean?
>> >>>     >>>>>>>
>> >>>     >>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube
>> >>>     >>>>>>> <[email protected]> wrote:
>> >>>     >>>>>>>
>> >>>     >>>>>>>> Right now I'd lean towards the PLANET model, or maybe
>> >>>     >>>>>>>> a non-MR implementation.  Anyone have a good idea for
>> >>>     >>>>>>>> a non-MR solution?
>> >>>     >>>>>>>>
>> >>>     >>>
>> >>>     >>>
>> >>>     >>> --
>> >>>     >>> Dr Andy Twigg
>> >>>     >>> Junior Research Fellow, St Johns College, Oxford
>> >>>     >>> Room 351, Department of Computer Science
>> >>>     >>> http://www.cs.ox.ac.uk/people/andy.twigg/
>> >>>     >>> [email protected] <mailto:[email protected]> |
>> >>>     +447799647538 <tel:%2B447799647538>
>> >>>
>> >>>     >>
>> >>>     >>
>> >>>     >>
>> >>>
>> >>>     >
>> >>>     >
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>
>> >
>> >
>> >
>>
>>
>>
>>



--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538
