Re: Out-of-core random forest implementation

Andy Twigg Wed, 20 Feb 2013 16:03:03 -0800

Even better, there is already a good implementation of the histograms:
https://github.com/bigmlcom/histogram
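
A minimal Python sketch of the streaming histogram described in [1] (linked further down the thread) may help show why it fits the mapper/master setup. This is an illustration only, not code from the bigmlcom/histogram library: the class name StreamingHistogram and its methods are hypothetical, and only the basic update and merge procedures are shown, with no split-point selection.

```python
from bisect import bisect_left


class StreamingHistogram:
    """Fixed-size approximate histogram (hypothetical sketch, not a library API)."""

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count]

    def update(self, x):
        """Add one point, then shrink back to at most max_bins bins."""
        centroids = [b[0] for b in self.bins]
        i = bisect_left(centroids, x)
        if i < len(self.bins) and self.bins[i][0] == x:
            self.bins[i][1] += 1
        else:
            self.bins.insert(i, [x, 1])
        self._shrink()

    def merge(self, other):
        """Return a new histogram summarising both inputs (e.g. two mappers)."""
        merged = StreamingHistogram(self.max_bins)
        merged.bins = sorted([list(b) for b in self.bins + other.bins])
        merged._shrink()
        return merged

    def total(self):
        """Number of points summarised so far."""
        return sum(m for _, m in self.bins)

    def _shrink(self):
        # Repeatedly replace the two closest adjacent bins by their
        # count-weighted average until the size bound holds.
        while len(self.bins) > self.max_bins:
            i = min(
                range(len(self.bins) - 1),
                key=lambda j: self.bins[j + 1][0] - self.bins[j][0],
            )
            (p1, m1), (p2, m2) = self.bins[i], self.bins[i + 1]
            self.bins[i:i + 2] = [[(p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2]]
```

In this sketch each mapper would build one histogram per feature over its share of the data, and the master would combine them with merge(). Memory stays bounded by max_bins no matter how many points are seen.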


-Andy


On 20 February 2013 22:50, Marty Kube <[email protected]> wrote:
> That's a winner...
> Out of all of the algorithms I've looked at, the Ben-Haim/SPDT one looks most
> promising.  In batch mode it uses one pass over the data set, it can be used in
> a streaming mode, and has constant space and time requirements.  That seems
> like the kind of scalable algorithm we're after.
> I'm in!
>
>
> On 02/20/2013 10:09 AM, Andy Twigg wrote:
>>
>> Alternatively, the algorithm described in [1] is more straightforward,
>> efficient, hadoop-compatible (using only mappers communicating to a
>> master) and satisfies all our requirements so far. I would like to
>> take a pass at implementing that, if anyone else is interested?
>>
>> [1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
>>
>>
>> On 20 February 2013 14:27, Andy Twigg <[email protected]> wrote:
>>>
>>> Why don't we start from
>>>
>>> https://github.com/ashenfad/hadooptree ?
>>>
>>> On 20 February 2013 13:25, Marty Kube <[email protected]> wrote:
>>>>
>>>> Hi Lorenz,
>>>>
>>>> Very interesting, that's what I was asking for when I mentioned non-MR
>>>> implementations :-)
>>>>
>>>> I have not looked at Spark before; interesting that it uses Mesos for
>>>> clustering.  I'll check it out.
>>>>
>>>>
>>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
>>>>>
>>>>> Hi Marty,
>>>>>
>>>>> I am currently working on a PLANET-like implementation on top of Spark:
>>>>> http://spark-project.org
>>>>>
>>>>> I think this framework is a nice fit for the problem.
>>>>> If the input data fits into the "total cluster memory", you benefit from
>>>>> the caching of the RDDs.
>>>>>
>>>>> regards,
>>>>>
>>>>> lorenz
>>>>>
>>>>>
>>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube <[email protected]> wrote:
>>>>>
>>>>>> You had mentioned other "resource management" platforms like Giraph or
>>>>>> Mesos.  I haven't looked at those yet.  I guess I was thinking of other
>>>>>> parallelization frameworks.
>>>>>>
>>>>>> It's interesting that the PLANET folks thought it was really worthwhile
>>>>>> working on top of MapReduce for all of the resource management that is
>>>>>> built in.
>>>>>>
>>>>>>
>>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>>>>>>>
>>>>>>> If non-MR means a map-only job with communicating mappers and a state
>>>>>>> store, I am down with that.
>>>>>>>
>>>>>>> What did you mean?
>>>>>>>
>>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <[email protected]> wrote:
>>>>>>>
>>>>>>>> Right now I'd lean towards the PLANET model, or maybe a non-MR
>>>>>>>> implementation.  Anyone have a good idea for a non-MR solution?
>>>>>>>>
>>>
>>>
>>> --
>>> Dr Andy Twigg
>>> Junior Research Fellow, St Johns College, Oxford
>>> Room 351, Department of Computer Science
>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>> [email protected] | +447799647538
>>
>>
>>
>
>



--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538
