Even better, there is already a good implementation of the histograms: https://github.com/bigmlcom/histogram
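
For reference, here is a minimal sketch of the kind of streaming histogram the Ben-Haim paper ([1] below) describes: keep a fixed number of (centroid, count) bins, add each new point as its own bin, and merge the two closest bins whenever the limit is exceeded. Two histograms can be combined the same way, which is what would let mappers build partial histograms and a master merge them. The names used here (StreamingHistogram, update, merge, max_bins) are made up for illustration and are not the API of the bigmlcom/histogram library; treat this as a sketch under those assumptions rather than a reference implementation.

```python
import bisect


class StreamingHistogram:
    """Fixed-size histogram of (centroid, count) bins, sorted by centroid.

    Illustrative sketch only; names and defaults are assumptions.
    """

    def __init__(self, max_bins=32):
        self.max_bins = max_bins
        self.centroids = []  # sorted bin centres
        self.counts = []     # counts[i] belongs to centroids[i]

    def _insert(self, x, count):
        # Put x into its sorted position, or bump an existing bin.
        i = bisect.bisect_left(self.centroids, x)
        if i < len(self.centroids) and self.centroids[i] == x:
            self.counts[i] += count
        else:
            self.centroids.insert(i, x)
            self.counts.insert(i, count)

    def _shrink(self):
        # Merge the closest adjacent pair until we are back within max_bins.
        while len(self.centroids) > self.max_bins:
            i = min(range(len(self.centroids) - 1),
                    key=lambda j: self.centroids[j + 1] - self.centroids[j])
            n = self.counts[i] + self.counts[i + 1]
            c = (self.centroids[i] * self.counts[i]
                 + self.centroids[i + 1] * self.counts[i + 1]) / n
            self.centroids[i:i + 2] = [c]  # count-weighted average centre
            self.counts[i:i + 2] = [n]

    def update(self, x, count=1):
        """Add one observation (single pass, bounded memory)."""
        self._insert(x, count)
        self._shrink()

    def merge(self, other):
        """Fold another histogram's bins into this one, then shrink."""
        for c, n in list(zip(other.centroids, other.counts)):
            self._insert(c, n)
        self._shrink()


# Usage: two "mappers" summarise their own data, a "master" merges them.
left, right = StreamingHistogram(max_bins=3), StreamingHistogram(max_bins=3)
for value in [1, 2]:
    left.update(value)
for value in [10, 11]:
    right.update(value)
left.merge(right)
print(list(zip(left.centroids, left.counts)))
```

The linear scan for the closest pair keeps the sketch short; with a small, fixed max_bins the cost per update stays bounded, which matches the constant space and time point made in the quoted message.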
-Andy

On 20 February 2013 22:50, Marty Kube <[email protected]> wrote:
> That's a winner...
> Out of all of the algorithms I've looked at the Ben-Haim/SPDT looks most
> likely. In batch mode it uses one pass over the data set, it can be used in
> a streaming mode, and has constant space and time requirements. That seems
> like the kind of scalable algorithm we're after.
> I'm in!
>
>
> On 02/20/2013 10:09 AM, Andy Twigg wrote:
>>
>> Alternatively, the algorithm described in [1] is more straightforward,
>> efficient, Hadoop-compatible (using only mappers communicating to a
>> master) and satisfies all our requirements so far. I would like to
>> take a pass at implementing that, if anyone else is interested?
>>
>> [1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
>>
>>
>> On 20 February 2013 14:27, Andy Twigg <[email protected]> wrote:
>>>
>>> Why don't we start from
>>>
>>> https://github.com/ashenfad/hadooptree ?
>>>
>>> On 20 February 2013 13:25, Marty Kube <[email protected]>
>>> wrote:
>>>>
>>>> Hi Lorenz,
>>>>
>>>> Very interesting, that's what I was asking for when I mentioned non-MR
>>>> implementations :-)
>>>>
>>>> I have not looked at Spark before, interesting that it uses Mesos for
>>>> clustering. I'll check it out.
>>>>
>>>>
>>>> On 02/19/2013 09:32 PM, Lorenz Knies wrote:
>>>>>
>>>>> Hi Marty,
>>>>>
>>>>> I am currently working on a PLANET-like implementation on top of Spark:
>>>>> http://spark-project.org
>>>>>
>>>>> I think this framework is a nice fit for the problem.
>>>>> If the input data fits into the "total cluster memory" you benefit from
>>>>> the caching of the RDDs.
>>>>>
>>>>> regards,
>>>>>
>>>>> lorenz
>>>>>
>>>>>
>>>>> On Feb 20, 2013, at 2:42 AM, Marty Kube <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> You had mentioned other "resource management" platforms like Giraph or
>>>>>> Mesos. I haven't looked at those yet. I guess I was thinking of other
>>>>>> parallelization frameworks.
>>>>>>
>>>>>> It's interesting that the PLANET folks thought it was really
>>>>>> worthwhile working on top of MapReduce for all of the resource
>>>>>> management that is built in.
>>>>>>
>>>>>>
>>>>>> On 02/19/2013 08:04 PM, Ted Dunning wrote:
>>>>>>>
>>>>>>> If non-MR means map-only job with communicating mappers and a state
>>>>>>> store, I am down with that.
>>>>>>>
>>>>>>> What did you mean?
>>>>>>>
>>>>>>> On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Right now I'd lean towards the PLANET model, or maybe a non-MR
>>>>>>>> implementation. Anyone have a good idea for a non-MR solution?
>>>>>>>>

-- 
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538
