That's a winner...
Of all the algorithms I've looked at, Ben-Haim/SPDT looks like the most likely candidate. In batch mode it makes one pass over the data set, it can also be used in a streaming mode, and it has constant space and time requirements. That seems like the kind of scalable algorithm we're after.
I'm in!

On 02/20/2013 10:09 AM, Andy Twigg wrote:
Alternatively, the algorithm described in [1] is more straightforward,
efficient, Hadoop-compatible (using only mappers communicating to a
master) and satisfies all our requirements so far. I would like to
take a pass at implementing that, if anyone else is interested?

[1] http://jmlr.csail.mit.edu/papers/volume11/ben-haim10a/ben-haim10a.pdf
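
For readers who have not opened [1]: the paper's core data structure is a
fixed-size streaming histogram that supports an update step and a merge step.
Below is a minimal Python sketch of those two steps, written from the paper's
description rather than taken from any existing code. The class and method
names (StreamingHistogram, update, merge, _shrink) are hypothetical and are
not part of hadooptree or Mahout.

```python
import bisect


class StreamingHistogram:
    """Keeps at most max_bins [centroid, count] pairs, sorted by centroid."""

    def __init__(self, max_bins):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count]

    def update(self, value):
        """Add one point, then shrink back to max_bins if needed."""
        centroids = [c for c, _ in self.bins]
        i = bisect.bisect_left(centroids, value)
        if i < len(self.bins) and self.bins[i][0] == value:
            self.bins[i][1] += 1          # exact match: bump the count
        else:
            self.bins.insert(i, [value, 1])
        self._shrink()

    def merge(self, other):
        """Combine two histograms, e.g. ones built by different mappers."""
        merged = StreamingHistogram(self.max_bins)
        merged.bins = sorted([c, n] for c, n in self.bins + other.bins)
        merged._shrink()
        return merged

    def _shrink(self):
        """Repeatedly fuse the two closest adjacent bins until the size fits."""
        while len(self.bins) > self.max_bins:
            i = min(range(len(self.bins) - 1),
                    key=lambda k: self.bins[k + 1][0] - self.bins[k][0])
            (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
            total = n1 + n2
            # Replace the pair with one bin at their count-weighted mean.
            self.bins[i:i + 2] = [[(c1 * n1 + c2 * n2) / total, total]]
```

In the mappers-plus-master setup described above, each mapper would presumably
call update() on its share of the data and the master would merge() the
results, so memory per histogram stays bounded by max_bins regardless of
input size.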


On 20 February 2013 14:27, Andy Twigg <[email protected]> wrote:
Why don't we start from

https://github.com/ashenfad/hadooptree ?

On 20 February 2013 13:25, Marty Kube <[email protected]> wrote:
Hi Lorenz,

Very interesting, that's what I was asking for when I mentioned non-MR
implementations :-)

I have not looked at Spark before; interesting that it uses Mesos for
clustering. I'll check it out.


On 02/19/2013 09:32 PM, Lorenz Knies wrote:
Hi Marty,

I am currently working on a PLANET-like implementation on top of Spark:
http://spark-project.org

I think this framework is a nice fit for the problem.
If the input data fits into the "total cluster memory", you benefit from
the caching of the RDDs.

regards,

lorenz


On Feb 20, 2013, at 2:42 AM, Marty Kube <[email protected]>
wrote:

You had mentioned other "resource management" platforms like Giraph or
Mesos. I haven't looked at those yet. I guess I was thinking of other
parallelization frameworks.

It's interesting that the PLANET folks thought it was really worthwhile
working on top of MapReduce for all of the resource management that is
built in.


On 02/19/2013 08:04 PM, Ted Dunning wrote:
If non-MR means a map-only job with communicating mappers and a state
store, I am down with that.

What did you mean?

On Tue, Feb 19, 2013 at 5:53 PM, Marty Kube <
[email protected]> wrote:

Right now I'd lean towards the PLANET model, or maybe a non-MR
implementation.  Anyone have a good idea for a non-MR solution?



--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538

