Tenaali Ram wrote:
Hi,

I am new to Hadoop. What I have understood so far is that Hadoop is used to
process huge amounts of data using the map-reduce paradigm.

I am working on a problem where I need to perform a large number of
computations; most of them can be done independently of each other (so
I think each mapper can handle one or more such computations). However,
there is no input data involved. It's just a number-crunching job. Is it
suited for Hadoop?


Well, you can have the MR jobs stick their data out into the filesystem. So even though they don't start off located near any data, they end up running where the output needs to go.

Has anyone used Hadoop for pure number crunching? If yes, how should I
define the input for the job and ensure that the computations are distributed
to all nodes in the grid?
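
One common trick is a map-only job whose "input" is just a small task list: one line per independent computation. Here's a minimal sketch, assuming the 0.19-era org.apache.hadoop.mapred API -- NLineInputFormat hands each line of the file to its own map task, and with zero reduces the results land straight in the output directory on HDFS. The class names and the compute() body are placeholders for your real work:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class CrunchJob {

  // Each map() call gets one line of the task file; that line is the
  // parameter for one independent computation.
  public static class CrunchMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, DoubleWritable> {
    public void map(LongWritable offset, Text taskSpec,
                    OutputCollector<Text, DoubleWritable> out,
                    Reporter reporter) throws IOException {
      long param = Long.parseLong(taskSpec.toString().trim());
      double result = compute(param);    // the real number crunching goes here
      out.collect(taskSpec, new DoubleWritable(result));
    }

    private double compute(long param) {
      return Math.sqrt(param);           // placeholder computation
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(CrunchJob.class);
    conf.setJobName("number-crunch");
    conf.setMapperClass(CrunchMapper.class);
    conf.setInputFormat(NLineInputFormat.class);  // one map task per input line
    conf.setNumReduceTasks(0);                    // map-only: no shuffle, no reduce
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));  // tiny task-list file
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); // results on HDFS
    JobClient.runJob(conf);
  }
}

The input file costs almost nothing to store; it exists only to fan the work out across the cluster, so the job is compute-bound rather than data-bound, and every task's results get written into part files under the output directory -- which is the "stick data out into the filesystem" point above. NLineInputFormat defaults to one line per split; if you want to batch several computations per mapper, look at the mapred.line.input.format.linespermap property.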

The current scheduler moves work close to where the data sources are, going for the same machine or same rack, looking for a task tracker with a spare "slot". There isn't yet any scheduler that worries more about pure computation, where you need to consider current CPU load, memory consumption and power budget -- whether your rack is running so hot it's at risk of being shut down, or at least throttled back. That's the kind of scheduling where the e-science and grid toolkit people have the edge.

But now that the version of Hadoop in SVN has support for plug-in scheduling, someone has the opportunity to write a new scheduler, one that focuses on pure computation...
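
For reference, wiring in an alternative scheduler is just a config switch once one exists -- something like this in hadoop-site.xml (the property name is what the 0.19-era code reads; FairScheduler here merely stands in for whatever scheduler class you plug in):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>

The JobTracker instantiates whatever TaskScheduler subclass that property names, so a compute-centric scheduler would drop in the same way.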

