Hadoop for computationally intensive tasks (no data)

2008-09-04 Thread Tenaali Ram
Hi,

I am new to Hadoop. What I have understood so far is that Hadoop is used to
process huge amounts of data using the map-reduce paradigm.

I am working on a problem where I need to perform a large number of
computations. Most of the computations can be done independently of each
other (so I think each mapper can handle one or more of them), but there is
no data involved. It's just a number-crunching job. Is it suited for Hadoop?

Has anyone used Hadoop purely for number crunching? If so, how should I
define the input for the job and ensure that the computations are
distributed to all nodes in the grid?

Thanks,
Tenaali


Re: Hadoop for computationally intensive tasks (no data)

2008-09-04 Thread Miles Osborne
Have a look at the various machine learning applications of MapReduce:
they do lots of computation, and there the data corresponds to
intermediate values being used to update counts etc.

bedtime reading:

Mahout (machine learning under Hadoop):

http://lucene.apache.org/mahout/

some machine learning papers:

Fully Distributed EM for Very Large Datasets.
Jason Wolfe, Aria Haghighi and Dan Klein

www.cs.berkeley.edu/~aria42/pubs/icml08-distributedem.pdf

another one:

www.cs.stanford.edu/people/ang//papers/nips06-mapreducemulticore.pdf

Miles


Re: Hadoop for computationally intensive tasks (no data)

2008-09-04 Thread Owen O'Malley
On Thu, Sep 4, 2008 at 10:07 AM, Tenaali Ram <[EMAIL PROTECTED]> wrote:


> Has anyone used Hadoop purely for number crunching? If so, how should I
> define the input for the job and ensure that the computations are
> distributed to all nodes in the grid?


Yeah, it is pretty easy to do actually. If you really just have distributed
tasks, you can set the number of reduces to 0. The output of each map will
be given straight to the OutputFormat, which typically writes it into HDFS.
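
As a minimal sketch of such a map-only job (assumed class names and a
made-up squaring "computation", using the classic org.apache.hadoop.mapred
API of the time, not code from any Hadoop release):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MapOnlyJob {

  // Hypothetical mapper: each input line is one independent computation.
  public static class CrunchMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter)
        throws IOException {
      // Stand-in number crunching: read a number, emit its square.
      long n = Long.parseLong(line.toString().trim());
      out.collect(new Text(Long.toString(n)), new Text(Long.toString(n * n)));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlyJob.class);
    conf.setJobName("number-crunching");

    // The "input" is just a file listing the computations, one per line.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(CrunchMapper.class);
    conf.setNumReduceTasks(0); // map-only: no shuffle, no reduce phase

    // With zero reduces, these describe the map output handed to the
    // OutputFormat.
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    JobClient.runJob(conf);
  }
}

Spreading those lines across the nodes is then up to the framework's input
splitting. Since splits follow file blocks, a tiny input file may become a
single map; splitting the input across several files (or raising the map
count hint) spreads the work.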

I wrote the Dancing Links example to do state space exploration in
map/reduce. Although a backtracking algorithm seems like an unlikely match,
it worked well. I generated the prefixes up to a given level and wrote them
out one per line. The maps each get a set of lines and explore the entire
tree downward from the prefixes they are given. There is a single reduce
that collects the answers. In 9 hours on a very small cluster, it was able
to solve a problem that Knuth had given up on as taking too long.
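
A rough sketch of the mapper side of that pattern (names assumed here, not
the actual example code):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PrefixMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text prefixLine,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // Backtrack through the entire subtree rooted at this prefix and emit
    // every solution found. exploreSubtree() is a stand-in for the
    // application's search (e.g. Dancing Links).
    for (String solution : exploreSubtree(prefixLine.toString())) {
      out.collect(new Text("solution"), new Text(solution));
      reporter.progress(); // keep a long-running search from timing out
    }
  }

  private Iterable<String> exploreSubtree(String prefix) {
    // Application-specific backtracking search; omitted here.
    return java.util.Collections.emptyList();
  }
}

A single reduce (conf.setNumReduceTasks(1)) then funnels everything the
maps emit through one reducer, which is how the answers get collected in
one place.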

If you look at the Powered By Hadoop page, you'll see more examples.

-- Owen


Re: Hadoop for computationally intensive tasks (no data)

2008-09-05 Thread Steve Loughran

Tenaali Ram wrote:

Hi,

I am new to Hadoop. What I have understood so far is that Hadoop is used to
process huge amounts of data using the map-reduce paradigm.

I am working on a problem where I need to perform a large number of
computations. Most of the computations can be done independently of each
other (so I think each mapper can handle one or more of them), but there is
no data involved. It's just a number-crunching job. Is it suited for Hadoop?



Well, you can have the MR jobs stick data out into the filesystem. So
even though they don't start off located near any data, they end up
running where the output needs to go.



Has anyone used Hadoop purely for number crunching? If so, how should I
define the input for the job and ensure that the computations are
distributed to all nodes in the grid?


The current scheduler moves work to near where the data sources are,
going for the same machine or the same rack, looking for a task tracker
with a spare "slot". There isn't yet any scheduler that worries more about
pure computation, where you need to consider current CPU load, memory
consumption and power budget: whether your rack is running so hot it's at
risk of being shut down, or at least throttled back. That's the kind of
scheduling where the e-science and grid toolkit people have the edge.


But now that the version of Hadoop in SVN has support for plug-in
scheduling, someone has the opportunity to write a new scheduler, one
that focuses on pure computation...