Glad you asked, because I've been wrestling with the same question myself
while creating a Text->Vector converter, where I need to iterate over the
same data repeatedly, converting it to vectors using one chunk of the
dictionary at a time. If I had the option of running multiple passes over
the data, a single MapReduce job would have sufficed; instead I have to do
one pass over the data for every chunk of the dictionary that fits in
memory. True, I could run n sequential jobs using an HDFS client on
different servers, but the network data transfer wasn't worth it.
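
For what it's worth, the driver boils down to something like the sketch
below: a plain loop that submits one Hadoop job per dictionary chunk and
ships the chunk to the mappers via the DistributedCache. Class names and
paths are made up for illustration; this isn't the actual Mahout code.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChunkedVectorizationDriver {

  /** Hypothetical mapper: a real one would load its dictionary chunk
   *  from the DistributedCache in setup() and emit partial vectors for
   *  the terms that chunk covers. A pass-through body is shown only so
   *  the sketch compiles. */
  public static class PartialVectorMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(key.toString()), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    int numChunks = Integer.parseInt(args[0]); // dictionary pre-split into N chunks

    // One full MapReduce pass over the input per dictionary chunk --
    // exactly the n passes I'd like to collapse into one.
    for (int chunk = 0; chunk < numChunks; chunk++) {
      Job job = new Job(conf, "vectorize-chunk-" + chunk);
      job.setJarByClass(ChunkedVectorizationDriver.class);
      job.setMapperClass(PartialVectorMapper.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);

      // Ship the current dictionary chunk to every mapper (paths made up).
      DistributedCache.addCacheFile(
          new URI("/dict/chunk-" + chunk), job.getConfiguration());

      FileInputFormat.setInputPaths(job, new Path("/input/text"));
      FileOutputFormat.setOutputPath(job, new Path("/output/partial-" + chunk));
      job.waitForCompletion(true);
    }
    // The partial vectors under /output/partial-* still need a merge pass.
  }
}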

Robin

On Fri, Jan 29, 2010 at 12:30 AM, Markus Weimer
<[email protected]> wrote:

> Hi,
>
> I have a question about Hadoop which someone in Mahout has most
> likely solved before:
>
> Many online ML algorithms require multiple passes over the data for
> best performance. When putting these algorithms on Hadoop, one would
> want to run the code close to the data (same machine/rack). Mappers
> offer this data-local execution but provide no means of running
> multiple passes over the data. Of course, one could run the code
> outside of the Hadoop MapReduce framework as an HDFS client, but that
> forfeits the data-locality advantage, in addition to not being
> scheduled through the Hadoop schedulers.
>
> How is this solved in mahout?
>
> Thanks for any pointer,
>
> Markus
>



-- 
------
Robin Anil
Blog: http://techdigger.wordpress.com
-------

Mahout in Action - Mammoth Scale machine learning
Read Chapter 1 - It's Frrreeee
http://www.manning.com/owen

Try out Swipeball for iPhone
http://itunes.com/apps/swipeball
