Hi, I have a question about Hadoop which someone in Mahout has most likely solved before:
Many online ML algorithms need multiple passes over the data for best performance. When putting such an algorithm on Hadoop, one would want to run the code close to the data (same machine, or at least same rack). Mappers offer this data-local execution, but they provide no means of making multiple passes over the data. Of course, one could run the code outside the Hadoop MapReduce framework as a plain HDFS client, but that forfeits the data-locality advantage, and the work is not scheduled through the Hadoop schedulers either. How is this solved in Mahout?
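
To make the question concrete: the only data-local approach I can see is a driver that chains one full map-only job per pass, roughly like the sketch below. This is just my own illustration, not anything from Mahout; TrainingMapper, the model/pass-N paths, and the "model.path" property are made-up names.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {

  // Hypothetical map-only training step: it would load the model written
  // by the previous pass in setup() (via the "model.path" property),
  // apply online updates over its data-local input split, and write the
  // updated model out in cleanup().
  public static class TrainingMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // one online update of the model per input record would go here
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);            // training data on HDFS
    int numPasses = Integer.parseInt(args[1]); // number of passes wanted

    for (int pass = 0; pass < numPasses; pass++) {
      if (pass > 0) {
        // hand the model from the previous pass to this pass's mappers
        conf.set("model.path", "model/pass-" + (pass - 1));
      }
      Job job = new Job(conf, "training pass " + pass);
      job.setJarByClass(IterativeDriver.class);
      job.setMapperClass(TrainingMapper.class);
      job.setNumReduceTasks(0); // map-only: tasks stay data-local
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(Text.class);
      FileInputFormat.addInputPath(job, input);
      FileOutputFormat.setOutputPath(job, new Path("model/pass-" + pass));
      if (!job.waitForCompletion(true)) {
        System.exit(1); // abort the remaining passes on failure
      }
    }
  }
}

That pays full job-submission overhead on every pass, which is why I suspect there is something smarter.

Thanks for any pointers,
Markus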
