Hi All,

We have a problem at hand which we would like to solve using distributed and
parallel processing.

*Problem context*: We have a Map of (Entity, Value) pairs. An entity can have
a parent, which in turn has its own parent, and so on until we reach the head
of the tree. We have to traverse this tree and do some calculations at every
step using the values from the Map. The final output will again be a map
containing the aggregated results of the computation, (Entity, Computed
Value). The tree can be quite deep, and the Map has a huge number of entries
to process before we arrive at the final result. Processing them sequentially
takes quite a long time. We were thinking of using Map-Reduce to split the
computation across multiple nodes in a Hadoop cluster and then aggregate the
results to get the final output.
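
For reference, the sequential version we are trying to speed up looks roughly
like the sketch below (Entity, getParent() and compute() are simplified
placeholders for our real types and logic):

import java.util.HashMap;
import java.util.Map;

public class SequentialAggregator {

    // Walk every entry's parent chain up to the head and fold the
    // ancestors' values into that entry's computed result.
    public Map<Entity, Double> aggregate(Map<Entity, Double> values) {
        Map<Entity, Double> results = new HashMap<Entity, Double>();
        for (Map.Entry<Entity, Double> entry : values.entrySet()) {
            double computed = entry.getValue();
            for (Entity current = entry.getKey().getParent();
                 current != null; current = current.getParent()) {
                computed = compute(computed, values.get(current));
            }
            results.put(entry.getKey(), computed);
        }
        return results;
    }

    // Placeholder for the real per-step calculation.
    private double compute(double soFar, Double parentValue) {
        return soFar + (parentValue == null ? 0.0 : parentValue);
    }
}

// Simplified entity with a link to its parent (null at the head).
class Entity {
    private final Entity parent;

    Entity(Entity parent) {
        this.parent = parent;
    }

    Entity getParent() {
        return parent;
    }
}

Each entry's result depends only on its own parent chain, which is why we are
hoping the per-entry work can be spread across nodes.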

From a quick read of the documentation and the samples, I see that the Mapper
and Reducer work with implementations of InputFormat and OutputFormat
respectively. Most of the existing implementations appear to be either file
based or DB based. Is there an input/output format that reads from and writes
to memory directly, or do I need to provide my own custom
InputFormat/OutputFormat and RecordReader/RecordWriter implementations for
this purpose?
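
To make the question concrete, this is roughly the skeleton I think we would
have to fill in ourselves if no such format exists (InMemoryInputFormat is
just a placeholder name, and the TODOs mark the parts I am unsure about):

import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class InMemoryInputFormat extends InputFormat<Text, DoubleWritable> {

    @Override
    public List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException {
        // TODO: partition the in-memory (Entity, Value) map into splits
        // that can be serialised and shipped to the task nodes.
        return Collections.<InputSplit>emptyList();
    }

    @Override
    public RecordReader<Text, DoubleWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        // TODO: return a RecordReader that iterates over one split's
        // (entity, value) entries and feeds them to the Mapper.
        return null;
    }
}

Presumably a similar skeleton would be needed on the output side (an
OutputFormat plus a RecordWriter) to collect the computed values back into a
map.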

Based on your experience, do you think Map-Reduce is an appropriate platform
for this kind of scenario, or should we think of it more for huge file-based
data only?

Best Regards
Narinder
