Hi All,

We have a problem at hand that we would like to solve using distributed and parallel processing.
*Problem context*: We have a Map (Entity, Value). An entity can have a parent, which in turn has its own parent, and so on until we reach the head. We have to traverse this tree and perform a calculation at every step using the values from the Map. The final output is again a map containing the aggregated results of the computation (Entity, Computed Value).

The tree can be quite deep, and the Map has a huge number of entries to process before we arrive at the final result. Processing them sequentially takes quite a long time, so we were thinking of using Map-Reduce to split the computation across multiple nodes in a Hadoop cluster and then aggregate the results into the final output.

After a quick read of the documentation and the samples, I see that a job's input and output go through implementations of InputFormat and OutputFormat, and most of the available implementations appear to be either file- or database-based. Is there an input/output format that reads/writes directly from/to memory, or do I need to provide my own custom InputFormat/OutputFormat and RecordReader/RecordWriter implementations for this purpose?

Based on your experience, do you think Map-Reduce is an appropriate platform for this kind of scenario, or should we consider it mainly for huge file-based data?

Best Regards
Narinder
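P.S. To make the problem concrete, here is a minimal, self-contained sketch of the sequential computation we want to distribute. The entity names and the plain summation are hypothetical stand-ins; the real per-step calculation is more involved.

```java
import java.util.HashMap;
import java.util.Map;

public class ChainAggregation {

    // Walk from an entity up through its parent chain to the head,
    // combining the Map values along the way. Summation is only a
    // stand-in for the real per-step calculation.
    static long aggregate(String entity,
                          Map<String, String> parent,
                          Map<String, Long> value) {
        long total = 0;
        for (String e = entity; e != null; e = parent.get(e)) {
            total += value.get(e);
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical three-level chain: leaf -> mid -> head
        Map<String, String> parent = new HashMap<>();
        parent.put("leaf", "mid");
        parent.put("mid", "head");

        Map<String, Long> value = new HashMap<>();
        value.put("leaf", 1L);
        value.put("mid", 2L);
        value.put("head", 4L);

        // Sequential pass over every entity -- this is the loop we
        // would like to split across nodes with Map-Reduce, since each
        // entity's chain walk is independent of the others.
        Map<String, Long> result = new HashMap<>();
        for (String e : value.keySet()) {
            result.put(e, aggregate(e, parent, value));
        }
        System.out.println(result); // mapping of each entity to its chain total
    }
}
```

Since each entity's traversal only reads shared state, the per-entity calls are embarrassingly parallel; the open question for us is how to feed the in-memory Map into the framework.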