Hi all, I'm trying to design an MR job for processing "walks-on-graph" data from a database. The idea is that I have a list of random walks on a graph (whose structure is unknown).
I have two tables, "walk ids" and "hops":
- The first holds the list of random-walk ids, one row per walk; each id is unique and increasing.
- The second holds, for each walk (identified by that uid), the list of hops (vertices) traversed in the walk, one hop per row.

The two tables are in a one-to-many relationship, with the walk uid used as a foreign key in the hops table. This means walks may be split between nodes, but the hops of a single walk must not be. How would you suggest handling this structure? Is it even possible with DBInputFormat?

Second, assuming such a split is possible in an MR job, I would like several reducers to operate on the data during a single read (I want to avoid reading the input multiple times, since that can take a long time). For example, one reducer should build the actual graph: (Source Node, Dest Node) --> (num_walks). Another should produce a length analysis: (Origin Node, Final Node) --> distance, and so on.

Any comments and thoughts will help! Thanks.

--
View this message in context: http://www.nabble.com/Job-design-question-tp25076132p25076132.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
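P.S. To make the first reducer concrete, here is the per-walk logic I have in mind, sketched in plain Java outside the Hadoop API (the class and method names WalkEdges, edges, and edgeCounts are just placeholders, not anything from Hadoop): given the ordered hops of one walk, emit its consecutive (src, dst) edges, then sum each edge's occurrences over all walks.

```java
import java.util.*;

// Sketch only: the map-side step turns one walk's ordered hops into edges;
// the reduce-side step counts how many walks traversed each edge.
public class WalkEdges {

    // One walk's hops -> list of "src\tdst" edge keys (what a mapper would emit).
    static List<String> edges(List<String> hops) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < hops.size(); i++) {
            out.add(hops.get(i) + "\t" + hops.get(i + 1));
        }
        return out;
    }

    // Reduce step: sum occurrences of each edge over all walks,
    // giving (Source Node, Dest Node) -> num_walks.
    static Map<String, Integer> edgeCounts(List<List<String>> walks) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> walk : walks) {
            for (String e : edges(walk)) {
                counts.merge(e, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two walks: A->B->C and A->B; edge A->B appears in both.
        List<List<String>> walks = Arrays.asList(
            Arrays.asList("A", "B", "C"),
            Arrays.asList("A", "B"));
        System.out.println(edgeCounts(walks));
    }
}
```

This only works if all hops of one walk arrive together and in order, which is exactly why I need the input split to keep each walk's hop rows on one node.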