Answering some of your email out of order.

On Mon, Mar 23, 2009 at 10:00 PM, Xuan Yang <sailingw...@gmail.com> wrote:
> These days I am doing some research work on SimRank, which is a model
> measuring "similarity" of objects.

Great.

> I think it would be great to solve these problems and implement a
> map-reduce version of the algorithm for SimRank. I intend to implement
> this as my Summer of Code project. Would you be interested in this?

This sounds like a fine project.

> And can I get some advice from you?

I am sure you can get lots of advice from this group, both on the algorithm and suggestions on how to code it into a program.

Back to your detailed suggestion. Here are some of my first thoughts:

> 1, the directed graph could be saved in the form of an edge list in
> hbase. And the result Sn(a,b) could also be saved in hbase as a matrix.

HBase or flat files would be a fine way to store this, and an edge list is an excellent way to store the data. The output matrix should probably be stored as triples containing row, column and value.

> 2, We can distribute all the n^2 pairs into the map nodes to calculate
> the SimRank value of the next iteration.

Hopefully you can keep this sparse. If you cannot, then the algorithm may not be suitable for use on large data no matter how you parallelize it.

Skipping item 3 because I don't have time right now to analyze it in detail...

> 4, besides, there are other optimization methods, such as thresholds,
> that could be used in Map nodes and Reduce nodes.

Thresholding is likely to be a critical step in order to preserve sparsity (a rough sketch of a thresholding reducer appears a bit further down in this message).

> 1, It is true that map-reduce could make the computation at each node
> easier. Yet if the volume of data is very huge, the transport latency of
> the data will become more and more serious.

I think that you will find, with map-reduce in general and with Hadoop more specifically, that as the problem gets larger, the discipline that the map-reduce formulation imposes on your data transport patterns actually allows better scaling than you would expect. Of course, if your data size scales with n^2, you are in trouble no matter how you parallelize.

A good example came a year or so ago from a machine translation group at a university in Maryland. They had a large program that attempted to do cooccurrence counting on text corpora using a single multi-core machine. They started to convert this to Hadoop using the simplest possible representation for the cooccurrence matrix (index, value triples) and expected that the redundancy of this representation would lead to very bad results. Since they expected bad results, they also expected to do lots of optimization on the map-reduce version. Also, since the original program was largely memory based, they expected that the communication overhead of Hadoop would severely hurt performance. The actual result was that an 18 hour program run on 70 machines took 20 minutes. This is nearly perfect speedup over the sequential version. The moral is that highly sequential transport of large blocks of information can be incredibly efficient.

> So, methods to reduce IO would be very helpful.

My first recommendation on this is to wait. Get an implementation first, then optimize. The problems you have will not be the problems you expect.

> 2, SimRank is to compute the similarity between all the nodes. If we map
> a group of nodes {A, B, C} into one map node, and {D, E, F} into another
> map node, the computation inside set {A, B, C} will be easy, and so will
> the computation inside set {D, E, F}. But when we want to compute the
> SimRank between A and D, it will not be very convenient.

Map nodes should never communicate with each other. That is the purpose of the reduce layer.
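To make that concrete, here is a very rough, untested sketch of what such a reducer might look like, written against the Hadoop 0.20 (org.apache.hadoop.mapreduce) API. It also shows where the thresholding I mentioned above would go. The class name, the threshold value, the key encoding, and the assumption that each mapper has already divided its contribution by |I(a)|*|I(b)| are all just placeholders for illustration:

    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Key: the node pair (a,b) encoded as text, e.g. "a\tb".
    // Values: partial similarity contributions emitted by the mappers,
    // each assumed to already be divided by |I(a)| * |I(b)|.
    // The reducer sums the contributions, scales by the decay constant C,
    // and only emits the pair if the result is above a threshold.
    // Dropping small values is what keeps the similarity matrix sparse
    // from one iteration to the next.
    public class ThresholdSumReducer
        extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {

      private static final double C = 0.8;          // SimRank decay constant
      private static final double THRESHOLD = 1e-4; // placeholder value

      @Override
      protected void reduce(Text pair, Iterable<DoubleWritable> contributions,
                            Context context)
          throws IOException, InterruptedException {
        double sum = 0;
        for (DoubleWritable v : contributions) {
          sum += v.get();
        }
        double similarity = C * sum;
        if (similarity > THRESHOLD) {
          context.write(pair, new DoubleWritable(similarity));
        }
      }
    }

The essential point is only that all of the contributions to one (a, b) pair land in the same reduce call, so both the summation and the threshold can be applied there without any mapper-to-mapper communication.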
I think that what you should do is organize your recursive step so that the sum happens in the reduce. Each mapper would then output records where the key is the index pair for the summation (a and b in the notation used on Wikipedia) and the reduce does this summation. This implies that you change your input format slightly to be variable-length records containing a node index and the In set for that node. This transformation is a very simple, one-time map-reduce step (a rough sketch of it is included at the end of this message).

More specifically, you would have the original input, which initially has zero values for R:

    links: (Node from, Node to, double R)

and a transform MR step that produces an auxiliary file

    inputSets: (Node to, List<Node> inputs)

like this:

    map:    (Node from, Node to) -> (to, from)
    reduce: (Node to, List<Node> inputs) -> (to, inputs)

Now you need to join the original input to the auxiliary file on both the from and to indexes. This join would require two map-reduces, one to join on the from index and one to join on the to index. The reduce in the final step should emit the cross product of the input sets. Then you need to join that against the original data, which requires a single map-reduce. Finally, you need to group on the to index, sum up all of the distances, and output a data set for the next round of computation.

This is pretty complicated, but not all that hard to do. The key, as I mentioned before, will be to avoid catastrophic fill-in of your distance matrix. If it becomes non-sparse, then the Cartesian products in the joins above will be absolutely catastrophic because you will have far too much data flying around. This problem is not specific to map-reduce; it is a problem with SimRank itself.

I would also recommend that you look at some alternative distance measures. One simple one is just cooccurrence filtered by a simple test. This requires fewer map-reduce steps, does not require multiple iterations, and cannot produce fill-in. I have used iterative algorithms similar to SimRank, but my personal experience is that this simpler algorithm produces recommendations on par with much more complex algorithms, and that subsequent presentation-level changes have more potential for improvement than algorithmic changes.

I hope that this helps.

--
Ted Dunning, CTO DeepDyve
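P.S. Here is a very rough, untested sketch of the one-time transform step described above, again using the Hadoop 0.20 (org.apache.hadoop.mapreduce) API. The tab-separated input layout, the use of Text for node ids, and the class names are just illustrative assumptions:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Input lines are assumed to look like "from<TAB>to<TAB>R".
    // The mapper inverts each edge so that all predecessors of a node
    // meet at the same reducer; the reducer writes one record per node
    // containing its In set, e.g. "to<TAB>in1,in2,in3".
    public class InputSetJob {

      public static class InvertEdgeMapper
          extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          String[] fields = line.toString().split("\t");
          // fields[0] = from, fields[1] = to; fields[2] (R) is ignored here
          context.write(new Text(fields[1]), new Text(fields[0]));
        }
      }

      public static class CollectInputsReducer
          extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text to, Iterable<Text> inputs, Context context)
            throws IOException, InterruptedException {
          StringBuilder inSet = new StringBuilder();
          for (Text from : inputs) {
            if (inSet.length() > 0) {
              inSet.append(',');
            }
            inSet.append(from.toString());
          }
          context.write(to, new Text(inSet.toString()));
        }
      }
    }

A small driver that wires these two classes into a Job with TextInputFormat would complete the step; the later joins follow the same pattern, just with composite keys.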