Thank you for your reply :)

Here are some of my thoughts.

2009/3/25 Ted Dunning <ted.dunn...@gmail.com>
>
> Answering some of your email out of order,
>
> On Mon, Mar 23, 2009 at 10:00 PM, Xuan Yang <sailingw...@gmail.com> wrote:
>
> > These days I am doing some research work on SimRank, which is a model
> > for measuring the "similarity" of objects.
>
>
> Great.
>
>
>
> > I think it would be great to solve these problems and implement a
> > mapreduce-version of algorithm for SimRank.
> > I intend to implement this as my Summer of Code project. Would you be
> > interested in this?
>
>
> This sounds like a fine project.



I am very happy to hear that! :)


>
>
> > And can I get some advice from you?
>
>
> I am sure you can get lots of advice from this group, both on the algorithm
> and suggestions on how to code it into a program.


Thanks! I will work hard on it.
>
> Back to your detailed suggestion.  Here are some of my first thoughts:
>
>
> > 1, the directed graph could be saved in the form of an edge list in HBase.
> > And the result S_n(a,b) could also be saved in HBase as a matrix.
>
>
> HBase or flat files would be a fine way to store this, and an edge list is
> an excellent way to store the data.
>
> The output matrix should probably be stored as triples containing row,
> column and value.




1. How about a Bigtable-style layout? Intuitively, the matrix fits HBase's
table model well. It would look like:

              a         b         c         d
     a     S(a,a)    S(a,b)    S(a,c)    S(a,d)    ...
     b     S(b,a)    S(b,b)    S(b,c)    S(b,d)    ...
     c     S(c,a)    S(c,b)    S(c,c)    S(c,d)    ...
     d     S(d,a)    S(d,b)    S(d,c)    S(d,d)    ...
     ...

   This matrix is symmetric, and the values on the main diagonal are all 1;
maybe these properties are helpful.
   Besides, we could use the iteration number as the timestamp. This is
also useful when we cannot guarantee that every pair gets recalculated in
each iteration of MapReduce. For example: after some iterations we may have
computed S_{k+1} for part of the node pairs while for other pairs we have
only computed S_k; then, in the next iteration, we can use S_k to compute
S_{k+1}, and S_{k+1} to compute S_{k+2}.

2. If we can make sure that the matrix stays sparse, using triples will be
much more efficient.
   In a large graph, only some of the nodes are near each other; the rest
are far apart, so their SimRank values should be very small. If we set a
threshold, we can treat every pair whose SimRank value is below the
threshold as 0 and simply not store a triple for it (see the sketch below).

   Still, I don't know how long it would take to decide whether a
particular pair is present in HBase and to look it up. Maybe it needs an
index or some other technique such as hashing.
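
To make the triple idea concrete, here is a minimal Java sketch of the
representation I have in mind; the names (SimTriple, sieve) are purely
illustrative, not from any existing code:

    import java.util.ArrayList;
    import java.util.List;

    /** One sparse entry of the similarity matrix. */
    public class SimTriple {
        final long row;       // node a (symmetry: store only row <= col)
        final long col;       // node b
        final double value;   // S_k(a, b)
        final int iteration;  // k, usable as an HBase timestamp/version

        SimTriple(long row, long col, double value, int iteration) {
            this.row = row;
            this.col = col;
            this.value = value;
            this.iteration = iteration;
        }

        /** Keep only triples whose value passes the sparsity threshold. */
        static List<SimTriple> sieve(List<SimTriple> triples, double threshold) {
            List<SimTriple> kept = new ArrayList<SimTriple>();
            for (SimTriple t : triples) {
                // values below the threshold are treated as 0 and not stored
                if (t.value >= threshold) {
                    kept.add(t);
                }
            }
            return kept;
        }
    }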


>
> > 2, We can distribute all the n^2 pairs across the map nodes to calculate
> > the SimRank values of the next iteration.
>
>
> Hopefully you can keep this sparse.  If you cannot, then the algorithm may
> not be suitable for use on large data no matter how you parallelize it.



Hmm... I haven't figured out a good method yet.


>
> Skipping item 3 because I don't have time right now to analyze it in
> detail...
>




The partial sums method was introduced and proved correct in "Accuracy
Estimate and Optimization Techniques for SimRank Computation", VLDB 2008.
http://modis.ispras.ru/Lizorkin/Publications/simrank_accuracy.pdf

In this paper there are three main optimization methods:

"Selecting Essential Node Pairs", which is much like your point about
keeping the node pairs sparse. I think it could be implemented in the Map
nodes.

"Partial Sums", mentioned in my last mail, could also be implemented in the
Map nodes (a small sketch of the idea follows below).

"Threshold-Sieved Similarity", I think, could be implemented in the Reduce
nodes.

Actually, this paper is what gave me the idea of implementing the SimRank
algorithm in the map-reduce framework :-)
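
For reference, the recurrence being computed is

    S_{k+1}(a,b) = C / (|In(a)| * |In(b)|) * sum over i in In(a), j in In(b) of S_k(i,j)

and the partial sums trick computes Partial(a,j) = sum over i in In(a) of
S_k(i,j) once per node a and reuses it for every b. Here is a minimal
single-machine sketch in Java, just to show where the savings come from
(all names are illustrative, and the in-neighbor lists and the sparse rows
of S_k are assumed given):

    import java.util.HashMap;
    import java.util.Map;

    public class PartialSums {
        /** Compute S_{k+1}(a, b) for one fixed a and all b, reusing partial sums. */
        static Map<Long, Double> nextRow(long a,
                                         Map<Long, long[]> inNeighbors,    // node -> In(node)
                                         Map<Long, Map<Long, Double>> sk,  // S_k as sparse rows
                                         double c) {
            Map<Long, Double> next = new HashMap<Long, Double>();
            long[] inA = inNeighbors.get(a);
            if (inA == null || inA.length == 0) {
                return next;  // a node with no in-neighbors has similarity 0 to every other node
            }
            // Partial(a, j) = sum over i in In(a) of S_k(i, j), computed once
            // and reused for every b, instead of once per (a, b) pair.
            Map<Long, Double> partial = new HashMap<Long, Double>();
            for (long i : inA) {
                Map<Long, Double> row = sk.get(i);
                if (row == null) continue;  // sparse: a missing row is all zeros
                for (Map.Entry<Long, Double> e : row.entrySet()) {
                    Double old = partial.get(e.getKey());
                    partial.put(e.getKey(), (old == null ? 0.0 : old) + e.getValue());
                }
            }
            for (Map.Entry<Long, long[]> nb : inNeighbors.entrySet()) {
                long b = nb.getKey();
                long[] inB = nb.getValue();
                if (inB.length == 0) continue;
                double sum = 0;
                for (long j : inB) {
                    Double p = partial.get(j);
                    if (p != null) sum += p;
                }
                // By definition S_{k+1}(a, a) = 1; the diagonal is pinned separately.
                next.put(b, c * sum / (inA.length * inB.length));
            }
            return next;
        }
    }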



>
> > 4, besides, there are other optimization methods, such as thresholding,
> > that could be used in the Map nodes and Reduce nodes.
>
>
> Thresholding is likely to be a critical step in order to preserve sparsity.
>
>
> > 1, It is true that map-reduce makes the computation at each node easier.
> > Yet if the volume of data is very large, the data transport latency will
> > become more and more serious.
>
>
> I think that you will find with map-reduce in general, and with Hadoop more
> specifically, that as the problem gets larger, the discipline imposed by the
> map-reduce formulation on your data transport patterns actually allows
> better scaling than you would expect.  Of course, if your data size scales
> with n^2, you are in trouble no matter how you parallelize.
>
> A good example came a year or so ago with a machine translation group at a
> university in Maryland.  They had a large program that attempted to do
> cooccurrence counting on text corpora using a single multi-core machine.
> They started to convert this to Hadoop using the simplest possible
> representation for the cooccurrence matrix (index, value triples) and
> expected that the redundancy of this representation would lead to very bad
> results.  Since they expected bad results, they also expected to do lots of
> optimization on the map-reduce version.  Also, since the original program
> was largely memory based, they expected that the communication overhead of
> Hadoop would severely hurt performance.
>
> The actual results were that an 18 hour program run on 70 machines took 20
> minutes.  This is nearly perfect speedup over the sequential version.  The
> moral is that highly sequential transport of large blocks of information can
> be incredibly efficient.
>
> > So, methods to reduce IO would be very helpful.
>
>
> My first recommendation on this is to wait.  Get an implementation first,
> then optimize.  The problems you have will not be the problems you expect.
>


Yes, I think doing some experiments will be very useful.


>
> > 2, SimRank computes the similarity between all the nodes. If we map a
> > group of nodes {A, B, C} into one map node, and {D, E, F} into another
> > map node, the computation inside the set {A, B, C} will be easy, and so
> > will the computation inside {D, E, F}. But when we want to compute the
> > SimRank between A and D, it will not be very convenient.
>
>
> Map nodes should never communicate with each other.  That is the purpose of
> the reduce layer.
>
> I think that what you should do is organize your recursive step so that the
> sum happens in the reduce.  Then each mapper would output records where the
> key is the index pair for the summation (a and b in the notation used on
> wikipedia) and the reduce does this summation.  This implies that you
> change your input format slightly to be variable-length records containing
> a node index and the In set for that node.  This transformation is a very
> simple, one time map-reduce step.


Yes! I hadn't thought about that!
At first I considered letting the graph nodes, each with its In set, be
the input, and then reading the SimRank values needed for the next
iteration.

Do you mean letting the SimRank triples (nodeX, nodeY, R) be the map
input, and then outputting (Ai, Bj) as the key and R as the value, where
Ai belongs to Out(X) and Bj belongs to Out(Y)?

That would solve the problem!
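
In Hadoop terms I imagine the step roughly as below. This is only a
minimal sketch: it assumes the Out lists and in-degrees fit in memory and
are loaded in setup(), e.g. from the distributed cache (omitted here); in
general that is exactly where the join steps you describe next come in.
All class and field names are made up for illustration:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class SimRankStep {
        static final double C = 0.8;  // decay factor, commonly set around 0.8

        // Input: one record per nonzero S_k entry, key "X,Y", value S_k(X, Y),
        // e.g. from a SequenceFile written by the previous round.
        public static class PropagateMapper
                extends Mapper<Text, DoubleWritable, Text, DoubleWritable> {
            private final Map<String, String[]> out = new HashMap<String, String[]>();

            @Override
            protected void setup(Context ctx) {
                // load node -> Out(node) here, e.g. from the distributed cache
            }

            @Override
            protected void map(Text key, DoubleWritable sk, Context ctx)
                    throws IOException, InterruptedException {
                String[] xy = key.toString().split(",");
                String[] outX = out.get(xy[0]);
                String[] outY = out.get(xy[1]);
                if (outX == null || outY == null) return;
                for (String ai : outX) {
                    for (String bj : outY) {
                        // every pair (Ai, Bj) downstream of (X, Y) receives S_k(X, Y)
                        ctx.write(new Text(ai + "," + bj), sk);
                    }
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            // node -> |In(node)|; also assumed loaded in setup() (omitted)
            private final Map<String, Integer> inDegree = new HashMap<String, Integer>();

            @Override
            protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                double sum = 0;
                for (DoubleWritable v : values) {
                    sum += v.get();
                }
                String[] ab = key.toString().split(",");
                // S_{k+1}(a,b) = C / (|In(a)| * |In(b)|) * sum over predecessor pairs
                ctx.write(key, new DoubleWritable(
                        C * sum / (inDegree.get(ab[0]) * inDegree.get(ab[1]))));
            }
        }
    }

One detail to take care of: S_{k+1}(a, a) = 1 by definition, so the
diagonal has to be re-asserted after each round.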

>
> More specifically, you would have original input which initially has zero
> values for R:
>
>   links: (Node from, Node to, double R)
>
> and a transform MR step that does this to produce an auxiliary file
> inputSets: (Node to, List<Node> inputs):
>
>    map: (Node from, Node to) -> (to, from)
>    reduce: (Node to, List<Node> inputs) -> to, inputs
>
> Now you need to join the original input to the auxiliary file on both the
> from and to indexes.  This join would require two map-reduces, one to join
> on the from index and one to join on the to index.  The reduce in the final
> step should emit the cross product of the input sets.  Then you need to join
> that against the original data.  That join would require a single map-reduce
> for the join.  Finally, you need to group on the to index and sum up all of
> the distances and output a data set for the next round of computation.
>
> This is pretty complicated, but not all that hard to do.
>
> The key, as I mentioned before, will be to avoid catastrophic fill-in of
> your distance matrix.  If it becomes non-sparse, then the Cartesian products
> in the joins above will be absolutely catastrophic because you will have far
> too much data flying around.  This problem is not specific to map-reduce; it
> is a problem with SimRank itself.


Maybe thresholding will help with that.
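
Also, to check that I understand the one-time transform step, here is a
minimal sketch of it (names illustrative; the input is the edge list,
e.g. delivered as (from, to) pairs by KeyValueTextInputFormat):

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class BuildInputSets {
        // (from, to) -> (to, from): regroup the edge list by destination node
        public static class InvertMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            protected void map(Text from, Text to, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(to, from);
            }
        }

        // (to, [from...]) -> (to, "from1 from2 ..."): the In set for each node
        public static class CollectReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text to, Iterable<Text> inputs, Context ctx)
                    throws IOException, InterruptedException {
                StringBuilder in = new StringBuilder();
                for (Text f : inputs) {
                    if (in.length() > 0) in.append(' ');
                    in.append(f.toString());
                }
                ctx.write(to, new Text(in.toString()));
            }
        }
    }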


>
> I would also recommend that you look at some alternative distance measures.
> One simple one is just cooccurrence as filtered by a simple test.  This
> requires fewer map-reduce steps, does not require multiple iterations and
> cannot produce fill-in.  I have used iterative algorithms similar to
> SimRank, but my personal experience is that this simpler algorithm produces
> recommendations on par with much more complex algorithms, and that
> subsequent presentation-level changes have more potential for improvement
> than algorithmic changes.
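
If I try this simpler cooccurrence measure, I suppose the counting part
would look roughly like the sketch below. The filtering test (for example
a frequency or log-likelihood ratio cutoff) would be a separate pass, and
all names here are illustrative:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CooccurrenceCount {
        public static class PairMapper extends Mapper<Text, Text, Text, LongWritable> {
            private static final LongWritable ONE = new LongWritable(1);

            // Input: one record per "basket" (say, the items one user touched,
            // space-separated); emit each unordered item pair once per basket.
            @Override
            protected void map(Text user, Text items, Context ctx)
                    throws IOException, InterruptedException {
                String[] it = items.toString().split(" ");
                for (int i = 0; i < it.length; i++) {
                    for (int j = i + 1; j < it.length; j++) {
                        ctx.write(new Text(it[i] + "," + it[j]), ONE);
                    }
                }
            }
        }

        public static class SumReducer
                extends Reducer<Text, LongWritable, Text, LongWritable> {
            @Override
            protected void reduce(Text pair, Iterable<LongWritable> counts, Context ctx)
                    throws IOException, InterruptedException {
                long sum = 0;
                for (LongWritable c : counts) {
                    sum += c.get();
                }
                ctx.write(pair, new LongWritable(sum));
            }
        }
    }

(Sorting each pair so that (a,b) and (b,a) collapse into one key would be
a small refinement.)
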
>
> I hope that this helps.
>
> --
> Ted Dunning, CTO
> DeepDyve


Thanks again for your advice :)
