Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
We did have a related issue (https://issues.apache.org/jira/browse/GIRAPH-155). On 5/29/12 6:54 AM, Claudio Martella wrote: I'm not sure they will be needed to send them on the first superstep. They'll be created and used in the second superstep if necessary. If they need it in the first superstep, then i guess they'll put them as a line in the inputfile. I agree with you that this is kind of messed up :) On Tue, May 29, 2012 at 3:23 PM, Sebastian Schelter wrote: Oh sorry, I didn't know that discussion. The problem I see is that in every implementation, a user might run into this issue, and I don't think its ideal to force users to always run a round of sending empty messages at the beginning. Maybe the system should (somehow) automagically do that for the users? Really seems to be an awkward situation though... --sebastian On 29.05.2012 15:03, Claudio Martella wrote: About the mapreduce job to prepare the inputset, I did advocate for this solution instead of supporting automatic creation of non-existent vertices implicitly (which I believe adds a logical path in vertex resolution which has some drawbacks e.g you have to check in the hashmap for the existence of the destination vertex for each message, which is "fine" now that it's a hashmap, but it's going to be less fine when/if we turn to TreeMap for out-of-core). Unfortunately the other committers preferred going for the path that helps userland's life, so I guess this solution is not to be considered here either. On Tue, May 29, 2012 at 1:48 PM, Sebastian Schelter wrote: On 29.05.2012 13:13, Paolo Castagna wrote: Hi Sebastian Sebastian Schelter wrote: Why do you only recompute the pageRank in each second superstep? Can we not use the aggregated value of the dangling nodes from the last superstep? I removed the computing of PageRank values every each second superstep. However, I needed to use a couple of aggregators for the dangling nodes contribution instead of just one: "dangling-current" and "dangling-previous". Each superstep, I need to reset the dangling-current aggregator, at the same time, I need to know the value of the aggregator at a previous superstep. You can save the value from the previous step in a static variable in the WorkerContext before resetting the aggregator. I hope it makes sense, let me know if you have a better idea. Overall I think we're on a good way to a robust, real-world PageRank implementation, I managed to implement the convergence check with an aggregator, will post an updated patch soon. I think I've just done it, have a look [1] and let me know if you would have done it differently. Paolo [1] https://github.com/castagna/jena-grande/blob/11f07dd897562f7a4bf8d6e4845128d7f2cdd2ff/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankVertex.java#L90
Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
I'm not sure they will be needed to send them on the first superstep. They'll be created and used in the second superstep if necessary. If they need it in the first superstep, then i guess they'll put them as a line in the inputfile. I agree with you that this is kind of messed up :) On Tue, May 29, 2012 at 3:23 PM, Sebastian Schelter wrote: > Oh sorry, I didn't know that discussion. The problem I see is that in > every implementation, a user might run into this issue, and I don't > think its ideal to force users to always run a round of sending empty > messages at the beginning. > > Maybe the system should (somehow) automagically do that for the users? > Really seems to be an awkward situation though... > > --sebastian > > > > On 29.05.2012 15:03, Claudio Martella wrote: >> About the mapreduce job to prepare the inputset, I did advocate for >> this solution instead of supporting automatic creation of non-existent >> vertices implicitly (which I believe adds a logical path in vertex >> resolution which has some drawbacks e.g you have to check in the >> hashmap for the existence of the destination vertex for each message, >> which is "fine" now that it's a hashmap, but it's going to be less >> fine when/if we turn to TreeMap for out-of-core). >> >> Unfortunately the other committers preferred going for the path that >> helps userland's life, so I guess this solution is not to be >> considered here either. >> >> On Tue, May 29, 2012 at 1:48 PM, Sebastian Schelter wrote: >>> On 29.05.2012 13:13, Paolo Castagna wrote: Hi Sebastian Sebastian Schelter wrote: > Why do you only recompute the pageRank in each second superstep? Can we > not use the aggregated value of the dangling nodes from the last > superstep? I removed the computing of PageRank values every each second superstep. However, I needed to use a couple of aggregators for the dangling nodes contribution instead of just one: "dangling-current" and "dangling-previous". Each superstep, I need to reset the dangling-current aggregator, at the same time, I need to know the value of the aggregator at a previous superstep. >>> >>> You can save the value from the previous step in a static variable in >>> the WorkerContext before resetting the aggregator. >>> I hope it makes sense, let me know if you have a better idea. > Overall I think we're on a good way to a robust, real-world PageRank > implementation, I managed to implement the convergence check with an > aggregator, will post an updated patch soon. I think I've just done it, have a look [1] and let me know if you would have done it differently. Paolo [1] https://github.com/castagna/jena-grande/blob/11f07dd897562f7a4bf8d6e4845128d7f2cdd2ff/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankVertex.java#L90 >>> >> >> >> > -- Claudio Martella claudio.marte...@gmail.com
Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Oh sorry, I didn't know that discussion. The problem I see is that in every implementation, a user might run into this issue, and I don't think its ideal to force users to always run a round of sending empty messages at the beginning. Maybe the system should (somehow) automagically do that for the users? Really seems to be an awkward situation though... --sebastian On 29.05.2012 15:03, Claudio Martella wrote: > About the mapreduce job to prepare the inputset, I did advocate for > this solution instead of supporting automatic creation of non-existent > vertices implicitly (which I believe adds a logical path in vertex > resolution which has some drawbacks e.g you have to check in the > hashmap for the existence of the destination vertex for each message, > which is "fine" now that it's a hashmap, but it's going to be less > fine when/if we turn to TreeMap for out-of-core). > > Unfortunately the other committers preferred going for the path that > helps userland's life, so I guess this solution is not to be > considered here either. > > On Tue, May 29, 2012 at 1:48 PM, Sebastian Schelter wrote: >> On 29.05.2012 13:13, Paolo Castagna wrote: >>> Hi Sebastian >>> >>> Sebastian Schelter wrote: Why do you only recompute the pageRank in each second superstep? Can we not use the aggregated value of the dangling nodes from the last superstep? >>> >>> I removed the computing of PageRank values every each second superstep. >>> However, I needed to use a couple of aggregators for the dangling nodes >>> contribution instead of just one: "dangling-current" and >>> "dangling-previous". >>> >>> Each superstep, I need to reset the dangling-current aggregator, at the >>> same time, I need to know the value of the aggregator at a previous >>> superstep. >> >> You can save the value from the previous step in a static variable in >> the WorkerContext before resetting the aggregator. >> >>> >>> I hope it makes sense, let me know if you have a better idea. >>> Overall I think we're on a good way to a robust, real-world PageRank implementation, I managed to implement the convergence check with an aggregator, will post an updated patch soon. >>> >>> I think I've just done it, have a look [1] and let me know if you would have >>> done it differently. >>> >>> Paolo >>> >>> [1] >>> https://github.com/castagna/jena-grande/blob/11f07dd897562f7a4bf8d6e4845128d7f2cdd2ff/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankVertex.java#L90 >>> >>> >> > > >
Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
About the mapreduce job to prepare the inputset, I did advocate for this solution instead of supporting automatic creation of non-existent vertices implicitly (which I believe adds a logical path in vertex resolution which has some drawbacks e.g you have to check in the hashmap for the existence of the destination vertex for each message, which is "fine" now that it's a hashmap, but it's going to be less fine when/if we turn to TreeMap for out-of-core). Unfortunately the other committers preferred going for the path that helps userland's life, so I guess this solution is not to be considered here either. On Tue, May 29, 2012 at 1:48 PM, Sebastian Schelter wrote: > On 29.05.2012 13:13, Paolo Castagna wrote: >> Hi Sebastian >> >> Sebastian Schelter wrote: >>> Why do you only recompute the pageRank in each second superstep? Can we >>> not use the aggregated value of the dangling nodes from the last superstep? >> >> I removed the computing of PageRank values every each second superstep. >> However, I needed to use a couple of aggregators for the dangling nodes >> contribution instead of just one: "dangling-current" and "dangling-previous". >> >> Each superstep, I need to reset the dangling-current aggregator, at the >> same time, I need to know the value of the aggregator at a previous >> superstep. > > You can save the value from the previous step in a static variable in > the WorkerContext before resetting the aggregator. > >> >> I hope it makes sense, let me know if you have a better idea. >> >>> Overall I think we're on a good way to a robust, real-world PageRank >>> implementation, I managed to implement the convergence check with an >>> aggregator, will post an updated patch soon. >> >> I think I've just done it, have a look [1] and let me know if you would have >> done it differently. >> >> Paolo >> >> [1] >> https://github.com/castagna/jena-grande/blob/11f07dd897562f7a4bf8d6e4845128d7f2cdd2ff/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankVertex.java#L90 >> >> > -- Claudio Martella claudio.marte...@gmail.com
Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
On 29.05.2012 13:13, Paolo Castagna wrote: > Hi Sebastian > > Sebastian Schelter wrote: >> Why do you only recompute the pageRank in each second superstep? Can we >> not use the aggregated value of the dangling nodes from the last superstep? > > I removed the computing of PageRank values every each second superstep. > However, I needed to use a couple of aggregators for the dangling nodes > contribution instead of just one: "dangling-current" and "dangling-previous". > > Each superstep, I need to reset the dangling-current aggregator, at the > same time, I need to know the value of the aggregator at a previous > superstep. You can save the value from the previous step in a static variable in the WorkerContext before resetting the aggregator. > > I hope it makes sense, let me know if you have a better idea. > >> Overall I think we're on a good way to a robust, real-world PageRank >> implementation, I managed to implement the convergence check with an >> aggregator, will post an updated patch soon. > > I think I've just done it, have a look [1] and let me know if you would have > done it differently. > > Paolo > > [1] > https://github.com/castagna/jena-grande/blob/11f07dd897562f7a4bf8d6e4845128d7f2cdd2ff/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankVertex.java#L90 > >
Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Hi Sebastian Sebastian Schelter wrote: > Why do you only recompute the pageRank in each second superstep? Can we > not use the aggregated value of the dangling nodes from the last superstep? I removed the computing of PageRank values every each second superstep. However, I needed to use a couple of aggregators for the dangling nodes contribution instead of just one: "dangling-current" and "dangling-previous". Each superstep, I need to reset the dangling-current aggregator, at the same time, I need to know the value of the aggregator at a previous superstep. I hope it makes sense, let me know if you have a better idea. > Overall I think we're on a good way to a robust, real-world PageRank > implementation, I managed to implement the convergence check with an > aggregator, will post an updated patch soon. I think I've just done it, have a look [1] and let me know if you would have done it differently. Paolo [1] https://github.com/castagna/jena-grande/blob/11f07dd897562f7a4bf8d6e4845128d7f2cdd2ff/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankVertex.java#L90
Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Sebastian Schelter wrote: > However, the problem with this input is that the dangling vertices that > don't have a line of their own (such as 11) cannot contribute their > accumulated rank, as no vertex for them will be instantiated. So > counting them doesn't help either. No, the 'implicit' dangling nodes (such as 6, 7, 9 and 11 below) are instantiated when you send a message to them. If you run the example, you'll see that after the first superstep there are 11 vertices which are sending and receiving messages (as it should be with correct input). > I think that we should rely on users supplying valid input (a line for > each vertex) and not try to correct for that in the vertex class. Well, I don't disagree in principle. But in practice this won't stop users making mistakes and provide your software with bad data as input. :-) One superstep for cleaning/validating the input data isn't that bad after all. > Creating a line for each vertex from such a file is an easy task that is > doable with a single MapReduce pass over the data beforehand. Sure. (Why is this better than a superstep with Giraph?) ;-) Paolo
Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Hi Paolo, I see what you mean. However, the problem with this input is that the dangling vertices that don't have a line of their own (such as 11) cannot contribute their accumulated rank, as no vertex for them will be instantiated. So counting them doesn't help either. I think that we should rely on users supplying valid input (a line for each vertex) and not try to correct for that in the vertex class. Creating a line for each vertex from such a file is an easy task that is doable with a single MapReduce pass over the data beforehand. --sebastian On 28.05.2012 22:13, Paolo Castagna wrote: > Hi Sebastian, > you can try yourself with some simple input (which contains dangling nodes). > > For example, say I have this adjacency list (with mistakes/repetitions and > self-links, ignore those): > > 1 1 2 2 > 2 3 5 7 9 11 > 3 3 3 6 > 4 > 5 1 2 3 11 > 8 10 > 10 5 > > How many vertices? 7 or 11? > > I think this graph has 11 and it's 11 you need to use as number of vertices > when you compute PageRank. > > Paolo > > Sebastian Schelter wrote: >> Hi Paolo, >> >> Why would getNumVertices() not give back the correct number of vertices? >> This call should always give back the overall number of vertices (if it >> doesn't we have to fix it) and you shouldn't have to rely on tricks to >> count stuff via aggregators. >> >> --sebastian > >
Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Hi Sebastian, you can try yourself with some simple input (which contains dangling nodes). For example, say I have this adjacency list (with mistakes/repetitions and self-links, ignore those): 1 1 2 2 2 3 5 7 9 11 3 3 3 6 4 5 1 2 3 11 8 10 10 5 How many vertices? 7 or 11? I think this graph has 11 and it's 11 you need to use as number of vertices when you compute PageRank. Paolo Sebastian Schelter wrote: > Hi Paolo, > > Why would getNumVertices() not give back the correct number of vertices? > This call should always give back the overall number of vertices (if it > doesn't we have to fix it) and you shouldn't have to rely on tricks to > count stuff via aggregators. > > --sebastian
Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Hi Paolo, Why would getNumVertices() not give back the correct number of vertices? This call should always give back the overall number of vertices (if it doesn't we have to fix it) and you shouldn't have to rely on tricks to count stuff via aggregators. --sebastian On 28.05.2012 21:20, Paolo Castagna wrote: > Hi Sebastian > > Sebastian Schelter wrote: >> I think the code can be improved partially. You don't have to count the >> vertices via an aggregator, you can simply use getNumVertices(). > > No, you cannot use getNumVertices() since it won't count dangling nodes. > If you want to count for dangling nodes you need to send a message to all > nodes. > >> Why do you only recompute the pageRank in each second superstep? Can we >> not use the aggregated value of the dangling nodes from the last superstep? > > Maybe, but I did not managed to do it. > > The dangling node contribution needs to be recomputed each time, so the > aggregator needs to be reset to 0 each time and kept unchanged during each > superstep. The value is the sum of the PageRank values of all dangling nodes > in the previous superstep. > > The good thing is that now there is a 'correct' implementation and a test > we can use to verify that PageRank values are correct (against a third > party implementation: JUNG). > >> Overall I think we're on a good way to a robust, real-world PageRank >> implementation, I managed to implement the convergence check with an >> aggregator, will post an updated patch soon. > > Thanks, that was my next step. :-) > + dumping factor configurable and general clean-up. > > Paolo > >> >> Best, >> Sebastian >> >> >> On 28.05.2012 18:39, Paolo Castagna wrote: >>> Paolo Castagna wrote: Sebastian Schelter wrote: > I guess that summing up and redistributing the pagerank of dangling > vertices can also be done without an extra superstep in an aggregator. Yeah! Why didn't I think about that? Thanks, great suggestion. I am going to give this a go, at first without extending RandomWalkVertex since I want to see how it might work. This would also inform design and improvements of the current RandomWalkVertex. >>> You can find a 'proper' ;-) implementation of PageRank with dangling nodes >>> support, sum at the end of all PageRank values equals to 1.00 (since it's a >>> probability distribution) and PageRank values validated against a third >>> party >>> implementation (i.e. JUNG), here [1]. >>> >>> I have not managed to do it as Sebastian suggested yet. >>> >>> Paolo >>> >>> [1] >>> https://github.com/castagna/jena-grande/blob/3c2d9f85bacb737acd575e3e287dc0fcc6bd96b9/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankVertex.java >>> >>
Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Hi Sebastian Sebastian Schelter wrote: > Could we try to merge this with the patch from > https://issues.apache.org/jira/browse/GIRAPH-191 ? I'll look at doing that, but so far I did not managed to have a correct implementation. Now, that I have an implementation which I know to be correct, I'll try using RandomWalkVertex. Paolo
Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Hi Sebastian Sebastian Schelter wrote: > I think the code can be improved partially. You don't have to count the > vertices via an aggregator, you can simply use getNumVertices(). No, you cannot use getNumVertices() since it won't count dangling nodes. If you want to count for dangling nodes you need to send a message to all nodes. > Why do you only recompute the pageRank in each second superstep? Can we > not use the aggregated value of the dangling nodes from the last superstep? Maybe, but I did not managed to do it. The dangling node contribution needs to be recomputed each time, so the aggregator needs to be reset to 0 each time and kept unchanged during each superstep. The value is the sum of the PageRank values of all dangling nodes in the previous superstep. The good thing is that now there is a 'correct' implementation and a test we can use to verify that PageRank values are correct (against a third party implementation: JUNG). > Overall I think we're on a good way to a robust, real-world PageRank > implementation, I managed to implement the convergence check with an > aggregator, will post an updated patch soon. Thanks, that was my next step. :-) + dumping factor configurable and general clean-up. Paolo > > Best, > Sebastian > > > On 28.05.2012 18:39, Paolo Castagna wrote: >> Paolo Castagna wrote: >>> Sebastian Schelter wrote: I guess that summing up and redistributing the pagerank of dangling vertices can also be done without an extra superstep in an aggregator. >>> Yeah! Why didn't I think about that? >>> Thanks, great suggestion. >>> >>> I am going to give this a go, at first without extending RandomWalkVertex >>> since I want to see how it might work. >>> This would also inform design and improvements of the current >>> RandomWalkVertex. >> You can find a 'proper' ;-) implementation of PageRank with dangling nodes >> support, sum at the end of all PageRank values equals to 1.00 (since it's a >> probability distribution) and PageRank values validated against a third party >> implementation (i.e. JUNG), here [1]. >> >> I have not managed to do it as Sebastian suggested yet. >> >> Paolo >> >> [1] >> https://github.com/castagna/jena-grande/blob/3c2d9f85bacb737acd575e3e287dc0fcc6bd96b9/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankVertex.java >> >
Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
I think the code can be improved partially. You don't have to count the vertices via an aggregator, you can simply use getNumVertices(). Why do you only recompute the pageRank in each second superstep? Can we not use the aggregated value of the dangling nodes from the last superstep? Overall I think we're on a good way to a robust, real-world PageRank implementation, I managed to implement the convergence check with an aggregator, will post an updated patch soon. Best, Sebastian On 28.05.2012 18:39, Paolo Castagna wrote: > Paolo Castagna wrote: >> Sebastian Schelter wrote: >>> I guess that summing up and redistributing the pagerank of dangling >>> vertices can also be done without an extra superstep in an aggregator. >> >> Yeah! Why didn't I think about that? >> Thanks, great suggestion. >> >> I am going to give this a go, at first without extending RandomWalkVertex >> since I want to see how it might work. >> This would also inform design and improvements of the current >> RandomWalkVertex. > > You can find a 'proper' ;-) implementation of PageRank with dangling nodes > support, sum at the end of all PageRank values equals to 1.00 (since it's a > probability distribution) and PageRank values validated against a third party > implementation (i.e. JUNG), here [1]. > > I have not managed to do it as Sebastian suggested yet. > > Paolo > > [1] > https://github.com/castagna/jena-grande/blob/3c2d9f85bacb737acd575e3e287dc0fcc6bd96b9/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankVertex.java >
Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Hi Paolo, Could we try to merge this with the patch from https://issues.apache.org/jira/browse/GIRAPH-191 ? Best, Sebastian On 28.05.2012 18:39, Paolo Castagna wrote: > Paolo Castagna wrote: >> Sebastian Schelter wrote: >>> I guess that summing up and redistributing the pagerank of dangling >>> vertices can also be done without an extra superstep in an aggregator. >> >> Yeah! Why didn't I think about that? >> Thanks, great suggestion. >> >> I am going to give this a go, at first without extending RandomWalkVertex >> since I want to see how it might work. >> This would also inform design and improvements of the current >> RandomWalkVertex. > > You can find a 'proper' ;-) implementation of PageRank with dangling nodes > support, sum at the end of all PageRank values equals to 1.00 (since it's a > probability distribution) and PageRank values validated against a third party > implementation (i.e. JUNG), here [1]. > > I have not managed to do it as Sebastian suggested yet. > > Paolo > > [1] > https://github.com/castagna/jena-grande/blob/3c2d9f85bacb737acd575e3e287dc0fcc6bd96b9/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankVertex.java >
Re: SimplePageRankVertex implementation, dangling nodes and sending messages to all nodes...
Paolo Castagna wrote: > Sebastian Schelter wrote: >> I guess that summing up and redistributing the pagerank of dangling >> vertices can also be done without an extra superstep in an aggregator. > > Yeah! Why didn't I think about that? > Thanks, great suggestion. > > I am going to give this a go, at first without extending RandomWalkVertex > since I want to see how it might work. > This would also inform design and improvements of the current > RandomWalkVertex. You can find a 'proper' ;-) implementation of PageRank with dangling nodes support, sum at the end of all PageRank values equals to 1.00 (since it's a probability distribution) and PageRank values validated against a third party implementation (i.e. JUNG), here [1]. I have not managed to do it as Sebastian suggested yet. Paolo [1] https://github.com/castagna/jena-grande/blob/3c2d9f85bacb737acd575e3e287dc0fcc6bd96b9/src/main/java/org/apache/jena/grande/giraph/pagerank/PageRankVertex.java