Re: Graphx seems to be broken while Creating a large graph(6B nodes in my case)
I posted the fix on the JIRA ticket (https://issues.apache.org/jira/browse/SPARK-3190). To update the user list: this is indeed an integer overflow problem when summing up the partition sizes. The fix is to use Longs for the sum: https://github.com/apache/spark/pull/2106.

Ankur

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
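For readers following the thread, here is a minimal sketch of the failure mode described in the reply above: adding per-partition sizes as Ints wraps around once the total exceeds Int.MaxValue, while widening each size to Long first does not. This is not the code from the pull request; the object name, method names, and partition sizes are hypothetical and chosen only for illustration.

```scala
object PartitionSizeSum {
  // Hypothetical per-partition vertex counts. Each one fits in an Int,
  // but their true total (4,500,000,000) does not.
  val partitionSizes: Seq[Int] = Seq(1500000000, 1500000000, 1500000000)

  // Overflow-prone pattern: Int addition silently wraps modulo 2^32.
  def sumAsInt(sizes: Seq[Int]): Int = sizes.sum

  // Safer pattern: widen every size to Long before adding.
  def sumAsLong(sizes: Seq[Int]): Long = sizes.map(_.toLong).sum

  def main(args: Array[String]): Unit = {
    println(s"Int sum:  ${sumAsInt(partitionSizes)}")  // wrapped value: 205032704
    println(s"Long sum: ${sumAsLong(partitionSizes)}") // true total: 4500000000
  }
}
```

The wrapped Int result looks like a plausible count, which is presumably why the problem only becomes obvious when the value turns negative or the true size is known in advance.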
Re: Graphx seems to be broken while Creating a large graph(6B nodes in my case)
I’m seeing this issue as well. I have a graph with 5828339535 vertices and 7398447992 edges; graph.numVertices returns 1533266498, while graph.numEdges correctly returns 7398447992.

I am also having an issue, which I’m beginning to suspect is caused by the same underlying problem, where connected components stops after one iteration and returns an incorrect graph.

On Aug 22, 2014, at 8:43 PM, npanj wrote:

> While creating a graph with 6B nodes and 12B edges, I noticed that the
> *'numVertices' API returns an incorrect result*; 'numEdges' reports the
> correct number. A few times (with different datasets of > 2.5B nodes) I have
> also noticed that numVertices is returned as a negative number, so I suspect
> that there is some overflow (maybe we are using Int for some field?).
>
> Environment: Standalone mode running on EC2. Using the latest code from the
> master branch up to commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6.
>
> Here are some details of the experiments I have done so far:
> 1. Input: numNodes=6101995593 ; noEdges=12163784626
>    Graph returns: numVertices=1807028297 ; numEdges=12163784626
> 2. Input: numNodes=*2157586441* ; noEdges=2747322705
>    Graph returns: numVertices=*-2137380855* ; numEdges=2747322705
> 3. Input: numNodes=1725060105 ; noEdges=204176821
>    Graph returns: numVertices=1725060105 ; numEdges=2041768213
>
> You can find the code to reproduce this bug here:
> https://gist.github.com/npanj/92e949d86d08715bf4bf
>
> (I have also filed this JIRA ticket:
> https://issues.apache.org/jira/browse/SPARK-3190)
Graphx seems to be broken while Creating a large graph(6B nodes in my case)
While creating a graph with 6B nodes and 12B edges, I noticed that the *'numVertices' API returns an incorrect result*; 'numEdges' reports the correct number. A few times (with different datasets of > 2.5B nodes) I have also noticed that numVertices is returned as a negative number, so I suspect that there is some overflow (maybe we are using Int for some field?).

Environment: Standalone mode running on EC2. Using the latest code from the master branch up to commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6.

Here are some details of the experiments I have done so far:

1. Input: numNodes=6101995593 ; noEdges=12163784626
   Graph returns: numVertices=1807028297 ; numEdges=12163784626
2. Input: numNodes=*2157586441* ; noEdges=2747322705
   Graph returns: numVertices=*-2137380855* ; numEdges=2747322705
3. Input: numNodes=1725060105 ; noEdges=204176821
   Graph returns: numVertices=1725060105 ; numEdges=2041768213

You can find the code to reproduce this bug here: https://gist.github.com/npanj/92e949d86d08715bf4bf

(I have also filed this JIRA ticket: https://issues.apache.org/jira/browse/SPARK-3190)

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Graphx-seems-to-be-broken-while-Creating-a-large-graph-6B-nodes-in-my-case-tp7966.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
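As a quick sanity check on the overflow suspicion in the report above, the sketch below truncates each expected node count from the three experiments to 32 bits and prints it next to the numVertices value that was reported. It only does arithmetic on the numbers quoted in the message and does not touch GraphX. The object and method names are hypothetical.

```scala
object WraparoundCheck {
  // Keep only the low 32 bits of a 64-bit count, interpreted as a signed Int.
  // A sum that overflows in Int arithmetic ends up at this same value.
  def asInt32(count: Long): Int = count.toInt

  def main(args: Array[String]): Unit = {
    // (numNodes given as input, numVertices reported by the graph)
    val experiments = Seq(
      (6101995593L, 1807028297),
      (2157586441L, -2137380855),
      (1725060105L, 1725060105)
    )
    for ((expected, observed) <- experiments) {
      val wrapped = asInt32(expected)
      println(s"$expected -> $wrapped (reported $observed, match = ${wrapped == observed})")
    }
  }
}
```

All three reported values agree with 32-bit truncation, including the third experiment, where the count is below Int.MaxValue and therefore comes back unchanged.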