Re: Graphx seems to be broken while Creating a large graph(6B nodes in my case)

2014-08-25 Thread Ankur Dave
I posted the fix on the JIRA ticket 
(https://issues.apache.org/jira/browse/SPARK-3190). To update the user list, 
this is indeed an integer overflow problem when summing up the partition sizes. 
The fix is to use Longs for the sum: https://github.com/apache/spark/pull/2106.
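
As a rough illustration of that kind of overflow (a minimal sketch, not the GraphX source or the code in the pull request): the object and method names below are hypothetical, and a plain Scala collection stands in for the per-partition sizes. Each size fits in an Int, but the Int sum wraps silently, whereas widening each size to Long before adding keeps the total exact.

```scala
// Hypothetical sketch, not the GraphX implementation: a plain Seq stands in
// for the per-partition vertex counts.
object PartitionSizeSum {
  // Every individual size fits in an Int, but the Int sum wraps on overflow.
  def totalAsInt(partitionSizes: Seq[Int]): Int =
    partitionSizes.reduce(_ + _)

  // Widening each size to Long before adding keeps the total exact.
  def totalAsLong(partitionSizes: Seq[Int]): Long =
    partitionSizes.map(_.toLong).reduce(_ + _)

  def main(args: Array[String]): Unit = {
    // Three partitions of 1.5 billion vertices each (illustrative numbers only).
    val sizes = Seq(1500000000, 1500000000, 1500000000)
    println(s"Int sum:  ${totalAsInt(sizes)}")  // wrapped value
    println(s"Long sum: ${totalAsLong(sizes)}") // exact value
  }
}
```

With these illustrative sizes the true total is 4.5 billion, which exceeds Int.MaxValue (2147483647), so only the Long version reports it correctly.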

Ankur


-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Re: Graphx seems to be broken while Creating a large graph(6B nodes in my case)

2014-08-22 Thread Jeffrey Picard
I’m seeing this issue as well. I have a graph with 5828339535 vertices and 
7398447992 edges: graph.numVertices returns 1533266498, while graph.numEdges 
correctly returns 7398447992. I am also seeing connected components stop after 
one iteration and return an incorrect graph, which I’m beginning to suspect is 
caused by the same underlying problem.

On Aug 22, 2014, at 8:43 PM, npanj wrote:

> While creating a graph with 6B nodes and 12B edges, I noticed that the
> *'numVertices' API returns an incorrect result*; 'numEdges' reports the
> correct number. A few times (with different datasets of > 2.5B nodes) I have
> also noticed that numVertices is returned as a negative number, so I suspect
> that there is some overflow (maybe we are using Int for some field?).
> 
> Environment: Standalone mode running on EC2. Using the latest code from the
> master branch up to commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6.
> 
> Here are some details of the experiments I have done so far:
> 1. Input: numNodes=6101995593 ; noEdges=12163784626
> Graph returns: numVertices=1807028297 ; numEdges=12163784626
> 2. Input: numNodes=*2157586441* ; noEdges=2747322705
> Graph returns: numVertices=*-2137380855* ; numEdges=2747322705
> 3. Input: numNodes=1725060105 ; noEdges=204176821
> Graph returns: numVertices=1725060105 ; numEdges=2041768213
> 
> 
> You can find the code to reproduce this bug here:
> https://gist.github.com/npanj/92e949d86d08715bf4bf
> 
> (I have also filed this JIRA ticket:
> https://issues.apache.org/jira/browse/SPARK-3190)
> 





Graphx seems to be broken while Creating a large graph(6B nodes in my case)

2014-08-22 Thread npanj
While creating a graph with 6B nodes and 12B edges, I noticed that the
*'numVertices' API returns an incorrect result*; 'numEdges' reports the
correct number. A few times (with different datasets of > 2.5B nodes) I have
also noticed that numVertices is returned as a negative number, so I suspect
that there is some overflow (maybe we are using Int for some field?).

Environment: Standalone mode running on EC2. Using the latest code from the
master branch up to commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6.

Here are some details of the experiments I have done so far:
1. Input: numNodes=6101995593 ; noEdges=12163784626
Graph returns: numVertices=1807028297 ; numEdges=12163784626
2. Input: numNodes=*2157586441* ; noEdges=2747322705
Graph returns: numVertices=*-2137380855* ; numEdges=2747322705
3. Input: numNodes=1725060105 ; noEdges=204176821
Graph returns: numVertices=1725060105 ; numEdges=2041768213
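
The numVertices values in the three experiments above are consistent with a 32-bit wrap-around, which can be checked with a few lines of Scala. This is only a sketch of the arithmetic, not a reproduction of the bug: the object and method names are hypothetical, and it assumes the true count is accumulated somewhere in a signed 32-bit Int, so that only the low 32 bits survive.

```scala
// Hypothetical arithmetic check, independent of Spark: narrow the true
// vertex count to 32 bits and compare with what the graph reported.
object WrapCheck {
  // Long.toInt keeps only the low 32 bits, which is the value an
  // overflowing Int accumulator would end up holding.
  def wrapToInt(trueCount: Long): Int = trueCount.toInt

  def main(args: Array[String]): Unit = {
    // (input numNodes, reported numVertices) from experiments 1 to 3.
    val experiments = Seq(
      (6101995593L, 1807028297L),
      (2157586441L, -2137380855L),
      (1725060105L, 1725060105L)
    )
    for ((numNodes, reported) <- experiments) {
      val wrapped = wrapToInt(numNodes)
      println(s"numNodes=$numNodes wraps to $wrapped; matches reported: ${wrapped == reported}")
    }
  }
}
```

In experiment 1 the reported value equals the input minus 2^32 (4294967296); in experiment 2 the same subtraction gives the negative number; in experiment 3 the count is below Int.MaxValue, so it is unchanged.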


You can find the code to reproduce this bug here:
https://gist.github.com/npanj/92e949d86d08715bf4bf

(I have also filed this JIRA ticket:
https://issues.apache.org/jira/browse/SPARK-3190)





--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/Graphx-seems-to-be-broken-while-Creating-a-large-graph-6B-nodes-in-my-case-tp7966.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
