[ 
https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146731#comment-15146731
 ] 

Petar Zecevic commented on SPARK-13313:
---------------------------------------

Yes, you need articles.tsv and links.tsv from this archive: 
http://snap.stanford.edu/data/wikispeedia/wikispeedia_paths-and-graph.tar.gz

Then parse the data, assign IDs to article names and create the graph:
import org.apache.spark.graphx._
val articles = sc.textFile("articles.tsv", 6).filter(line => line.trim() != "" && !line.startsWith("#")).zipWithIndex().cache()
val links = sc.textFile("links.tsv", 6).filter(line => line.trim() != "" && !line.startsWith("#"))
// two joins against articles replace the source and destination names with their IDs
val linkIndexes = links.map(x => { val spl = x.split("\t"); (spl(0), spl(1)) }).join(articles).map(x => x._2).join(articles).map(x => x._2)
val wikigraph = Graph.fromEdgeTuples(linkIndexes, 0)

Then get strongly connected components:
val wikiSCC = wikigraph.stronglyConnectedComponents(100)

The wikiSCC graph contains 519 SCCs, but there should be many more. The largest 
SCC in wikiSCC has 4051 vertices, which is obviously wrong.
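
One way to sanity-check these numbers independently of GraphX is to run a 
plain, single-machine SCC algorithm over the same edge list. The following is 
only a minimal sketch, assuming the edges fit in driver memory; it uses an 
iterative form of Kosaraju's algorithm, and the names SccCheck and scc are 
illustrative, not part of Spark:

```scala
import scala.collection.mutable

// Hypothetical helper, not part of Spark or GraphX.
object SccCheck {
  /** Maps every vertex that appears in `edges` to a representative vertex of its SCC. */
  def scc(edges: Seq[(Long, Long)]): Map[Long, Long] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    // Forward and reversed adjacency lists; vertices without entries get an empty list.
    val out: Map[Long, Seq[Long]] =
      edges.groupBy(_._1).map { case (k, v) => k -> v.map(_._2) }.withDefaultValue(Seq.empty[Long])
    val in: Map[Long, Seq[Long]] =
      edges.groupBy(_._2).map { case (k, v) => k -> v.map(_._1) }.withDefaultValue(Seq.empty[Long])

    // Pass 1: depth-first search on the forward graph, recording finish order.
    // An explicit stack is used instead of recursion to avoid deep call stacks.
    val visited = mutable.Set.empty[Long]
    val finished = mutable.ArrayBuffer.empty[Long]
    for (root <- vertices if !visited(root)) {
      visited += root
      var stack = List((root, out(root).iterator))
      while (stack.nonEmpty) {
        val (v, it) = stack.head
        if (it.hasNext) {
          val w = it.next()
          if (!visited(w)) { visited += w; stack = (w, out(w).iterator) :: stack }
        } else {
          finished += v
          stack = stack.tail
        }
      }
    }

    // Pass 2: sweep the reversed graph in decreasing finish order.
    // Everything reached from a root belongs to that root's SCC.
    val label = mutable.Map.empty[Long, Long]
    for (root <- finished.reverseIterator if !label.contains(root)) {
      label(root) = root
      var stack = List(root)
      while (stack.nonEmpty) {
        val v = stack.head
        stack = stack.tail
        for (w <- in(v) if !label.contains(w)) { label(w) = root; stack = w :: stack }
      }
    }
    label.toMap
  }
}
```

With the snippet above, SccCheck.scc(linkIndexes.collect().toSeq) would give a 
vertex-to-component map; labels.values.toSet.size is then the number of SCCs 
and labels.values.groupBy(identity).values.map(_.size).max the size of the 
largest one. The representative is an arbitrary member of each component, so 
only the counts and sizes are comparable with the GraphX result, not the 
labels themselves.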

The change on line 89 that I mentioned seems to solve this problem, but then 
other issues arise (a stack overflow, for example) and I don't have time to 
investigate further. I hope someone will look into this.



> Strongly connected components doesn't find all strongly connected components
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-13313
>                 URL: https://issues.apache.org/jira/browse/SPARK-13313
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>    Affects Versions: 1.6.0
>            Reporter: Petar Zecevic
>
> The strongly connected components algorithm doesn't find all strongly 
> connected components. I was using the Wikispeedia dataset 
> (http://snap.stanford.edu/data/wikispeedia.html) and the algorithm found 519 
> SCCs, one of which had 4051 vertices that in reality don't have any edges 
> between them. 
> I think the problem could be on line 89 of the 
> StronglyConnectedComponents.scala file, where EdgeDirection.In should be 
> changed to EdgeDirection.Out. I believe the second Pregel call should use the 
> Out edge direction, the same as the first call, because the direction is 
> reversed in the provided sendMsg function (the message is sent to the source 
> vertex and not the destination vertex).
> If that is changed (line 89), the algorithm starts finding many more SCCs, 
> but eventually a stack overflow exception occurs. I believe graph objects 
> that are changed through iterations should not be cached, but checkpointed.



