[ https://issues.apache.org/jira/browse/SPARK-13313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15146731#comment-15146731 ]
Petar Zecevic commented on SPARK-13313:
---------------------------------------

Yes, you need articles.tsv and links.tsv from this archive:
http://snap.stanford.edu/data/wikispeedia/wikispeedia_paths-and-graph.tar.gz

Then parse the data, assign IDs to article names and create the graph:

import org.apache.spark.graphx._

// (articleName, articleId) pairs; blank lines and "#" comment lines are skipped
val articles = sc.textFile("articles.tsv", 6).filter(line => line.trim() != "" && !line.startsWith("#")).zipWithIndex().cache()
val links = sc.textFile("links.tsv", 6).filter(line => line.trim() != "" && !line.startsWith("#"))
// replace both article names in each link with their IDs, giving (srcId, dstId)
val linkIndexes = links.map(x => { val spl = x.split("\t"); (spl(0), spl(1)) }).join(articles).map(x => x._2).join(articles).map(x => x._2)
val wikigraph = Graph.fromEdgeTuples(linkIndexes, 0)

Then get strongly connected components:

val wikiSCC = wikigraph.stronglyConnectedComponents(100)

The wikiSCC graph contains 519 SCCs, but there should be many more. The largest SCC in wikiSCC has 4051 vertices, which is obviously wrong.

The change on line 89 that I mentioned seems to solve this problem, but then other issues arise (stack overflow etc.) and I don't have time to investigate further. I hope someone will look into this.

> Strongly connected components doesn't find all strongly connected components
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-13313
>                 URL: https://issues.apache.org/jira/browse/SPARK-13313
>             Project: Spark
>          Issue Type: Bug
>          Components: GraphX
>    Affects Versions: 1.6.0
>            Reporter: Petar Zecevic
>
> Strongly connected components algorithm doesn't find all strongly connected
> components. I was using Wikispeedia dataset
> (http://snap.stanford.edu/data/wikispeedia.html) and the algorithm found 519
> SCCs and one of them had 4051 vertices, which in reality don't have any edges
> between them.
> I think the problem could be on line 89 of StronglyConnectedComponents.scala
> file where EdgeDirection.In should be changed to EdgeDirection.Out.
> I believe the second Pregel call should use Out edge direction, the same as
> the first call, because the direction is reversed in the provided sendMsg
> function (the message is sent to the source vertex and not the destination
> vertex).
> If that is changed (line 89), the algorithm starts finding many more SCCs,
> but eventually a stack overflow exception occurs. I believe graph objects that
> are changed through iterations should not be cached, but checkpointed.
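
To cross-check the SCC counts reported in this issue without going through GraphX, a small reference implementation can be run on the driver. The following is a minimal sketch, not the GraphX code: it uses Kosaraju's algorithm in plain Scala with explicit stacks (so deep graphs do not overflow the JVM stack) and assumes the edge list fits in memory. The names SccReference and scc are hypothetical. Each vertex is labelled with the lowest vertex ID in its component, which is the labelling that stronglyConnectedComponents is documented to produce.

```scala
import scala.collection.mutable

object SccReference {
  /** Returns vertex -> component label (lowest vertex ID in the component). */
  def scc(edges: Seq[(Long, Long)]): Map[Long, Long] = {
    // Only vertices that appear in some edge, as with Graph.fromEdgeTuples.
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val out = edges.groupBy(_._1).map { case (k, v) => k -> v.map(_._2) }
      .withDefaultValue(Seq.empty[Long])
    val in = edges.groupBy(_._2).map { case (k, v) => k -> v.map(_._1) }
      .withDefaultValue(Seq.empty[Long])

    // Pass 1: iterative DFS over outgoing edges, recording finish order.
    val visited = mutable.Set.empty[Long]
    val finished = mutable.ArrayBuffer.empty[Long]
    for (root <- vertices if !visited(root)) {
      visited += root
      var stack: List[(Long, Iterator[Long])] = List((root, out(root).iterator))
      while (stack.nonEmpty) {
        val (v, it) = stack.head
        if (it.hasNext) {
          val w = it.next()
          if (!visited(w)) {
            visited += w
            stack = (w, out(w).iterator) :: stack
          }
        } else {
          finished += v
          stack = stack.tail
        }
      }
    }

    // Pass 2: in reverse finish order, walk incoming edges (the transposed
    // graph); everything reached from a root is one component.
    val comp = mutable.Map.empty[Long, Long]
    for (root <- finished.reverseIterator if !comp.contains(root)) {
      val members = mutable.ArrayBuffer(root)
      comp(root) = root
      var todo = List(root)
      while (todo.nonEmpty) {
        val v = todo.head
        todo = todo.tail
        for (w <- in(v) if !comp.contains(w)) {
          comp(w) = root
          members += w
          todo = w :: todo
        }
      }
      // Relabel the component with its lowest vertex ID.
      val label = members.min
      members.foreach(m => comp(m) = label)
    }
    comp.toMap
  }
}
```

Assuming the linkIndexes RDD from the comment is small enough to collect, SccReference.scc(linkIndexes.collect().toSeq).values.toSet.size gives a component count that can be compared with wikiSCC.vertices.map(_._2).distinct().count().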