Re: WikipediaPageRank Data Set
The GraphX team has been using Wikipedia dumps from http://dumps.wikimedia.org/enwiki/. Unfortunately, these are in a less convenient format than the Freebase dumps. In particular, an article may span multiple lines, so more involved input parsing is required.

Dan Crankshaw (cc'd) wrote a driver that uses a Hadoop InputFormat XML parser from Mahout: see WikiPipelineBenchmark.scala (https://github.com/amplab/graphx/blob/860918486a81cb4c88a056a9b64b1f7d8b0ed5ff/graphx/src/main/scala/org/apache/spark/graphx/WikiPipelineBenchmark.scala#L157) and WikiArticle.scala (https://github.com/amplab/graphx/blob/860918486a81cb4c88a056a9b64b1f7d8b0ed5ff/graphx/src/main/scala/org/apache/spark/graphx/WikiArticle.scala).

However, we plan to upload a parsed version of this dataset to S3 for easier access from Spark and GraphX.

Ankur
http://www.ankurdave.com/

On 27 Mar, 2014, at 9:45 pm, Niko Stahl r.niko.st...@gmail.com wrote:

I would like to run the WikipediaPageRank (https://github.com/amplab/graphx/blob/f8544981a6d05687fa950639cb1eb3c31e9b6bf5/examples/src/main/scala/org/apache/spark/examples/bagel/WikipediaPageRank.scala) example, but the Wikipedia dump XML files are no longer available on Freebase. Does anyone know an alternative source for the data?
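To illustrate why articles that span multiple lines rule out simple line-by-line reading, here is a minimal single-machine sketch in Python. It is not the Mahout InputFormat driver referenced above, and it does not use Spark. It assumes the dump follows the usual MediaWiki export layout (a `<page>` element containing a `<title>` and a `<revision>` with a `<text>` child) and that links are written as `[[Target]]` or `[[Target|label]]`. The function names `open_dump`, `iter_pages`, and `iter_edges` are hypothetical, chosen only for this example.

```python
import bz2
import re
import xml.etree.ElementTree as ET

# Matches [[Target]], [[Target|label]] and [[Target#Section]];
# captures only the target page name.
LINK_RE = re.compile(r"\[\[([^\]|#]+)")


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def open_dump(path):
    """Open a .xml.bz2 dump as a binary stream without unpacking it to disk."""
    return bz2.open(path, "rb")


def iter_pages(stream):
    """Yield (title, wikitext) for each <page> element in an XML dump stream.

    The XML parser tracks element boundaries, so an article whose text
    spans many lines still arrives as one record.
    """
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the finished page's children to limit memory use


def iter_edges(stream):
    """Yield (source_title, target_title) for every wiki link found."""
    for title, text in iter_pages(stream):
        for target in LINK_RE.findall(text):
            yield title, target.strip()
```

Calling iter_edges(open_dump("enwiki-latest-pages-articles.xml.bz2")) would stream link pairs from a downloaded dump; the file name here is only an example. The sketch does not filter namespaces such as File: or Category:, resolve redirects, or normalize titles, all of which a real PageRank input would likely need.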
Re: WikipediaPageRank Data Set
In particular, we are using this dataset: http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Ankur
http://www.ankurdave.com/

On Sun, Mar 30, 2014 at 12:45 AM, Ankur Dave ankurd...@gmail.com wrote:

The GraphX team has been using Wikipedia dumps from http://dumps.wikimedia.org/enwiki/. Unfortunately, these are in a less convenient format than the Freebase dumps. In particular, an article may span multiple lines, so more involved input parsing is required.

Dan Crankshaw (cc'd) wrote a driver that uses a Hadoop InputFormat XML parser from Mahout: see WikiPipelineBenchmark.scala (https://github.com/amplab/graphx/blob/860918486a81cb4c88a056a9b64b1f7d8b0ed5ff/graphx/src/main/scala/org/apache/spark/graphx/WikiPipelineBenchmark.scala#L157) and WikiArticle.scala (https://github.com/amplab/graphx/blob/860918486a81cb4c88a056a9b64b1f7d8b0ed5ff/graphx/src/main/scala/org/apache/spark/graphx/WikiArticle.scala).

However, we plan to upload a parsed version of this dataset to S3 for easier access from Spark and GraphX.

Ankur
http://www.ankurdave.com/

On 27 Mar, 2014, at 9:45 pm, Niko Stahl r.niko.st...@gmail.com wrote:

I would like to run the WikipediaPageRank (https://github.com/amplab/graphx/blob/f8544981a6d05687fa950639cb1eb3c31e9b6bf5/examples/src/main/scala/org/apache/spark/examples/bagel/WikipediaPageRank.scala) example, but the Wikipedia dump XML files are no longer available on Freebase. Does anyone know an alternative source for the data?
WikipediaPageRank Data Set
Hello,

I would like to run the WikipediaPageRank (https://github.com/amplab/graphx/blob/f8544981a6d05687fa950639cb1eb3c31e9b6bf5/examples/src/main/scala/org/apache/spark/examples/bagel/WikipediaPageRank.scala) example, but the Wikipedia dump XML files are no longer available on Freebase. Does anyone know an alternative source for the data?

Thanks,
Niko