Re: WikipediaPageRank Data Set

2014-03-30 Thread Ankur Dave
The GraphX team has been using Wikipedia dumps from
http://dumps.wikimedia.org/enwiki/. Unfortunately, these are in a less
convenient format than the Freebase dumps. In particular, an article may
span multiple lines, so more involved input parsing is required.

Dan Crankshaw (cc'd) wrote a driver that uses a Hadoop InputFormat XML
parser from Mahout: see
WikiPipelineBenchmark.scala (https://github.com/amplab/graphx/blob/860918486a81cb4c88a056a9b64b1f7d8b0ed5ff/graphx/src/main/scala/org/apache/spark/graphx/WikiPipelineBenchmark.scala#L157) and
WikiArticle.scala (https://github.com/amplab/graphx/blob/860918486a81cb4c88a056a9b64b1f7d8b0ed5ff/graphx/src/main/scala/org/apache/spark/graphx/WikiArticle.scala).
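
To illustrate why line-by-line reading is not enough here, below is a minimal sketch of record splitting on start/end tags. This is not the Mahout parser or the GraphX driver, only a small standalone Python example of the same general idea. The helper name iter_pages and the sample input are made up for illustration; the <page>, <title> and [[...]] conventions are those of the MediaWiki export format.

```python
import re
from typing import Iterator, List, Tuple

# A record is everything between <page> and </page>, even across newlines.
PAGE_RE = re.compile(r"<page>(.*?)</page>", re.DOTALL)
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL)
# Capture the link target of [[Target]] or [[Target|label]], dropping any #section.
LINK_RE = re.compile(r"\[\[([^\]|#]+)")


def iter_pages(xml_text: str) -> Iterator[Tuple[str, List[str]]]:
    """Yield (title, outgoing link targets) for each <page> record.

    Hypothetical helper for illustration only. It expects the whole input
    as one string, so it suits small samples, not a full dump.
    """
    for match in PAGE_RE.finditer(xml_text):
        body = match.group(1)
        title_match = TITLE_RE.search(body)
        if title_match is None:
            continue  # skip records without a title
        title = title_match.group(1).strip()
        links = [target.strip() for target in LINK_RE.findall(body)]
        yield title, links


# Made-up sample: the first article's text spans two lines.
sample = """<mediawiki>
<page>
  <title>A</title>
  <text>See [[B]] and
also [[C|the C article]].</text>
</page>
<page>
  <title>B</title>
  <text>Back to [[A]].</text>
</page>
</mediawiki>"""

pages = list(iter_pages(sample))
for title, links in pages:
    print(title, links)
```

A line-oriented reader would see the first article's text as two unrelated records; splitting on the page tags keeps it together. For the real dump, which is far too large to hold in one string, the same splitting would have to be done in a streaming or distributed fashion, which is the role the InputFormat plays in the driver above.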

However, we plan to upload a parsed version of this dataset to S3 for
easier access from Spark and GraphX.

Ankur http://www.ankurdave.com/

On 27 Mar, 2014, at 9:45 pm, Niko Stahl r.niko.st...@gmail.com wrote:

I would like to run the
WikipediaPageRank (https://github.com/amplab/graphx/blob/f8544981a6d05687fa950639cb1eb3c31e9b6bf5/examples/src/main/scala/org/apache/spark/examples/bagel/WikipediaPageRank.scala) example,
but the Wikipedia dump XML files are no longer available on
Freebase. Does anyone know an alternative source for the data?



Re: WikipediaPageRank Data Set

2014-03-30 Thread Ankur Dave
In particular, we are using this dataset:
http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Ankur http://www.ankurdave.com/







WikipediaPageRank Data Set

2014-03-27 Thread Niko Stahl
Hello,

I would like to run the
WikipediaPageRank (https://github.com/amplab/graphx/blob/f8544981a6d05687fa950639cb1eb3c31e9b6bf5/examples/src/main/scala/org/apache/spark/examples/bagel/WikipediaPageRank.scala) example,
but the Wikipedia dump XML files are no longer available on
Freebase. Does anyone know an alternative source for the data?

Thanks,
Niko