Using Lucene to index Wikipedia

Daniel Quach Thu, 20 Oct 2011 09:30:38 -0700

How do I use the Lucene Benchmark to index a Wikipedia dump? I want to be able to execute phrase queries on the latest English Wikipedia page dump. I've been looking for example use cases but haven't found any.

I downloaded the latest English dump, named enwiki-latest-pages-articles.xml.bz2.

Then I ran this command in the terminal:

java org.apache.lucene.benchmark.utils.ExtractWikipedia -i ~/enwiki-latest-pages-articles.xml.bz2


which I believe extracted the pages into a directory named "enwiki".

Now, is there something else in the benchmark module that I need to run in order to index the wiki? The README.enwiki does not really give me a clear set of instructions; in fact, I'm not even sure whether I was supposed to run the ExtractWikipedia class at all.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
