Michal Pryc wrote:
> Hello,
> I've created a small Java application which grabs some text from
> Wikipedia (MediaWiki), as was requested on the tracker list, and
> posted it on my rarely updated blog:
> 
> http://blogs.sun.com/migi/entry/wikipedia_for_indexers_testing

I've not looked at your code in detail, but it looks like it crawls 
Wikipedia for data.  This is strongly discouraged by the Wikipedia 
admins.  See:
http://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler

Instead, you should download one of the data dumps that Wikipedia 
specifically makes available for things like this.  The dumps come in 
a simple XML format which is pretty easy to parse (I have a Python 
script lying around somewhere which does this, but it's easy to write 
your own too).  There's also a Perl library for parsing them, linked 
from the Wikipedia:Database_download page.
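
For what it's worth, here's a rough, untested sketch in Java of how you 
might stream through a dump with a StAX pull parser (javax.xml.stream, 
standard in Java 6).  This isn't the script I mentioned; the class name 
DumpReader and the assumption that the file is already decompressed are 
just for illustration.  Each <page> element in the dump carries a 
<title> and, inside <revision>, a <text> element holding the wikitext:

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class DumpReader {
        public static void main(String[] args) throws Exception {
            // Assumes the dump has already been decompressed, e.g.
            //   bunzip2 enwiki-20061130-pages-articles.xml.bz2
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader xml =
                factory.createXMLStreamReader(new FileInputStream(args[0]));

            String title = null;
            while (xml.hasNext()) {
                if (xml.next() == XMLStreamConstants.START_ELEMENT) {
                    String name = xml.getLocalName();
                    if (name.equals("title")) {
                        // Remember the page title for the <text> that follows.
                        title = xml.getElementText();
                    } else if (name.equals("text")) {
                        String wikitext = xml.getElementText();
                        // Hand the article off to the indexer here; the
                        // sketch just prints a summary line per page.
                        System.out.println(title + ": "
                            + wikitext.length() + " chars of wikitext");
                    }
                }
            }
            xml.close();
        }
    }

Stopping the loop after the first few thousand pages is also an easy 
way to carve out a smaller test set.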

The dump files are pretty huge; for example, the download of the 
English Wikipedia's current pages is 1.9GB compressed, at:
http://download.wikimedia.org/enwiki/20061130/enwiki-20061130-pages-articles.xml.bz2
I've been using this dataset to run performance tests on the Xapian 
search engine, but you might want to use a subset of the data to make 
your own tests easier to run.

Apologies if your script already uses one of these downloads.


That said, Wikipedia data makes an excellent test set in my opinion, 
so go for it, but don't annoy the Wikipedia admins in the process. :)

--
Richard
_______________________________________________
Dashboard-hackers mailing list
Dashboard-hackers@gnome.org
http://mail.gnome.org/mailman/listinfo/dashboard-hackers
