Is there a way to pick a specific day, versus "latest"? How long
does Wikipedia keep its archives? Always using the latest makes
comparisons more difficult. I wonder if licensing terms would allow
us to host the version from a specific date on Lucene zones. Of
course, that may not be a good idea bandwidth-wise. I'm open to
suggestions. Maybe using the latest isn't that big of a deal.
On Apr 24, 2007, at 2:45 PM, Steven Parkes (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12491396 ]
Steven Parkes commented on LUCENE-848:
--------------------------------------
Yeah, it takes a while to download.
I added the jars since that's what we've been doing elsewhere. In
fact, xerces is in gdata-server too. Personally, the size isn't an
issue for me; don't know about others. What might be difficult,
though, is trying to share the two since that would mean
coordinating contrib projects, and I don't know anything about the
gdata server. I can tell you that if you want to support both 1.4
and 1.5 on something as big as Wikipedia, there is sensitivity to
the xerces revision.
Sorry about the download problem, Grant. I actually documented that
in a readme ... that I can no longer find. I would swear I put it in
the patch, but obviously I didn't, because it's not there. Now I have
to go find it.
The short answer is you want to download
http://download.wikimedia.org/enwiki/20070402/enwiki-20070402-pages-articles.xml.bz2
The Wikipedia download site isn't always clean: it doesn't always
have files where they "should" be. It was clean when I first started
this, but it isn't now.
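For pinning a specific dump date rather than "latest", here is a minimal
sketch of how the dated URL could be built. It assumes other dumps follow
the same path layout as the 20070402 URL above, which the download site
does not guarantee; the function name dump_url is hypothetical.

```python
from datetime import date

# Base taken from the URL quoted above. That other dates use the same
# layout is an assumption, not something the download site promises.
BASE_URL = "http://download.wikimedia.org/enwiki"


def dump_url(dump_date: date) -> str:
    """Build the pages-articles dump URL for a given dump date."""
    stamp = dump_date.strftime("%Y%m%d")  # e.g. 20070402
    return f"{BASE_URL}/{stamp}/enwiki-{stamp}-pages-articles.xml.bz2"


if __name__ == "__main__":
    print(dump_url(date(2007, 4, 2)))
```

Since files are not always where they "should" be, it would be wise to
check that the URL exists before relying on it.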
Add support for Wikipedia English as a corpus in the benchmarker
stuff
------------------------------------------------------------------------
Key: LUCENE-848
URL: https://issues.apache.org/jira/browse/LUCENE-848
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/benchmark
Reporter: Steven Parkes
Assigned To: Grant Ingersoll
Priority: Minor
Fix For: 2.2
Attachments: LUCENE-848.txt, LUCENE-848.txt,
LUCENE-848.txt, LUCENE-848.txt, WikipediaHarvester.java,
xerces.jar, xerces.jar, xml-apis.jar
Add support for using Wikipedia for benchmarking.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/