It might be worth noting that Freebase publishes a text-only extract of Wikipedia: http://download.freebase.com/wex/latest/

We could take a snapshot of that and host it somewhere as the new standard for benchmarking.
On Jan 31, 2011, at 2:20 PM, mikemcc...@apache.org wrote:

> Author: mikemccand
> Date: Mon Jan 31 19:20:34 2011
> New Revision: 1065719
>
> URL: http://svn.apache.org/viewvc?rev=1065719&view=rev
> Log:
> LUCENE-1591: rollback to old patched xercesImpl.jar to workaround
> XERCESJ-1257, which we hit on current Wikipedia XML export
> (enwiki-20110115-pages-articles.xml)
>
> Added:
>     lucene/dev/trunk/modules/benchmark/lib/xercesImpl-2.9.1-patched-XERCESJ-1257.jar (with props)
>     lucene/dev/trunk/modules/benchmark/lib/xml-apis-2.9.0.jar (with props)
> Removed:
>     lucene/dev/trunk/modules/benchmark/lib/xercesImpl-2.10.0.jar
>     lucene/dev/trunk/modules/benchmark/lib/xml-apis-2.10.0.jar
>
> Added: lucene/dev/trunk/modules/benchmark/lib/xercesImpl-2.9.1-patched-XERCESJ-1257.jar
> URL: http://svn.apache.org/viewvc/lucene/dev/trunk/modules/benchmark/lib/xercesImpl-2.9.1-patched-XERCESJ-1257.jar?rev=1065719&view=auto
> ==============================================================================
> Binary file - no diff available.
>
> Added: lucene/dev/trunk/modules/benchmark/lib/xml-apis-2.9.0.jar
> URL: http://svn.apache.org/viewvc/lucene/dev/trunk/modules/benchmark/lib/xml-apis-2.9.0.jar?rev=1065719&view=auto
> ==============================================================================
> Binary file - no diff available.