[ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508661 ]
Steven Parkes commented on LUCENE-848:
--------------------------------------

Actually, I just noticed wikimedia provides the md5 hashes, and I was able to validate my copy. I don't actually remember whether I got my copy from wikimedia or from p.a.o.

The copy in your ls -l looks bad, both from the sha1sum and from the size. It looks like your file is truncated: its length is 455M (if 477278208 is the size in bytes), while the real file is 2686431976 bytes (about 2.5G).

Can you check the file on p.a.o, both the size and the md5 hash? The latter should be fc24229da9af033cbb55b9867a950431 (http://download.wikimedia.org/enwiki/20070527/enwiki-20070527-md5sums.txt). A quick way to check both is sketched at the end of this message.

I should be able to launch a test of the unzip/extract tonight. It takes a while.

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt, WikipediaHarvester.java, xerces.jar, xerces.jar, xml-apis.jar
>
>
> Add support for using Wikipedia for benchmarking.
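Since the question above is exactly "check the size and the md5 hash", here is a minimal Java sketch of that check. The expected hash and byte count are the ones quoted in the comment; the dump filename (enwiki-20070527-pages-articles.xml.bz2), the class name, and the variable names are my assumptions for illustration, not anything shipped in contrib/benchmark.

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.math.BigInteger;
    import java.security.MessageDigest;

    public class CheckDump {
        public static void main(String[] args) throws Exception {
            // Hypothetical filename; the md5sums file linked above lists the real one.
            File f = new File(args.length > 0 ? args[0]
                    : "enwiki-20070527-pages-articles.xml.bz2");
            // Values quoted in the comment above.
            String wantMd5 = "fc24229da9af033cbb55b9867a950431";
            long wantSize = 2686431976L;

            System.out.println("size: " + f.length() + " (want " + wantSize + ")");

            // Stream the file through MD5 in 1 MB chunks; the dump is
            // far too large to read into memory in one shot.
            MessageDigest md = MessageDigest.getInstance("MD5");
            InputStream in = new FileInputStream(f);
            try {
                byte[] buf = new byte[1 << 20];
                int n;
                while ((n = in.read(buf)) != -1) {
                    md.update(buf, 0, n);
                }
            } finally {
                in.close();
            }
            String gotMd5 = String.format("%032x", new BigInteger(1, md.digest()));
            System.out.println("md5:  " + gotMd5 + " (want " + wantMd5 + ")");
            System.out.println(gotMd5.equals(wantMd5) && f.length() == wantSize
                    ? "OK" : "MISMATCH (likely truncated or corrupt download)");
        }
    }

Of course, plain md5sum and ls -l on p.a.o give the same answer with less typing; the sketch is just the same check expressed in the project's language.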