[ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486837 ]
Karl Wettin commented on LUCENE-848:
------------------------------------

> Karl, it looks like your stuff grabs individual articles, right? I'm going to
> have it download the bzip2 snapshots they provide (and that they prefer you
> use, if you're getting much).

They also supply the rendered HTML every now and then. It should be enough to change the URL pattern to file:///tmp/wikipedia/.

I was considering porting the MediaWiki BNF as a tokenizer, but found it much simpler to just parse the HTML.

> Add supported for Wikipedia English as a corpus in the benchmarker stuff
> ------------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>         Assigned To: Steven Parkes
>            Priority: Minor
>             Fix For: 2.2
>
>         Attachments: WikipediaHarvester.java
>
>
> Add support for using Wikipedia for benchmarking.
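
For illustration only, the idea in the comment above (read rendered HTML through a URL, which may just as well be a file:// URL, and reduce it to plain text instead of tokenizing MediaWiki markup) might look roughly like the sketch below. This is not the attached WikipediaHarvester.java, whose contents are not shown here. The class name HtmlTextSketch and both method names are hypothetical, and the regex-based tag stripping is a deliberately crude stand-in for a real HTML parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the attached WikipediaHarvester.java.
class HtmlTextSketch {

    /** Reads the resource behind a URL (http:, file:, ...) as UTF-8 text. */
    static String fetch(String url) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                URI.create(url).toURL().openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                sb.append(line).append('\n');
            }
        }
        return sb.toString();
    }

    /**
     * Crude HTML-to-text reduction: drops script/style blocks and comments,
     * replaces every remaining tag with a space, then collapses whitespace.
     * Character entities such as &amp; are left undecoded.
     */
    static String stripTags(String html) {
        return html
                .replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", " ")
                .replaceAll("(?s)<!--.*?-->", " ")
                .replaceAll("(?s)<[^>]*>", " ")
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) throws IOException {
        // With an argument, read that URL (for example a page saved under
        // file:///tmp/wikipedia/); otherwise fall back to an inline sample.
        String html = args.length > 0
                ? fetch(args[0])
                : "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>";
        System.out.println(stripTags(html));
    }
}
```

Because fetch() only depends on the URL scheme, switching between the live site and a local copy is a matter of changing the string passed in, which is the point made about the URL pattern. A real harvester would likely want a proper HTML parser, since this sketch splits words at inline tags and ignores entities.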