[ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12508833 ]
Steven Parkes commented on LUCENE-848:
--------------------------------------

Trying to reproduce now. Something that came up while restarting the fetch/decompress/etc. was the number of files this procedure creates. It's a lot: one for each article. I used the existing benchmark code for doing this stuff, but perhaps it's not a good idea at this scale? For one thing, it pretty much kills ant, since ant wants to walk the subtrees for some of its tasks. Either we need to exclude the work and temp directories from ant's walks, and/or we should come up with something better than one file per article. I think Mike mentioned not doing the one file per article. I'll try to look at that; a rough sketch of one possible approach is below, after the quoted issue details.

> Add support for Wikipedia English as a corpus in the benchmarker stuff
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-848
>                 URL: https://issues.apache.org/jira/browse/LUCENE-848
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/benchmark
>            Reporter: Steven Parkes
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt,
>                      LUCENE-848.txt, LUCENE-848.txt, LUCENE-848.txt,
>                      LUCENE-848.txt, WikipediaHarvester.java, xerces.jar,
>                      xerces.jar, xml-apis.jar
>
>
> Add support for using Wikipedia for benchmarking.
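For concreteness, here is a minimal sketch of the aggregated-file idea: append articles to a shared output file and roll over to a new file once a size threshold is hit. This is illustrative only, not from any attached patch; the class name, separator string, output file names, and threshold are all made up for the example.

    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;

    // Hypothetical sketch: write many articles per file instead of one
    // file per article, rolling to a new file at a size threshold.
    public class AggregatedArticleWriter {

      private static final long MAX_FILE_CHARS = 10 * 1024 * 1024; // roll at ~10M chars of text
      private static final String SEPARATOR = "\n==== ARTICLE ====\n"; // made-up record marker

      private final File dir;
      private int fileIndex = 0;
      private FileWriter out;
      private long written = 0;

      public AggregatedArticleWriter(File dir) throws IOException {
        this.dir = dir;
        openNext();
      }

      // Close the current output file (if any) and start the next one.
      private void openNext() throws IOException {
        if (out != null) out.close();
        out = new FileWriter(new File(dir, "articles-" + (fileIndex++) + ".txt"));
        written = 0;
      }

      // Append one article; start a new file when the current one is full.
      public void addArticle(String title, String body) throws IOException {
        if (written >= MAX_FILE_CHARS) openNext();
        String record = SEPARATOR + title + "\n" + body + "\n";
        out.write(record);
        written += record.length(); // counts chars, close enough for plain text
      }

      public void close() throws IOException {
        out.close();
      }
    }

Rolling by size would keep any single file manageable while cutting the file count from one per article down to a few hundred, which is what would make ant's directory walks cheap again.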