RE: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

2007-04-02 Thread Steven Parkes
I checked and there are escape sequences in there. If it was ever debatable, I think that tips it in favor of SAX. xerces? The contrib/gdata stuff seems to use it. I suppose if I'm careful and creative enough, we could share a lot of the code amongst benchmark ingesters that use XML, should there
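
For reference, a minimal sketch of the SAX-style streaming approach discussed here. It uses Python's standard `xml.sax` module rather than the Java/Xerces setup mentioned above, purely for brevity. The element names (`page`, `title`, `text`), the `PageHandler` class, and the `parse_pages` helper are assumptions for illustration, not code from the patch. The point is that the parser decodes escape sequences such as `&amp;` itself and hands pages over one at a time, so the whole file never has to sit in memory as it would with DOM.

```python
import io
import xml.sax


class PageHandler(xml.sax.ContentHandler):
    """Hypothetical handler: calls on_page(title, text) for each <page>."""

    def __init__(self, on_page):
        super().__init__()
        self._on_page = on_page
        self._field = None   # name of the element currently being captured
        self._buf = []
        self._title = None
        self._text = None

    def startElement(self, name, attrs):
        if name == "page":
            self._title = self._text = None
        elif name in ("title", "text"):
            self._field = name
            self._buf = []

    def characters(self, content):
        # The parser may deliver one element's text in several chunks,
        # so accumulate until the matching end tag arrives.
        if self._field is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == self._field:
            value = "".join(self._buf)
            if name == "title":
                self._title = value
            else:
                self._text = value
            self._field = None
        elif name == "page":
            # Hand the finished page to the caller; nothing is retained here.
            self._on_page(self._title, self._text)


def parse_pages(stream, on_page):
    """Stream-parse an XML dump from a file-like object."""
    xml.sax.parse(stream, PageHandler(on_page))


# Tiny made-up sample containing escape sequences.
SAMPLE = b"""<mediawiki>
  <page><title>AT&amp;T</title><text>a &lt;b&gt; c</text></page>
  <page><title>Lucene</title><text>search library</text></page>
</mediawiki>"""

pages = []
parse_pages(io.BytesIO(SAMPLE), lambda title, text: pages.append((title, text)))
print(pages)
```

In a real ingester the callback would write or index each article directly instead of appending to a list, which is what keeps memory use flat on a large dump.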

RE: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

2007-04-02 Thread Steven Parkes
Yes, indeed. May not be necessary initially, but we could support XPath or something down the road to allow us to specify what things I wouldn't worry about generalizing too much to start with. Once we have a couple collections then we can go that route. My thoughts, too. I've been

Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

2007-04-02 Thread Marvin Humphrey
On Apr 2, 2007, at 2:50 PM, Steven Parkes wrote: On the one hand, creating separate per-article files is clean in that when you then ingest, you only have disk i/o that's going to affect the ingest performance (as opposed to, say, uncompressing/parsing). On the other hand, that's a lot of

Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

2007-03-28 Thread Grant Ingersoll
On Mar 28, 2007, at 1:09 PM, Steven Parkes (JIRA) wrote: [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Steven Parkes updated LUCENE-848: Description: Add support for

Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

2007-03-28 Thread Doron Cohen
Grant Ingersoll wrote on 28/03/2007 10:44:08: On Mar 28, 2007, at 1:09 PM, Steven Parkes (JIRA) wrote: Question (for Doron and anyone else): the file is xml and it's big, so DOM isn't going to work. I could still use something SAX based but since the format is so