I checked, and the file does contain escape sequences. If it was ever debatable, I think that tips it in favor of SAX. Xerces? The contrib/gdata stuff seems to use it.
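To make the escape-sequence point concrete, here is a minimal sketch of what a SAX-based reader might look like. It is only an illustration, not the ingester itself. It assumes the JAXP SAX API bundled with the JDK rather than a separately installed Xerces, and the class name WikiSaxSketch, the nested PageHandler, and the element names (mediawiki, page, title, revision, text) are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: class, handler, and element names are illustrative only.
public class WikiSaxSketch {

    /** Collects the character data of each <title> and <text> element. */
    static class PageHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        final List<String> bodies = new ArrayList<>();
        private final StringBuilder buf = new StringBuilder();
        private boolean capture = false;

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // The default factory is not namespace-aware, so qName holds the tag name.
            if (qName.equals("title") || qName.equals("text")) {
                buf.setLength(0);
                capture = true;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one element's text in several chunks,
            // so append instead of assigning. Entities arrive already decoded.
            if (capture) {
                buf.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (qName.equals("title")) {
                titles.add(buf.toString());
                capture = false;
            } else if (qName.equals("text")) {
                bodies.add(buf.toString());
                capture = false;
            }
        }
    }

    /** Parses an XML string and returns the handler holding the collected text. */
    static PageHandler parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        PageHandler handler = new PageHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in for a dump; a real snapshot would be streamed from the bzip2 file.
        String xml = "<mediawiki><page><title>AT&amp;T</title>"
                   + "<revision><text>a &lt;b&gt; c</text></revision></page></mediawiki>";
        PageHandler h = parse(xml);
        System.out.println(h.titles.get(0));
        System.out.println(h.bodies.get(0));
    }
}
```

The handler never sees &amp;amp; or &amp;lt; as such, because the parser decodes them before calling characters(). A regular-expression approach would have to redo that unescaping by hand. The parser also streams, so memory use does not grow with the size of the file the way DOM would.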
I suppose if I'm careful and creative enough, we could share a lot of the code amongst benchmark ingesters that use XML, should there be more ...

-----Original Message-----
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 28, 2007 10:44 AM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia English as a corpus in the benchmarker stuff

On Mar 28, 2007, at 1:09 PM, Steven Parkes (JIRA) wrote:

>
> [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Steven Parkes updated LUCENE-848:
> ---------------------------------
>
> Description: Add support for using Wikipedia for benchmarking. (was: Add support for using Wikipedia for benchmarking. If no one is working on this, I'll start soon.)
> Lucene Fields: (was: [New])
> Summary: Add supported for Wikipedia English as a corpus in the benchmarker stuff (was: Add supported for Wikipediea English as a corpus in the benchmarker stuff)
>
> Can't leave the typo in the title. It's bugging me.
>
> Karl, it looks like your stuff grabs individual articles, right?
> I'm going to have it download the bzip2 snapshots they provide (and
> that they prefer you use, if you're getting much).
>
> Question (for Doron and anyone else): the file is XML and it's big,
> so DOM isn't going to work. I could still use something SAX based
> but since the format is so tightly controlled, I'm thinking regular
> expressions would be sufficient and have fewer dependencies. Anyone
> have opinions on this?

Personally, I think SAX is the way to go, as you'll get handling of escape sequences, etc. out of the box. And it seems like it is easier to read/maintain?
>
>> Add supported for Wikipedia English as a corpus in the benchmarker stuff
>> ------------------------------------------------------------------------
>>
>> Key: LUCENE-848
>> URL: https://issues.apache.org/jira/browse/LUCENE-848
>> Project: Lucene - Java
>> Issue Type: New Feature
>> Components: contrib/benchmark
>> Reporter: Steven Parkes
>> Assigned To: Steven Parkes
>> Priority: Minor
>> Fix For: 2.2
>>
>> Attachments: WikipediaHarvester.java
>>
>>
>> Add support for using Wikipedia for benchmarking.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/