On Apr 2, 2007, at 3:41 PM, Steven Parkes wrote:

I checked, and there are escape sequences in there. If it was ever
debatable, I think that tips it in favor of SAX. Xerces? The
contrib/gdata stuff seems to use it.

Xerces should be fine, I think.


I suppose if I'm careful and creative enough, we could share a lot of
the code amongst benchmark ingesters that use XML, should there be more
...


Yes, indeed. May not be necessary initially, but we could support XPath or something down the road to allow us to specify what things we are interested in. I wouldn't worry about generalizing too much to start with. Once we have a couple collections then we can go that route.

-----Original Message-----
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 28, 2007 10:44 AM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Updated: (LUCENE-848) Add supported for Wikipedia
English as a corpus in the benchmarker stuff


On Mar 28, 2007, at 1:09 PM, Steven Parkes (JIRA) wrote:


     [ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steven Parkes updated LUCENE-848:
---------------------------------

      Description: Add support for using Wikipedia for
benchmarking.  (was: Add support for using Wikipedia for
benchmarking. If no one is working on this, I'll start soon.)
    Lucene Fields:   (was: [New])
          Summary: Add supported for Wikipedia English as a corpus
in the benchmarker stuff  (was: Add supported for Wikipediea
English as a corpus in the benchmarker stuff)

Can't leave the typo in the title. It's bugging me.

Karl, it looks like your stuff grabs individual articles, right?
I'm going to have it download the bzip2 snapshots they provide (and
that they prefer you use, if you're getting much).

Question (for Doron and anyone else): the file is XML and it's big,
so DOM isn't going to work. I could still use something SAX-based,
but since the format is so tightly controlled, I'm thinking regular
expressions would be sufficient and have fewer dependencies. Anyone
have opinions on this?


Personally, I think SAX is the way to go, as you'll get handling of
escape sequences, etc. out of the box.  And it seems like it would be
easier to read/maintain?
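
To make the escape-sequence point concrete, here is a minimal sketch of
the SAX approach, using only the JAXP classes that ship with the JDK.
The class name WikiSaxSketch, the helper titles(), and the tiny sample
document are made up for illustration, and the element names
(mediawiki/page/title) are an assumption about the dump layout. This is
not the code attached to LUCENE-848.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: collect the text of every <title> element via SAX.
public class WikiSaxSketch {

    static class TitleHandler extends DefaultHandler {
        final List<String> titles = new ArrayList<>();
        private StringBuilder buf; // non-null only while inside a <title>

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes atts) {
            if ("title".equals(qName)) {
                buf = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // May be called several times per element, so append.
            if (buf != null) {
                buf.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("title".equals(qName)) {
                titles.add(buf.toString());
                buf = null;
            }
        }
    }

    // Parses the given XML and returns the title texts in document order.
    public static List<String> titles(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        TitleHandler handler = new TitleHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.titles;
    }

    public static void main(String[] args) throws Exception {
        // Sample input with escaped characters; element names are assumed.
        String xml = "<mediawiki><page>"
                + "<title>AT&amp;T &lt;company&gt;</title>"
                + "</page></mediawiki>";
        System.out.println(titles(xml)); // prints [AT&T <company>]
    }
}
```

The parser hands the handler already-unescaped text, which a regular
expression over the raw file would have to decode itself. For the real
dump, the StringReader would be replaced by a stream over the
decompressed snapshot, so memory use stays flat regardless of file size.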


Add supported for Wikipedia English as a corpus in the benchmarker stuff
------------------------------------------------------------------------

                Key: LUCENE-848
                URL: https://issues.apache.org/jira/browse/LUCENE-848
            Project: Lucene - Java
         Issue Type: New Feature
         Components: contrib/benchmark
           Reporter: Steven Parkes
        Assigned To: Steven Parkes
           Priority: Minor
            Fix For: 2.2

        Attachments: WikipediaHarvester.java


Add support for using Wikipedia for benchmarking.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/


