I checked and there are escape sequences in there. If it was ever
debatable, I think that tips it in favor of SAX. xerces? The
contrib/gdata stuff seems to use it.
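
For what it's worth, here is a minimal sketch of the SAX route, not the actual ingester. It assumes the JDK's built-in JAXP SAX parser rather than a separate xerces jar, and a hypothetical `title` element; the class and method names (`ArticleSaxSketch`, `parseTitles`) are made up for illustration. The point it shows is that the parser resolves escape sequences such as `&amp;` itself, and that `characters()` may deliver one element's text in several chunks, so the handler has to buffer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ArticleSaxSketch {

    // Hypothetical element name; the real file format may use something else.
    private static final String TITLE = "title";

    /** Streams through the XML and collects the text of every title element. */
    static List<String> parseTitles(InputStream in) {
        final List<String> titles = new ArrayList<>();

        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();
            private boolean inTitle;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (TITLE.equals(qName)) {
                    inTitle = true;
                    text.setLength(0); // start a fresh buffer for this element
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element (for example around
                // an escape sequence), so append instead of assigning.
                if (inTitle) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (TITLE.equals(qName)) {
                    titles.add(text.toString());
                    inTitle = false;
                }
            }
        };

        try {
            // The default factory is not namespace-aware, so qName carries
            // the plain element name.
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(in, handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XML parse failed", e);
        }
        return titles;
    }
}
```

Only one element's text is held in memory at a time, which is why this scales to a big file where DOM would not. A real ingester would hand each record off as it completes rather than collecting everything in a list.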
I suppose if I'm careful and creative enough, we could share a lot of
the code amongst benchmark ingesters that use XML, should there
Yes, indeed. May not be necessary initially, but we could support
XPath or something down the road to allow us to specify what things
I wouldn't worry about generalizing too much
to start with. Once we have a couple collections then we can go that
route.
My thoughts, too.
I've been
On Apr 2, 2007, at 2:50 PM, Steven Parkes wrote:
On the one hand, creating separate per-article files is clean in that
when you then ingest, you only have disk i/o that's going to affect the
ingest performance (as opposed to, say, uncompressing/parsing). On the
other hand, that's a lot of
On Mar 28, 2007, at 1:09 PM, Steven Parkes (JIRA) wrote:
[ https://issues.apache.org/jira/browse/LUCENE-848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steven Parkes updated LUCENE-848:
Description: Add support for
Grant Ingersoll [EMAIL PROTECTED] wrote on 28/03/2007 10:44:08:
On Mar 28, 2007, at 1:09 PM, Steven Parkes (JIRA) wrote:
Question (for Doron and anyone else): the file is xml and it's big,
so DOM isn't going to work. I could still use something SAX based
but since the format is so