The parsing I do currently is pretty straightforward. There are only four
tags I look for (and one of those tags typically encompasses most of the
file). SAX works great, though I'm not stuck on using Xerces. In the
short run, 25 milliseconds is quite acceptable (where, for obvious
reasons, 1.2 seconds was not). In the long run, it sounds like I need to
look at some other options besides Xerces.
Another thing I noticed while doing this is that the Xerces SAX interface
tends to pass small blocks of characters (typically around 50 characters) on
each characters() callback, even when there are several thousand bytes of
character data in the tag. Currently, I add each block of characters to the
Document separately. This means I often end up with 100 or more items on the
Document linked list for the same field. When I get some time, I would like
to see if things work faster if I accumulate these into a StringBuffer and
pass them to the Document as one large block instead of a lot of little
blocks.
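For what it's worth, here is a rough sketch of the accumulation idea,
untested against my real handler. The class name BufferingHandler, the
parse() helper, and the "body" tag in the usage note are made up for
illustration. It uses StringBuilder, though StringBuffer should work the
same way on an older JVM. The Lucene side is left as a comment because I
haven't tried it there yet.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the full text of one tag into a single
// string instead of handing each small characters() chunk on separately.
class BufferingHandler extends DefaultHandler {
    private final String targetTag;                 // tag to capture (assumed name)
    private final List<String> fields = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean capturing = false;

    BufferingHandler(String targetTag) {
        this.targetTag = targetTag;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals(targetTag)) {
            capturing = true;
            text.setLength(0);                      // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may call this many times per element; just append.
        if (capturing) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals(targetTag)) {
            // This is the point where the one large block would be
            // added to the Lucene Document as a single field.
            fields.add(text.toString());
            capturing = false;
        }
    }

    List<String> getFields() {
        return fields;
    }

    // Convenience helper for trying the handler on an in-memory string.
    static List<String> parse(String xml, String tag) throws Exception {
        BufferingHandler handler = new BufferingHandler(tag);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getFields();
    }
}
```

Calling BufferingHandler.parse(xml, "body") should give back one string per
body element, however many chunks the parser delivered. The sketch assumes
the captured tag does not nest inside itself.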
Thanks for all of the suggestions.
Scott
-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 08, 2004 5:24 AM
To: Lucene Users List
Subject: Re: Performance question
Dror Matalon wrote:
>On Wed, Jan 07, 2004 at 07:24:22PM -0700, Scott Smith wrote:
>
>
>>After two rather frustrating days, I find I need to apologize to
>>Lucene. My last run of 225 messages averaged around 25 milliseconds
>>per message--that's parsing the xml, creating the Document, and
>>putting it in the index (2.5Ghz cpu, 1G ram). Turns out the
>>performance problem was xerces sax "helping me" by loading the DTD
>>before it parsed each message and the DTD wasn't local to our site.
>>After seeing Terry's response, I knew there had to be more going on
>>than what I was assuming.
>>
>>Thanks for the suggestions. I wonder how much faster I can go if I
>>implement some of those?
>>
>>
>
>25 msecs to insert a document is on the high side, but it depends of
>course on the size of your document. You're probably spending 90% of
>your time in the XML parsing. I believe that there are other parsers
>that are faster than xerces, you might want to look at these. You might
>want to look at http://dom4j.org/.
>
>Dror
>
>
>
You may want to check the XML Pull Parser - it offers something between
SAX and DOM, with performance similar to SAX.
(http://www.extreme.indiana.edu/xgws/xsoap/xpp)
--
Best regards,
Andrzej Bialecki
-
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-
FreeBSD developer (http://www.freebsd.org)
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]