I have been experimenting with a couple of HTML parsers, primarily to compare performance, but have discovered a difference in the index for which I haven't, with assurance discovered the cause.
The difference is in the number of terms reported by Luke. The indexes created with the content parsed using JTidy generally have about 30% fewer terms than those created with content parsed using HTMLParser (htmlparser.org). The only difference I can discern (using debug logs and diff) is with the way entities are handled by the two parsers. Using JTidy, any HTML entities are converted to the literal character; using HTMLParser they are left as an entity (named or numeric). In the fields that are tokenized, entities not already converted are done so in the index, which leaves only the fields not tokenized. It does not seem likely to me that this could account for 30% of the terms indexed. Is it possible to use Luke (or some other tool) to make a more detailed comparison of the two indexes? I have tried to find a difference in the top terms indexed, and while the order of the top terms does change, the numbers do not. Am I missing something obvious? Thanks, -- Robert -------------------- Robert Watkins [EMAIL PROTECTED] -------------------- --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]