List Fellows:

Lacking any knowledge of JavaCC, I solicited help in hacking the HTMLParser.jj included in the demo. I withdraw that request, for two reasons: 1) I'm using other ideas gleaned from the list archives, and 2) I'm not prepared to dive into the world of compiler compilers. The mere sound of it is intimidating.
So, the bug. (If the bug is not worth fixing in the provided HTMLParser, drop another one in, like Quiotix's; I did.)

Summary: The current HTMLParser fails to correctly handle HTML decimal entities. Given this source:

    <title>MyWebsite&#8212;Home Page</title>
    <p>My website&#8217;s address is...</p>

the following is produced after indexing the HTML and performing a query:

    MyWebsite?Home Page
    My website?s address is...

Another problem is manifest in the following oddity. Given the following *source* (**note the use of the ampersand entity**):

    <title>MyWebsite&amp;#8212;Home Page</title>
    <p>My website&amp;#8217;s address is...</p>

this produces the output (where two dashes represent an em dash):

    MyWebsite--Home Page
    My website's address is...

And the source of the *results* appears correct, even though the source document that was indexed is incorrect! Some kind of entity replacement is occurring here:

    <title>MyWebsite&#8212;Home Page</title>
    <p>My website&#8217;s address is...</p>

(I ran across the latter oddity courtesy of Adobe GoLive's annoying syntax rewriter.)

Now, some might be asking, and rightly so, why hasn't this been seen before? I know a search in the archives didn't turn anything up. It's likely because the use of decimal entities is misunderstood by the HTML community at large. For instance, some, quite possibly a whole lot, use &#151; for an em dash; this is incorrect, as the whole range &#128; to &#159; is invalid. Second, many may use named entities. A named entity, e.g. &mdash;, is fine, but decimal encoding provides more consistent behavior cross-platform.

For more on this, read "The Trouble with EM 'n EN and Other Shady Characters" at A List Apart (www.alistapart.com/stories/emen/).

Yours in Lucene.
Tim
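
P.S. For anyone who would rather work around the problem than touch the grammar, here is a minimal sketch of decoding decimal entities in extracted text before it reaches the indexer. This is not the demo HTMLParser's code and not a patch to HTMLParser.jj; the class name DecimalEntityDecoder and the method decode are hypothetical names of my own. It assumes only decimal references of the form &#NNNN; need handling, and it leaves the invalid range 128 to 159 untouched rather than guessing what the author meant.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or its demo.
public class DecimalEntityDecoder {

    // Matches decimal numeric character references such as &#8212; (1 to 7 digits).
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d{1,7});");

    // True if the code point can be safely turned into a character.
    private static boolean isDecodable(int codePoint) {
        if (!Character.isValidCodePoint(codePoint)) {
            return false;                       // beyond U+10FFFF
        }
        if (codePoint >= 128 && codePoint <= 159) {
            return false;                       // the invalid range noted above
        }
        // Surrogate code points are not characters on their own.
        return codePoint < 0xD800 || codePoint > 0xDFFF;
    }

    // Replaces each decodable decimal reference with its character;
    // anything else is copied through unchanged.
    public static String decode(String text) {
        Matcher m = DECIMAL_ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            String replacement = isDecodable(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(decode("MyWebsite&#8212;Home Page"));
        System.out.println(decode("My website&#8217;s address is..."));
    }
}
```

Whether the "?" in the query results comes from the parser or from a later encoding step I can't say, so treat this as a way to test the theory, not as a diagnosis. Note that the sketch decodes only once: the doubly escaped form with the ampersand entity passes through unchanged.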