Thanks, Grant. You mean this issue: https://issues.apache.org/jira/browse/SOLR-42? I see now. I guess this is a problem only for me, because I use HTMLStripReader independently of the Lucene architecture. Since the class is public, would it make sense for me to provide a patch that switches the whitespace-emitting behavior on and off, depending on the use case?

Dawid

Grant Ingersoll wrote:
It is an attempt at making things work properly with the highlighter (such that offsets are correct). I believe it works most of the time, but there might still be a few issues; check JIRA.

-Grant

On Nov 21, 2008, at 5:29 PM, Dawid Weiss wrote:


Hi folks. What's the motivation for adding white space after a decoded entity in HTMLStripReader (seemingly as many spaces as needed to match the length of the original entity)? It basically looks like this:

"l&oacute;d"

(UTF: lód, "ice" in Polish) is translated into:

"ló       d"

This happens both with numeric entities and named entities. Needless to say, these added spaces in the character stream do no good as they effectively split a single term "lód" into two meaningless terms "l" and "d".
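To make the behaviour above concrete, here is a minimal Java sketch of entity decoding with optional padding. It is not the HTMLStripReader code: the class name EntityPaddingSketch, the decode method, the padToPreserveOffsets flag and the two-entry entity table are all hypothetical, made up for illustration. It assumes the padding exists so that the output keeps the same length as the input, which would keep character offsets after the entity aligned with the original text.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityPaddingSketch {
    // Tiny illustrative entity table (hypothetical); a real stripper knows many more.
    private static final Map<String, Character> NAMED =
            Map.of("oacute", '\u00F3', "amp", '&');

    // Matches numeric (&#243;) and named (&oacute;) entities; BMP code points only.
    private static final Pattern ENTITY = Pattern.compile("&(#\\d{1,5}|[A-Za-z]+);");

    static String decode(String input, boolean padToPreserveOffsets) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            String body = m.group(1);
            Character decoded = body.startsWith("#")
                    ? Character.valueOf((char) Integer.parseInt(body.substring(1)))
                    : NAMED.get(body);
            out.append(input, last, m.start());
            if (decoded == null) {
                out.append(m.group()); // unknown entity: copy through unchanged
            } else {
                out.append(decoded.charValue());
                if (padToPreserveOffsets) {
                    // Pad so the replacement is as long as the entity it replaced.
                    for (int i = 1; i < m.group().length(); i++) {
                        out.append(' ');
                    }
                }
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "l&oacute;d";
        String padded = decode(in, true);
        String compact = decode(in, false);
        System.out.println("[" + padded + "] length " + padded.length());
        System.out.println("[" + compact + "] length " + compact.length());
    }
}
```

With padding on, the trailing "d" stays at the same index as in the input, which is what an offset-based highlighter would rely on. With padding off, the word survives as a single term, which is what a standalone user of the reader would want.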

I can fix this in the code easily, but it looks like it was intentional, so before I write test cases and file a JIRA issue, I would like to understand what the original reasons might have been (I really don't see anything this would be useful for). Apologies if I'm being dim here.

Dawid

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ









