Perhaps the new CharFilter/CharStream stuff would work for this (i.e.
HTMLStripReader modified to implement corrective offsets instead of
inserting whitespace)?

https://issues.apache.org/jira/browse/SOLR-822

-Yonik
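
To make the two approaches concrete, here is a minimal sketch in Java. It is
not the CharFilter/CharStream code from SOLR-822 and not HTMLStripReader
itself: the class EntityDecodeSketch, its tiny entity table, and the
decode/correctOffset names are all hypothetical, and only a few named
entities are handled. It decodes an entity either with whitespace padding
(output keeps the input's length) or without padding while recording how far
later offsets must be shifted to point back into the original text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; this is not the CharFilter API from SOLR-822. */
final class EntityDecodeSketch {
    // Tiny entity table, just enough for the example.
    private static final Map<String, Character> ENTITIES =
        Map.of("oacute", '\u00F3', "amp", '&', "lt", '<', "gt", '>');

    /** Decoded text plus the corrections needed to map offsets back to the input. */
    static final class Result {
        final String text;
        // Each entry is {outputOffset, charsRemovedSoFar}, in increasing outputOffset order.
        private final List<int[]> corrections;

        Result(String text, List<int[]> corrections) {
            this.text = text;
            this.corrections = corrections;
        }

        /** Maps an offset in the decoded text to the matching offset in the original input. */
        int correctOffset(int outputOffset) {
            int removed = 0;
            for (int[] c : corrections) {
                if (c[0] > outputOffset) break;
                removed = c[1];
            }
            return outputOffset + removed;
        }
    }

    /**
     * Decodes named entities. With pad=true, each entity is followed by spaces so the
     * output keeps the input's length (the behaviour questioned in this thread). With
     * pad=false, nothing is inserted and the length difference is recorded instead.
     */
    static Result decode(String input, boolean pad) {
        StringBuilder out = new StringBuilder();
        List<int[]> corrections = new ArrayList<>();
        int removed = 0;
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            int semi = (c == '&') ? input.indexOf(';', i) : -1;
            Character decoded = (semi > i) ? ENTITIES.get(input.substring(i + 1, semi)) : null;
            if (decoded == null) {          // not a known entity: copy through
                out.append(c);
                i++;
                continue;
            }
            out.append(decoded.charValue());
            int extra = semi - i;           // entity length minus the one char emitted
            if (pad) {
                out.append(" ".repeat(extra));
            } else {
                removed += extra;
                corrections.add(new int[] {out.length(), removed});
            }
            i = semi + 1;
        }
        return new Result(out.toString(), corrections);
    }
}
```

For the input "l&oacute;d", decode(input, true) gives "ló" followed by seven
spaces and "d", which a tokenizer splits into two terms. decode(input, false)
gives "lód" as a single term, and correctOffset(2) returns 9, the position of
"d" in the original input, so highlighting offsets can still line up.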

On Sat, Nov 22, 2008 at 4:31 AM, Dawid Weiss
<[EMAIL PROTECTED]> wrote:
>
> Thanks Grant. You mean this issue:
> https://issues.apache.org/jira/browse/SOLR-42, I see now. This is a problem
> for me only, I guess, because I use HTMLStripReader independently of the
> Lucene architecture. This class is public; would it make sense if I provided
> a patch that lets callers switch the whitespace-emitting behaviour on and
> off, depending on their use case?
>
> Dawid
>
> Grant Ingersoll wrote:
>>
>> It is an attempt at making things work properly with the highlighter (such
>> that offsets are correct).  I believe it works most of the time, but there
>> still might be a few issues; check JIRA.
>>
>> -Grant
>>
>> On Nov 21, 2008, at 5:29 PM, Dawid Weiss wrote:
>>
>>>
>>> Hi folks. What's the motivation for adding white space after an entity
>>> reference in HTMLStripReader, exactly enough to match the length of the
>>> original entity? It basically looks like this:
>>>
>>> "l&oacute;d"
>>>
>>> (UTF: lód, "ice" in Polish) is translated into:
>>>
>>> "ló       d"
>>>
>>> This happens both with numeric entities and named entities. Needless to
>>> say, these added spaces in the character stream do no good as they
>>> effectively split a single term "lód" into two meaningless terms "l" and
>>> "d".
>>>
>>> I can fix this in the code easily, but it looks like it was intentional,
>>> so before I write test cases and file a JIRA issue I would like to
>>> understand what the original reasons might have been (I really don't see
>>> anything this would be useful for). Apologies if I'm being dim here.
>>>
>>> Dawid
>>
>> --------------------------
>> Grant Ingersoll
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
>
