Perhaps the new CharFilter/CharStream stuff would work for this (i.e. HTMLStripReader modified to implement corrective offsets instead of inserting whitespace)?
https://issues.apache.org/jira/browse/SOLR-822

-Yonik

On Sat, Nov 22, 2008 at 4:31 AM, Dawid Weiss <[EMAIL PROTECTED]> wrote:
>
> Thanks Grant. You mean this issue:
> https://issues.apache.org/jira/browse/SOLR-42, I see now. This is a problem
> for me only, I guess, because I use HTMLStripReader independently of the
> Lucene architecture. This class is public, would it make sense if I provided
> a patch that would switch the whitespace emitting functionality on and off,
> depending on a particular person's use case?
>
> Dawid
>
> Grant Ingersoll wrote:
>>
>> It is an attempt at making things work properly with the highlighter (such
>> that offsets are correct). I believe it works most of the time, but there
>> still might be a few issues, check JIRA.
>>
>> -Grant
>>
>> On Nov 21, 2008, at 5:29 PM, Dawid Weiss wrote:
>>
>>>
>>> Hi folks. What's the motivation to add exactly the number of white spaces
>>> after an entity declaration in HTMLStripReader? It basically looks like
>>> this:
>>>
>>> "lód"
>>>
>>> (UTF: lód, "ice" in Polish) is translated into:
>>>
>>> "ló d"
>>>
>>> This happens both with numeric entities and named entities. Needless to
>>> say, these added spaces in the character stream do no good as they
>>> effectively split a single term "lód" into two meaningless terms "l" and
>>> "d".
>>>
>>> I can fix this in the code easily, but it looks like it was intentional,
>>> so before I write test cases and commit a JIRA issue I would like to
>>> understand what the original reasons might have been (I really don't see
>>> anything this would be useful for). Apologies if I'm being dim here.
>>>
>>> Dawid
>>
>> --------------------------
>> Grant Ingersoll
>>
>> Lucene Helpful Hints:
>> http://wiki.apache.org/lucene-java/BasicsOfPerformance
>> http://wiki.apache.org/lucene-java/LuceneFAQ
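
To make the "corrective offsets" idea in this thread concrete, here is a rough sketch in plain Java. It is not the HTMLStripReader code and not the CharFilter/CharStream API from SOLR-822; the class EntityOffsetSketch and its methods text() and correctOffset() are hypothetical names made up for illustration. It handles only decimal numeric entities, and the input "l&#243;d" used in the note below is an assumed spelling of the entity in Dawid's example. Instead of padding the output with spaces, it decodes the entity and remembers how far output positions have drifted from the original input, so a highlighter could still map token offsets back.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: decode decimal entities and keep offset corrections. */
final class EntityOffsetSketch {
    private final StringBuilder out = new StringBuilder();
    // Each entry is {outputOffset, cumulativeDiff}. For output positions at or
    // after outputOffset, originalOffset = outputOffset + cumulativeDiff.
    private final List<int[]> corrections = new ArrayList<>();

    EntityOffsetSketch(String in) {
        int diff = 0;
        int i = 0;
        while (i < in.length()) {
            // Look for "&#<digits>;" starting at i.
            int semi = in.startsWith("&#", i) ? in.indexOf(';', i + 2) : -1;
            Integer codePoint = semi > i + 2 ? parseDecimal(in, i + 2, semi) : null;
            if (codePoint != null) {
                out.appendCodePoint(codePoint);
                int consumed = semi + 1 - i;                 // chars read from input
                int written = Character.charCount(codePoint); // chars added to output
                diff += consumed - written;
                corrections.add(new int[] {out.length(), diff});
                i = semi + 1;
            } else {
                // Not a well-formed decimal entity: copy the character through.
                out.append(in.charAt(i++));
            }
        }
    }

    /** The decoded text, with no padding whitespace inserted. */
    String text() {
        return out.toString();
    }

    /** Maps an offset in text() back to an offset in the original input. */
    int correctOffset(int outputOffset) {
        int diff = 0;
        for (int[] c : corrections) {
            if (c[0] > outputOffset) {
                break; // corrections are stored in increasing output order
            }
            diff = c[1];
        }
        return outputOffset + diff;
    }

    // Returns the code point for s[from, to) if it is all ASCII digits, else null.
    private static Integer parseDecimal(String s, int from, int to) {
        if (to - from > 7) {
            return null; // too long to be a valid code point
        }
        for (int k = from; k < to; k++) {
            char c = s.charAt(k);
            if (c < '0' || c > '9') {
                return null;
            }
        }
        int value = Integer.parseInt(s.substring(from, to));
        return Character.isValidCodePoint(value) ? value : null;
    }
}
```

With the assumed input "l&#243;d", text() gives the single term "lód", and the token's output offsets 0 and 3 map back to 0 and 8 in the original, which is the span a highlighter would need. A linear scan over the corrections is enough for a sketch; a real filter would more likely use a binary search.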