Re: Motivation for white space after entities in HTMLStripReader

Dawid Weiss Tue, 25 Nov 2008 12:15:31 -0800


Created an issue for this:


https://issues.apache.org/jira/browse/SOLR-882

and added the patch that adds a trigger to disable padding and (!) fixes a bug in the current code -- hex. entities are not properly padded.


Dawid


Dawid Weiss wrote:

Hi. Let me do this -- I'll provide the first patch that:
- adds entities which can be written all in uppercase and are displayed correctly by browsers (amp, gt, lt, copy),
- adds an optional argument that would prevent emitting additional spaces if these are not needed.
Then I'll look at the CharFilter/CharStream and see what can be done about it.
Dawid


Yonik Seeley wrote:
Perhaps the new CharFilter/CharStream stuff would work for this (i.e.
HTMLStripReader modified to implement corrective offsets instead of
inserting whitespace)?

https://issues.apache.org/jira/browse/SOLR-822
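
To make the "corrective offsets" idea concrete, here is a minimal sketch in plain Java. It is an assumption-laden illustration, not the SOLR-822 code: the class `OffsetCorrectionSketch` and its methods are hypothetical, and it only understands the single entity `&oacute;`. Instead of padding the output, it records how far each later position has shifted, so an offset in the stripped text can be mapped back to the original input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: decode "&oacute;" without padding and remember the shift.
class OffsetCorrectionSketch {
    private final StringBuilder out = new StringBuilder();
    // Each entry is {outputOffset, cumulativeShift}, in increasing outputOffset order.
    private final List<int[]> corrections = new ArrayList<>();

    OffsetCorrectionSketch(String in) {
        int shift = 0;
        int i = 0;
        while (i < in.length()) {
            int semi = in.charAt(i) == '&' ? in.indexOf(';', i) : -1;
            if (semi > i && in.substring(i, semi + 1).equals("&oacute;")) {
                out.append('\u00F3');            // the decoded character
                shift += semi - i;               // entity length minus the one char emitted
                corrections.add(new int[] {out.length(), shift});
                i = semi + 1;
            } else {
                out.append(in.charAt(i++));
            }
        }
    }

    /** The stripped text, with no padding inserted. */
    String text() {
        return out.toString();
    }

    /** Maps an offset in the stripped text back to an offset in the original input. */
    int correctOffset(int off) {
        int shift = 0;
        for (int[] c : corrections) {
            if (c[0] <= off) {
                shift = c[1];
            } else {
                break;
            }
        }
        return off + shift;
    }
}
```

With the input `"l&oacute;d"` the stripped text stays a single term, and the term's start and end offsets (0 and 3) map back to 0 and 10 in the original, so a highlighter could still mark the right span.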

-Yonik

On Sat, Nov 22, 2008 at 4:31 AM, Dawid Weiss
<[EMAIL PROTECTED]> wrote:
Thanks Grant. You mean this issue:
https://issues.apache.org/jira/browse/SOLR-42, I see now. This is a problem
for me only, I guess, because I use HTMLStripReader independently of the
Lucene architecture. This class is public, would it make sense if I provided
a patch that would switch the whitespace emitting functionality on and off,
depending on a particular person's use case?

Dawid

Grant Ingersoll wrote:
It is an attempt at making things work properly with the highlighter (such
that offsets are correct). I believe it works most of the time, but there
still might be a few issues, check JIRA.
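
A small illustration of the offset argument, as a sketch only: the strings are built by hand, the class `HighlightOffsetSketch` is hypothetical, and no Solr or Lucene classes are involved. It assumes a highlighter that finds a term in the stripped text and then cuts the same character range out of the original markup.

```java
// Hypothetical sketch: why padding keeps highlighter offsets aligned.
class HighlightOffsetSketch {
    static final String ORIGINAL = "l&oacute;d cube";
    // Entity decoded and padded with 7 spaces, so the length matches ORIGINAL.
    static final String PADDED = "l\u00F3" + " ".repeat(7) + "d cube";
    // Entity decoded without any padding.
    static final String UNPADDED = "l\u00F3d cube";

    /** Finds term in the stripped text and cuts the same range out of ORIGINAL. */
    static String highlight(String stripped, String term) {
        int start = stripped.indexOf(term);
        return ORIGINAL.substring(start, start + term.length());
    }
}
```

Here `highlight(PADDED, "cube")` returns `"cube"`, while `highlight(UNPADDED, "cube")` returns `"cute"`, a slice from the middle of the entity, because every offset after the entity has moved by seven characters.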

-Grant

On Nov 21, 2008, at 5:29 PM, Dawid Weiss wrote:
Hi folks. What's the motivation to add exactly the number of whitespaces
after an entity declaration in HTMLStripReader? It basically looks like
this:

"l&oacute;d"

(UTF: lód, "ice" in Polish) is translated into:

"ló       d"
This happens both with numeric entities and named entities. Needless to
say, these added spaces in the character stream do no good as they
effectively split a single term "lód" into two meaningless terms "l" and
"d".
I can fix this in the code easily, but it looks like it was intentional,
so before I write test cases and commit a JIRA issue I would like to
understand what the original reasons might have been (I really don't see
anything this would be useful for). Apologies if I'm being dim here.
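
The behaviour described above can be reproduced with a short sketch. This is not the HTMLStripReader source: the class `EntityPaddingSketch`, its tiny entity table, and the `pad` flag are all assumptions made for illustration. With `pad` set, each decoded entity is followed by enough spaces to keep the output as long as the input. Without it, the decoded character is emitted alone.

```java
import java.util.Map;

// Hypothetical sketch of entity decoding with optional length-preserving padding.
class EntityPaddingSketch {
    // A deliberately tiny entity table, for the example only.
    private static final Map<String, Character> ENTITIES = Map.of(
            "oacute", '\u00F3', "amp", '&', "lt", '<', "gt", '>');

    /** Decodes known named entities; pads with spaces to the entity's length if pad is true. */
    static String decode(String in, boolean pad) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            int semi = (c == '&') ? in.indexOf(';', i) : -1;
            Character decoded = (semi > i) ? ENTITIES.get(in.substring(i + 1, semi)) : null;
            if (decoded == null) {
                out.append(c);                   // not a known entity, copy as is
                i++;
                continue;
            }
            out.append(decoded.charValue());
            if (pad) {
                int entityLength = semi - i + 1; // includes '&' and ';'
                for (int k = 1; k < entityLength; k++) {
                    out.append(' ');             // one space per character the entity lost
                }
            }
            i = semi + 1;
        }
        return out.toString();
    }
}
```

`decode("l&oacute;d", true)` yields "ló" followed by seven spaces and "d" (ten characters, like the input), which a whitespace tokenizer splits into two terms. `decode("l&oacute;d", false)` yields the single term "lód".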

Dawid
--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
