Re: proposed change to CharTokenizer

Michael Sokolov Sun, 17 Oct 2010 11:37:22 -0700

OK - no responses to this, but in case you were curious... the patch I suggested won't work - so please don't install it :)

In the end I was able to get the behavior I wanted by fiddling with offsets in my CharFilter, but it requires detecting token boundaries in the CharFilter stage, which seems like abstraction leakage to me. Maybe there's a better way?


-Mike

On 10/14/2010 12:08 PM, Mike Sokolov wrote:

Background: I've been trying to enable hit highlighting of XML documents in such a way that the highlighting preserves the well-formedness of the XML.
I thought I could get this to work by implementing a CharFilter that extracts text from XML (somewhat like HTMLStripCharFilter, except I am using an XML parser - however I think the concept is also applicable to HTMLStripCharFilter) while preserving the offsets of the text in the original XML document so as to enable highlighting.
I ran into a problem in CharTokenizer.incrementToken(), which calls correctOffset() as follows:
offsetAtt.setOffset(correctOffset(start),correctOffset(start+length));
The issue is that the end offset is computed as the offset of the beginning of the *next* block of text rather than the offset of the end of *this* block of text.
In my test case:

bold text regular text

I get tokens like this ([] showing token boundaries):

 [bold] [text][regular][text]

instead of:

 [bold][text][regular][text]
I don't think this problem can be fixed by jiggling offsets, or indeed by wrapping or extending CharTokenizer in any straightforward way. The fix I found is to change the line in CharTokenizer.incrementToken() to:
offsetAtt.setOffset(correctOffset(start),correctOffset(start+length-1)+1);
Again, conceptually, this computes the corrected offset of the last character in the token, and then marks the end of the token as the immediately following position, rather than including all the garbage characters in between the end of this token and the beginning of the next.
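To make the arithmetic concrete, here is a minimal, self-contained Java sketch. It does not use Lucene. The correctOffset() below is a stand-in that assumes corrections are stored as cumulative differences and looked up by the nearest preceding filtered offset. The input string with <b> markup, the class name OffsetSketch and the helpers currentEnd() and proposedEnd() are hypothetical, chosen only for illustration. The sketch compares the two end-offset computations on one token; it is not evidence that the change is safe (the follow-up at the top of this message reports that the patch did not work).

```java
import java.util.Map;
import java.util.TreeMap;

public class OffsetSketch {
    // Hypothetical original document; the filtered text would be
    // "bold text regular text".
    static final String ORIGINAL = "<b>bold text</b> regular text";

    // Assumed correction map: filtered offset -> cumulative number of
    // characters deleted before that offset in the original.
    static final TreeMap<Integer, Integer> DIFFS = new TreeMap<Integer, Integer>();
    static {
        DIFFS.put(0, 3); // "<b>" (3 chars) removed before filtered offset 0
        DIFFS.put(9, 7); // "</b>" (4 more chars) removed before filtered offset 9
    }

    // Stand-in for correctOffset(): map a filtered offset back to the original.
    static int correctOffset(int off) {
        Map.Entry<Integer, Integer> e = DIFFS.floorEntry(off);
        return e == null ? off : off + e.getValue();
    }

    // End offset as computed by the existing line.
    static int currentEnd(int start, int length) {
        return correctOffset(start + length);
    }

    // End offset as computed by the proposed line.
    static int proposedEnd(int start, int length) {
        return correctOffset(start + length - 1) + 1;
    }

    public static void main(String[] args) {
        int start = 5, length = 4; // first "text" token in the filtered text
        int s = correctOffset(start);
        System.out.println("current : [" + ORIGINAL.substring(s, currentEnd(start, length)) + "]");
        System.out.println("proposed: [" + ORIGINAL.substring(s, proposedEnd(start, length)) + "]");
    }
}
```

With this assumed map, the existing computation ends the first "text" token after the closing tag, while the proposed one ends it right after the last letter. For a token that is not followed by deleted characters, such as "bold", both computations give the same end offset.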
My impression is that this change should be completely backwards-compatible since its behavior will be identical for CharFilters that don't actually perform character deletion, and AFAICT the only existing CharFilter performs replacements and expansions (of ligatures and the like). But my knowledge of Lucene is far from comprehensive.
Does this seem like a reasonable patch?

-Mike

Michael Sokolov
Engineering Director
www.ifactory.com
@iFactoryBoston

PubFactory: the revolutionary e-publishing platform from iFactory


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


