Re: [PR] TruncateTokenFilter truncates safely when the final char is a surrogate pair [lucene]

via GitHub Thu, 02 Apr 2026 07:57:56 -0700


msokolov commented on PR #15900:
URL: https://github.com/apache/lucene/pull/15900#issuecomment-4178490814


   If we make a new filter and deprecate the old one, then we won't be changing 
behavior for people who may be relying on the current behavior.  On the other 
hand, anybody who is going to see a significant change as a result (say they 
are processing Chinese text with many code points that take two UTF-16 chars, 
i.e. surrogate pairs) is already dealing with lots of truncated code points, 
which is buggy, and they are going to want to switch to the new impl, even if 
it means they need to halve the truncation length to get behavior comparable 
to what they had before. So on the whole, I'd be in favor of fixing the 
existing filter.
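
To illustrate the bug being discussed, here is a minimal sketch of how cutting 
at a fixed number of UTF-16 chars can split a surrogate pair, and one way to 
avoid it. This is not the code from the PR and does not use Lucene's API; the 
class name TruncateSketch and both method names are hypothetical, and backing 
off by one char is only an assumed strategy for the fix.

```java
public class TruncateSketch {

    // Naive truncation: cut at a fixed number of UTF-16 chars.
    // This may leave a dangling high surrogate at the end of the result.
    public static String truncateNaive(String term, int maxChars) {
        return term.length() <= maxChars ? term : term.substring(0, maxChars);
    }

    // Assumed fix: if the last kept char is a high surrogate whose
    // partner would be cut off, drop that char as well.
    public static String truncateSafely(String term, int maxChars) {
        if (term.length() <= maxChars) {
            return term;
        }
        int end = maxChars;
        if (end > 0 && Character.isHighSurrogate(term.charAt(end - 1))) {
            end--;
        }
        return term.substring(0, end);
    }

    public static void main(String[] args) {
        // 'a', then U+20BB7 (one code point, two chars), then 'b'
        String term = "a\uD842\uDFB7b";

        String naive = truncateNaive(term, 2);
        String safe = truncateSafely(term, 2);

        // The naive result ends in an unpaired high surrogate.
        System.out.println(Character.isHighSurrogate(naive.charAt(naive.length() - 1)));
        // The safe result contains only whole code points.
        System.out.println(safe.equals("a"));
    }
}
```

With a length of 2, the naive version keeps 'a' plus half of the supplementary 
character, while the safe version keeps only 'a'. A term made up mostly of 
such characters would therefore keep roughly the same text under both 
versions, minus the broken trailing half.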


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


