Re: [PR] TruncateTokenFilter truncates safely when the final char is a surrogate pair [lucene]

via GitHub Thu, 02 Apr 2026 08:08:57 -0700


uschindler commented on PR #15900:
URL: https://github.com/apache/lucene/pull/15900#issuecomment-4178566600


   > If we make a new filter and deprecate the old one, then we won't be 
changing behavior for people who may be relying on the current behavior. On 
the other hand, anybody who is going to see a significant change as a result 
(say they are processing Chinese text that has lots of code points encoded as 
two chars, i.e. surrogate pairs) is already dealing with lots of truncated 
code points, which is buggy, and they are going to want to switch to the new 
impl, even if it means they need to halve the truncation length to get 
behavior comparable to what they had before. So on the whole, I'd be in favor 
of fixing the existing filter.
   
   So let me push my changes here, and we can still discuss whether to split 
the filter, OK? Or should I create a new PR? Doing that would make the 
discussion here invisible, so I'd prefer to just push here.
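
   For readers following the thread, here is a minimal sketch of what "not 
cutting a surrogate pair" could mean for a char-based truncation length. The 
class `SafeTruncate` and the method `safeLength` are hypothetical names used 
only for illustration; this is not the code in the PR, and the actual fix may 
count code points instead of chars.

```java
// Illustrative sketch only: SafeTruncate and safeLength are hypothetical
// names, not part of Lucene and not the change proposed in this PR.
final class SafeTruncate {

    private SafeTruncate() {}

    /**
     * Returns the number of chars to keep from buf[0..length) so that at most
     * maxChars chars remain and the cut never falls between the two halves of
     * a surrogate pair.
     */
    static int safeLength(char[] buf, int length, int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be >= 0");
        }
        if (length <= maxChars) {
            return length; // nothing to truncate
        }
        int end = maxChars;
        // length > maxChars here, so buf[end] is a valid index.
        if (end > 0
                && Character.isHighSurrogate(buf[end - 1])
                && Character.isLowSurrogate(buf[end])) {
            end--; // drop the dangling high surrogate as well
        }
        return end;
    }
}
```

   With this approach a term can end up one char shorter than the configured 
length, but it never ends in half of a code point.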


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


