magibney commented on issue #892: LUCENE-8972: Add ICUTransformCharFilter, to 
support pre-tokenizer ICU text transformation
URL: https://github.com/apache/lucene-solr/pull/892#issuecomment-537538822
 
 
   Here's a thought: what if we provided a boolean configuration option like 
`assumeExternalUnicodeNormalization`? Many of these transforms work on NFD 
input, and produce NFC output, but they generally are configured defensively 
(not assuming input to be NFD, and not assuming that output will be externally 
converted to NFC).
   
   This is understandable, but results in the odd situation (for example) that 
an analysis component like "ICUTransformFilter(Cyrillic-Latin)" would have NFC 
output, but _only_ for characters whose input representation matched the 
top-level Cyrillic-Latin filter (which is pretty restrictive). Input characters 
that didn't match the top-level filter would be untouched by any component of 
the underlying CompoundTransliterator. So if you want fully Unicode-normalized 
output (and in the context of an analysis chain, most do), you have to 
separately apply post-transform NFC normalization anyway.
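
To make the partial-normalization effect concrete, here is a minimal sketch that uses only `java.text.Normalizer` from the JDK, not ICU4J. The method `nfcOnCyrillicOnly` is a hypothetical stand-in for a filtered transform: it composes to NFC only inside runs of Cyrillic characters and copies everything else through untouched. It does not transliterate and is not how ICU implements filters; it only models the "NFC for matched input only" behavior described above.

```java
import java.text.Normalizer;

public class PartialNormalizationDemo {

    // Hypothetical stand-in for a filtered transform: runs of Cyrillic letters
    // (plus combining marks attached to them) are normalized to NFC;
    // all other characters are copied through unchanged.
    static String nfcOnCyrillicOnly(String input) {
        StringBuilder out = new StringBuilder();
        StringBuilder run = new StringBuilder();
        boolean inCyrillicRun = false;
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            boolean isCyrillic =
                Character.UnicodeScript.of(cp) == Character.UnicodeScript.CYRILLIC;
            boolean isMark = Character.getType(cp) == Character.NON_SPACING_MARK;
            if (isCyrillic || (inCyrillicRun && isMark)) {
                run.appendCodePoint(cp);
                inCyrillicRun = true;
            } else {
                // Leaving the filtered region: normalize what was collected.
                out.append(Normalizer.normalize(run, Normalizer.Form.NFC));
                run.setLength(0);
                out.appendCodePoint(cp);
                inCyrillicRun = false;
            }
        }
        out.append(Normalizer.normalize(run, Normalizer.Form.NFC));
        return out.toString();
    }

    public static void main(String[] args) {
        // Decomposed Cyrillic (U+0438 U+0306) followed by decomposed Latin (e U+0301).
        String input = "\u0438\u0306 cafe\u0301";
        String partial = nfcOnCyrillicOnly(input);
        // The Latin part is still decomposed, so the whole string is not NFC.
        System.out.println(Normalizer.isNormalized(partial, Normalizer.Form.NFC));
        // An external NFC pass over the entire output is still required.
        String full = Normalizer.normalize(partial, Normalizer.Form.NFC);
        System.out.println(Normalizer.isNormalized(full, Normalizer.Form.NFC));
    }
}
```

The second normalization pass repeats work for the Cyrillic run, which is the redundancy in question.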
   
   At best, this ends up doing some redundant work; but for the performance 
case we're considering here, there are particular implications. NFC, as a 
trailing transformation step, is both _very_ common and _very_ active -- active 
in the sense that it will in many common contexts block output waiting for 
combining diacritics for literally almost every character. If we know we're 
externally applying unicode normalization over the entire output, skipping 
baked-in post-NFC for every transform component avoids redundant work, but more 
importantly avoids a common case that's virtually guaranteed to result in a 
substantial amount of partial transliteration, rollback, etc. I think this can 
be done relatively cleanly using `Transliterator.getElements()`, 
`toRules(false)`, and `createFromRules(...)`.
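
As a rough sketch of the filtering step only, the following plain-Java code models a compound transform as a list of element IDs and drops the normalization steps at its leading and trailing edges. It does not call ICU4J. The IDs shown and the `NORMALIZERS` set are assumptions for illustration; the IDs that `getElements()` actually reports may differ. A real implementation would read the elements from the `Transliterator`, then rebuild the reduced transform from rules, which is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StripEdgeNormalization {

    // Assumed IDs of normalization elements; illustrative, not taken from ICU.
    static final Set<String> NORMALIZERS =
        new HashSet<>(Arrays.asList("NFC", "NFD", "NFKC", "NFKD"));

    // Drops normalization steps at the start and end of the element list,
    // leaving any normalization in the middle of the chain in place.
    static List<String> stripEdgeNormalization(List<String> elementIds) {
        int start = 0;
        int end = elementIds.size();
        while (start < end && NORMALIZERS.contains(elementIds.get(start))) {
            start++;
        }
        while (end > start && NORMALIZERS.contains(elementIds.get(end - 1))) {
            end--;
        }
        return new ArrayList<>(elementIds.subList(start, end));
    }

    public static void main(String[] args) {
        // "core-rules" is a hypothetical placeholder for the rule-based element.
        List<String> elements = Arrays.asList("NFD", "core-rules", "NFC");
        System.out.println(stripEdgeNormalization(elements));
    }
}
```

Stripping only at the edges is deliberate in this sketch: an interior normalization step may be something later rules depend on, whereas the edge steps are the ones an external normalizer would make redundant.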
   
   I'd be curious to know what you think, @msokolov, and perhaps @rmuir?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
