[ https://issues.apache.org/jira/browse/LUCENE-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Gibney updated LUCENE-8972:
-----------------------------------
    Summary: CharFilter version of ICUTransformFilter, to better support dictionary-based tokenization  (was: CharFilter version ICUTransformFilter, to better support dictionary-based tokenization)

> CharFilter version of ICUTransformFilter, to better support dictionary-based tokenization
> -----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8972
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8972
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: master (9.0), 8.2
>            Reporter: Michael Gibney
>            Priority: Minor
>
> The ICU Transliteration API is currently exposed through Lucene only post-tokenizer, via ICUTransformFilter. Some tokenizers (particularly dictionary-based ones) may assume pre-normalized input (e.g., for Chinese characters, there may be an assumption of traditional-only or simplified-only input, either across all input or per dictionary-defined token).
> The potential usefulness of a CharFilter that exposes the ICU Transliteration API was suggested in a [thread on the Solr mailing list|https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201807.mbox/%3C4DAB7BA7-42A8-4009-8B49-60822B00DE7D%40wunderwood.org%3E], and my hope is that this issue can facilitate more detailed discussion of the proposed addition.
> A concrete example of mixed traditional/simplified characters that are currently tokenized differently by the ICUTokenizer:
> * 红楼梦 (SSS)
> * 紅樓夢 (TTT)
> * 紅楼夢 (TST)
> The first two tokens (simplified-only and traditional-only, respectively) are included in the [CJ dictionary that backs ICUTokenizer|https://raw.githubusercontent.com/unicode-org/icu/release-62-1/icu4c/source/data/brkitr/dictionaries/cjdict.txt], but the last (a mixture of traditional and simplified characters) is not, and is not recognized as a token.
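To illustrate the effect a pre-tokenizer transliteration would have, here is a minimal sketch of Traditional-to-Simplified normalization applied before tokenization. This is a toy: the class name `ToySimplifier` and its three-character map are invented for this example and cover only the characters above; a real implementation would delegate to ICU's Transliterator (e.g. `Transliterator.getInstance("Traditional-Simplified")`) rather than a hand-built table.

```java
import java.util.Map;

public class ToySimplifier {
    // Toy traditional -> simplified map covering only the example characters.
    // A real CharFilter would use ICU's rule-based Transliterator instead.
    private static final Map<Character, Character> T2S = Map.of(
        '紅', '红',
        '樓', '楼',
        '夢', '梦'
    );

    public static String normalize(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            sb.append(T2S.getOrDefault(c, c)); // pass through unmapped chars
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // All three variants collapse to the simplified-only form,
        // which is the form present in the CJ dictionary.
        System.out.println(normalize("红楼梦")); // SSS, unchanged
        System.out.println(normalize("紅樓夢")); // TTT -> SSS
        System.out.println(normalize("紅楼夢")); // TST -> SSS
    }
}
```

Under this normalization, the mixed TST token 紅楼夢 becomes 红楼梦 before the dictionary lookup ever runs, which is why a CharFilter (applied ahead of the tokenizer) can help where a TokenFilter (applied after) cannot.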
> Even _if_ we assume this to be an intentional omission from the dictionary, resulting in behavior that could be desirable for some use cases, there are surely use cases that would benefit from a more permissive dictionary-based tokenization strategy (such as could be supported by pre-tokenizer transliteration).

--
This message was sent by Atlassian Jira
(v8.3.2#803003)