> Any thoughts?

The best idea I have is to tokenize with ICUTokenizer, which tags emoji sequences with the token type "<EMOJI>", and then use ConditionalTokenFilter to send every token EXCEPT those of type "<EMOJI>" to your WordDelimiterFilter. That way WordDelimiterFilter never sees the emoji at all and can't screw them up.
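To make the routing concrete, here is a toy model in plain Java. It is only a sketch of the idea and not the Lucene API: the Token class, splitOnDelimiters, and applyExceptEmoji are hypothetical names made up for illustration, and the splitter is a crude stand-in for WordDelimiterFilter. In Lucene itself the equivalent hook would be a ConditionalTokenFilter subclass whose shouldFilter() checks the token's TypeAttribute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Toy model of type-based routing. NOT the Lucene API; all names here are hypothetical. */
final class EmojiRouting {
    static final String EMOJI = "<EMOJI>";   // the type tag mentioned above
    static final String WORD = "<ALPHANUM>"; // type tag used here for ordinary words

    /** A term plus its type tag, standing in for Lucene's term and type attributes. */
    static final class Token {
        final String text;
        final String type;

        Token(String text, String type) {
            this.text = text;
            this.type = type;
        }
    }

    /**
     * Crude stand-in for a word-delimiter step: keeps only runs of letters and
     * digits. A token made purely of symbols (such as an emoji) yields no parts.
     */
    static List<Token> splitOnDelimiters(Token t) {
        List<Token> parts = new ArrayList<>();
        for (String p : t.text.split("[^\\p{L}\\p{N}]+")) {
            if (!p.isEmpty()) {
                parts.add(new Token(p, t.type));
            }
        }
        return parts;
    }

    /** Sends every token through the filter EXCEPT those typed as emoji, which pass through untouched. */
    static List<Token> applyExceptEmoji(List<Token> in, Function<Token, List<Token>> filter) {
        List<Token> out = new ArrayList<>();
        for (Token t : in) {
            if (EMOJI.equals(t.type)) {
                out.add(t);                    // bypass: the filter never sees it
            } else {
                out.addAll(filter.apply(t));   // normal path
            }
        }
        return out;
    }

    /** Convenience helper: extracts just the term text of each token. */
    static List<String> texts(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            out.add(t.text);
        }
        return out;
    }
}
```

Calling EmojiRouting.applyExceptEmoji(tokens, EmojiRouting::splitOnDelimiters) splits a word token like "wi-fi" into its parts while handing any token typed "<EMOJI>" back unchanged. Passing the same emoji token straight to splitOnDelimiters loses it, which is the failure the bypass avoids.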
