WordDelimiterGraphFilter (WDGF) and WordDelimiterFilter treat emoji as SUBWORD_DELIM characters, like punctuation, and therefore remove them. We would like to be able to search for emoji while still using this filter to handle dashes, dots, and other intra-word punctuation.
These filters identify non-word, non-digit characters by two mechanisms: direct lookup in a character table, and fallback to the Unicode general category. The character table can't easily be used to handle emoji, since it would need to be populated with the entire Unicode character set in order to reach emoji-land. On the other hand, if we change the handling of emoji by category, say by treating them as word characters, we also pull in all the other OTHER_SYMBOL characters. Maybe that's OK, but I think some of these other symbols are more like punctuation (this category is a grab bag of all kinds of beautiful dingbats like trademark signs, degree symbols, etc.; see https://www.compart.com/en/unicode/category/So).

On the other other hand, how do we even identify emoji? I don't think the Java Character API is adequate to the task. Perhaps we must incorporate a table.

Suppose we come up with a good way to classify emoji; how should this filter then treat them? Sometimes they are embedded in tokens with other characters (I see people using emoji and other symbols as part of their names), and sometimes they stand alone, separated by whitespace. I think one way forward would be to treat emoji as a special class akin to words and numbers, and provide options (SPLIT_ON_EMOJI, CATENATE_EMOJI) similar to the ones we have for those classes. Or maybe, as a convenience, we provide a way to get a table that encodes the default classifications of all characters up to some given limit, and then let the caller modify it? That would at least provide an easy way to treat emoji as letters.

Any thoughts?
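
To make the last idea concrete, here is a rough sketch of what a "default table up to a limit, then let the caller modify it" helper might look like. Everything in it is an assumption for illustration: the class name, the method names, the type codes, and the simplified category mapping are hypothetical and are not the filter's actual API. It uses only `java.lang.Character` and `java.util.Arrays`.

```java
import java.util.Arrays;

/** Sketch only: names and type codes are hypothetical, not the filter's API. */
class CharTypeTableSketch {
    static final byte LOWER = 0x01;
    static final byte UPPER = 0x02;
    static final byte ALPHA = 0x03;          // LOWER | UPPER
    static final byte DIGIT = 0x04;
    static final byte SUBWORD_DELIM = 0x08;

    /** Fallback classification by Unicode general category (simplified mapping). */
    static byte classifyByCategory(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.UPPERCASE_LETTER:
                return UPPER;
            case Character.LOWERCASE_LETTER:
                return LOWER;
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
                return ALPHA;
            case Character.DECIMAL_DIGIT_NUMBER:
                return DIGIT;
            default:
                // Everything else, including OTHER_SYMBOL (So), is a delimiter here.
                return SUBWORD_DELIM;
        }
    }

    /** Builds the default classification for every code point below limit. */
    static byte[] defaultTable(int limit) {
        byte[] table = new byte[limit];
        for (int cp = 0; cp < limit; cp++) {
            table[cp] = classifyByCategory(cp);
        }
        return table;
    }

    /** Table lookup first; category fallback for code points beyond the table. */
    static byte typeOf(byte[] table, int codePoint) {
        return codePoint < table.length ? table[codePoint] : classifyByCategory(codePoint);
    }

    public static void main(String[] args) {
        byte[] table = defaultTable(0x20000);
        int grinningFace = 0x1F600;  // category So
        int tradeMark = 0x2122;      // also category So

        // By default both symbols are delimiters.
        System.out.println(typeOf(table, grinningFace) == SUBWORD_DELIM);
        System.out.println(typeOf(table, tradeMark) == SUBWORD_DELIM);

        // Caller override: treat the Emoticons block (U+1F600..U+1F64F) as letters.
        Arrays.fill(table, 0x1F600, 0x1F650, ALPHA);
        System.out.println(typeOf(table, grinningFace) == ALPHA);
        System.out.println(typeOf(table, tradeMark) == SUBWORD_DELIM);
    }
}
```

With a limit of 0x20000 the table costs 128 KB, which seems tolerable, and the override leaves the trademark sign as a delimiter while the emoticons become letters. The sketch dodges the hard part, though: the caller still has to know which ranges are emoji. General category alone won't say, since skin-tone modifiers (U+1F3FB..U+1F3FF) are MODIFIER_SYMBOL and the zero-width joiner (U+200D) is FORMAT. JDK 21 and later add `Character.isEmoji(int)`, but older runtimes have nothing comparable.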