Ah, I see: there is \p{Emoji} to start with, which is nice, but there is also this Extended_Pictographic property. I'll read more and get back to you if I have questions. It might be a little while before I dig into this, though. Thanks again.
On Tue, Jul 3, 2018 at 11:25 AM Robert Muir <rcm...@gmail.com> wrote:
> If you customized the rules, maybe have a look at
> https://issues.apache.org/jira/browse/LUCENE-8366
>
> The rules got simpler and we also updated the customization example
> used for the factory's test.
>
> On Tue, Jul 3, 2018 at 10:46 AM, Michael Sokolov <msoko...@gmail.com>
> wrote:
> > Yes that sounds good -- this ConditionalTokenFilter is going to be very
> > helpful. We have overridden the ICUTokenizer's rbbi rules, but I'll poke
> > around and see about incorporating the emoji rules from there. Thanks
> > Robert
> >
> > On Tue, Jul 3, 2018 at 9:28 AM Robert Muir <rcm...@gmail.com> wrote:
> >
> >> > Any thoughts?
> >>
> >> best idea I have would be to tokenize with ICUTokenizer, which will
> >> tag emoji sequences as "<EMOJI>" token type, then use
> >> ConditionalTokenFilter to send all tokens EXCEPT those with token type
> >> of "<EMOJI>" to your WordDelimiterFilter. This way
> >> WordDelimiterFilter never sees the emoji at all and can't screw them
> >> up.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> >> For additional commands, e-mail: java-user-h...@lucene.apache.org
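
For reference, here is a rough, library-free sketch of the routing idea in Robert's suggestion: tokens tagged "<EMOJI>" bypass the word-delimiter step, and everything else goes through it. This is not Lucene code. The class EmojiRoutingSketch, its Token type, and the naive splitOnDelimiters stand-in for WordDelimiterFilter are all hypothetical names made up for illustration, and the token types are assigned by hand rather than by ICUTokenizer. In a real analysis chain the same check would presumably go in a ConditionalTokenFilter subclass's shouldFilter() method, reading the token's TypeAttribute.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Library-free model of type-based token routing. Not Lucene code. */
public class EmojiRoutingSketch {

    /** A token plus the type tag a tokenizer would have assigned to it. */
    public static final class Token {
        public final String text;
        public final String type;

        public Token(String text, String type) {
            this.text = text;
            this.type = type;
        }

        @Override
        public String toString() {
            return type + " " + text;
        }
    }

    /**
     * Hypothetical, much-simplified stand-in for WordDelimiterFilter:
     * splits a token on every run of characters that are neither
     * letters nor digits, and drops the delimiters themselves.
     */
    public static List<Token> splitOnDelimiters(Token token) {
        List<Token> parts = new ArrayList<>();
        for (String piece : token.text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                parts.add(new Token(piece, token.type));
            }
        }
        return parts;
    }

    /**
     * Sends every token EXCEPT those typed "<EMOJI>" through the given
     * filter; emoji tokens are passed along untouched.
     */
    public static List<Token> route(List<Token> tokens,
                                    Function<Token, List<Token>> filter) {
        List<Token> out = new ArrayList<>();
        for (Token token : tokens) {
            if ("<EMOJI>".equals(token.type)) {
                out.add(token);                  // bypass the filter
            } else {
                out.addAll(filter.apply(token)); // normal path
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hand-tagged input: one ordinary token, and one emoji sequence
        // (thumbs up + skin-tone modifier, written as UTF-16 escapes).
        List<Token> tokens = Arrays.asList(
            new Token("foo_bar", "<ALPHANUM>"),
            new Token("\uD83D\uDC4D\uD83C\uDFFD", "<EMOJI>"));

        for (Token t : route(tokens, EmojiRoutingSketch::splitOnDelimiters)) {
            System.out.println(t);
        }
    }
}
```

In this toy model, the stand-in splitter on its own would discard the emoji token entirely, since none of its code points are letters or digits. The type check is what keeps it intact, which is the point of filtering on "<EMOJI>" before the delimiter step rather than after.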