Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory
Hi Ahmet,

Ok. Thanks for your advice.

Regards,
Edwin

On 25 November 2017 at 10:23, Ahmet Arslan wrote:

> Hi Zheng,
>
> UAX29UET recognizes URLs and e-mails. It does not tokenize them. It keeps
> each one as a single token.
>
> StandardTokenizer produces two or more tokens for an entity.
>
> Please try them using the analysis page, and use whichever one suits your
> requirements.
>
> Ahmet
>
> On Friday, November 24, 2017, 11:46:57 AM GMT+3, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com> wrote:
>
> Hi,
>
> I am indexing email addresses into Solr via EML files. Currently, I am
> using ClassicTokenizerFactory with LowerCaseFilterFactory. However, I
> found that we can also use UAX29URLEmailTokenizerFactory with
> LowerCaseFilterFactory.
>
> Does anyone have any recommendation on which Tokenizer is better?
>
> I am currently using Solr 6.5.1, and planning to upgrade to Solr 7.1.0.
>
> Regards,
> Edwin
Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory
Hi Rick,

Neither tokenizer splits on the hyphen in an email address like this:

solr-user@lucene.apache.org

The entire email address remains intact with both tokenizers.

Regards,
Edwin

On 24 November 2017 at 20:19, Rick Leir wrote:

> Edwin
> There is a spec for which characters are acceptable in an email name, and
> another spec for chars in a domain name. I suspect you will have more
> success with a tokenizer which is specialized for email, but I have not
> looked at UAX29URLEmailTokenizerFactory. Does ClassicTokenizerFactory split
> on hyphens?
> Cheers --Rick
>
> On November 24, 2017 3:46:46 AM EST, Zheng Lin Edwin Yeo <
> edwinye...@gmail.com> wrote:
> >Hi,
> >
> >I am indexing email addresses into Solr via EML files. Currently, I am
> >using ClassicTokenizerFactory with LowerCaseFilterFactory. However, I
> >found that we can also use UAX29URLEmailTokenizerFactory with
> >LowerCaseFilterFactory.
> >
> >Does anyone have any recommendation on which Tokenizer is better?
> >
> >I am currently using Solr 6.5.1, and planning to upgrade to Solr 7.1.0.
> >
> >Regards,
> >Edwin
>
> --
> Sorry for being brief. Alternate email is rickleir at yahoo dot com
Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory
Hi Zheng,

UAX29UET recognizes URLs and e-mails. It does not tokenize them. It keeps each one as a single token.

StandardTokenizer produces two or more tokens for an entity.

Please try them using the analysis page, and use whichever one suits your requirements.

Ahmet

On Friday, November 24, 2017, 11:46:57 AM GMT+3, Zheng Lin Edwin Yeo wrote:

Hi,

I am indexing email addresses into Solr via EML files. Currently, I am using ClassicTokenizerFactory with LowerCaseFilterFactory. However, I found that we can also use UAX29URLEmailTokenizerFactory with LowerCaseFilterFactory.

Does anyone have any recommendation on which Tokenizer is better?

I am currently using Solr 6.5.1, and planning to upgrade to Solr 7.1.0.

Regards,
Edwin
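For anyone who wants to try the comparison suggested above, here is a minimal schema sketch, assuming two side-by-side field types. The field type names (text_email_classic, text_email_uax29) and the positionIncrementGap value are hypothetical and chosen only for illustration; the tokenizer and filter factories are the ones named in this thread. With both types defined, the same sample address can be run through each one on the analysis page to compare the tokens.

```xml
<!-- Hypothetical field type names; only the tokenizer and filter
     class names come from the thread. -->

<!-- Current setup: ClassicTokenizerFactory + LowerCaseFilterFactory -->
<fieldType name="text_email_classic" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Alternative: UAX29URLEmailTokenizerFactory + LowerCaseFilterFactory -->
<fieldType name="text_email_uax29" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because no type attribute is given on the analyzer element, the same chain applies at both index and query time, which keeps the comparison simple.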
Re: Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory
Edwin

There is a spec for which characters are acceptable in an email name, and another spec for chars in a domain name. I suspect you will have more success with a tokenizer which is specialized for email, but I have not looked at UAX29URLEmailTokenizerFactory. Does ClassicTokenizerFactory split on hyphens?

Cheers --Rick

On November 24, 2017 3:46:46 AM EST, Zheng Lin Edwin Yeo wrote:
>Hi,
>
>I am indexing email addresses into Solr via EML files. Currently, I am
>using ClassicTokenizerFactory with LowerCaseFilterFactory. However, I
>also found that we can also use UAX29URLEmailTokenizerFactory with
>LowerCaseFilterFactory.
>
>Does anyone have any recommendation on which Tokenizer is better?
>
>I am currently using Solr 6.5.1, and planning to upgrade to Solr 7.1.0.
>
>Regards,
>Edwin

--
Sorry for being brief. Alternate email is rickleir at yahoo dot com
Difference between UAX29URLEmailTokenizerFactory and ClassicTokenizerFactory
Hi,

I am indexing email addresses into Solr via EML files. Currently, I am using ClassicTokenizerFactory with LowerCaseFilterFactory. However, I found that we can also use UAX29URLEmailTokenizerFactory with LowerCaseFilterFactory.

Does anyone have any recommendation on which Tokenizer is better?

I am currently using Solr 6.5.1, and planning to upgrade to Solr 7.1.0.

Regards,
Edwin