Re: email field - analyzed and not analyzed in single field using custom analyzer

Steve Rowe Thu, 15 Jun 2017 07:14:55 -0700

Hi Kumaran,

WordDelimiterGraphFilter with PRESERVE_ORIGINAL should do what you want: 
<http://lucene.apache.org/core/6_6_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/WordDelimiterGraphFilter.html>.


Here’s a test I added to TestWordDelimiterGraphFilter.java that passed for me:

-----
public void testEmail() throws Exception {
  // Split into word and number parts, and also keep the unsplit token.
  final int flags = GENERATE_WORD_PARTS | GENERATE_NUMBER_PARTS
      | SPLIT_ON_CASE_CHANGE | SPLIT_ON_NUMERICS | PRESERVE_ORIGINAL;
  Analyzer a = new Analyzer() {
    @Override public TokenStreamComponents createComponents(String field) {
      Tokenizer tokenizer = new MockTokenizer(MockTokenizer.WHITESPACE, false);
      return new TokenStreamComponents(tokenizer,
          new WordDelimiterGraphFilter(tokenizer, flags, null));
    }
  };
  assertAnalyzesTo(a, "[email protected]",
      new String[] { "[email protected]", "will", "smith", "yahoo", "com" },
      null, null, null,
      // Position increments: "will" shares the original token's position.
      new int[] { 1, 0, 1, 1, 1 },
      null, false);
  a.close();
}
-----
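
To make the expected token stream concrete without a Lucene dependency, here is a minimal plain-Java sketch of what the test asserts: the unsplit token comes first, then its parts, with the first part stacked on the original's position. It is an illustration only, not Lucene code and not how WordDelimiterGraphFilter is implemented. The class name EmailTokenSketch, its two methods, and the address jane.doe@example.org are hypothetical, and the sketch splits only on non-alphanumeric characters, ignoring the case-change and letter/digit splitting that the flags above also enable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration only; this is not Lucene code.
class EmailTokenSketch {

    // Returns the original token first (the PRESERVE_ORIGINAL idea),
    // followed by its parts split on runs of non-alphanumeric characters.
    static List<String> tokens(String token) {
        List<String> out = new ArrayList<>();
        out.add(token);
        for (String part : token.split("[^\\p{Alnum}]+")) {
            // Skip empty pieces, and skip the part if nothing was split off.
            if (!part.isEmpty() && !part.equals(token)) {
                out.add(part);
            }
        }
        return out;
    }

    // Position increments for the list above: the first part gets 0 so it
    // occupies the same position as the original; every other token gets 1.
    static int[] positionIncrements(int tokenCount) {
        int[] incs = new int[tokenCount];
        for (int i = 0; i < tokenCount; i++) {
            incs[i] = (i == 1) ? 0 : 1;
        }
        return incs;
    }

    public static void main(String[] args) {
        List<String> t = tokens("jane.doe@example.org"); // hypothetical address
        System.out.println(t);
        // prints [jane.doe@example.org, jane, doe, example, org]
        System.out.println(Arrays.toString(positionIncrements(t.size())));
        // prints [1, 0, 1, 1, 1]
    }
}
```

Because the unsplit address is kept as its own term, a term query for one full address should match only documents containing exactly that address, while a query for a single part such as "doe" matches through the part tokens.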

--
Steve
www.lucidworks.com

> On Jun 15, 2017, at 8:53 AM, Kumaran Ramasubramanian <[email protected]> 
> wrote:
> 
> Hi All,
> 
> I want to index email fields as both analyzed and not analyzed, using a
> custom analyzer.
> 
> For example,
> [email protected]
> [email protected]
> 
> That is, index [email protected] as a single token as well as analyzed
> tokens in the same email field.
> 
> 
> My existing custom analyzer:
> 
> public class CustomSearchAnalyzer extends StopwordAnalyzerBase
> {
> 
>    public CustomSearchAnalyzer(Version matchVersion, Reader stopwords)
>            throws Exception
>    {
>        super(matchVersion, loadStopwordSet(stopwords, matchVersion));
>    }
> 
>    @Override
>    protected Analyzer.TokenStreamComponents createComponents(
>            final String fieldName, final Reader reader)
>    {
>        final ClassicTokenizer src =
>                new ClassicTokenizer(getVersion(), reader);
>        src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
>        TokenStream tok = new ClassicFilter(src);
>        tok = new LowerCaseFilter(getVersion(), tok);
>        tok = new StopFilter(getVersion(), tok, stopwords);
>        // ASCII folding enables accent-insensitive search
>        tok = new ASCIIFoldingFilter(tok);
> 
>        return new Analyzer.TokenStreamComponents(src, tok)
>        {
>            @Override
>            protected void setReader(final Reader reader) throws IOException
>            {
>                src.setMaxTokenLength(ClassicAnalyzer.DEFAULT_MAX_TOKEN_LENGTH);
>                super.setReader(reader);
>            }
>        };
>    }
> }
> 
> 
> So I want to achieve the following:
> 
> 1. If I search using the query "[email protected]", records with
> [email protected] should not be returned.
> 2. I should also be able to search that field using the query "smith".
> 3. If possible, email values in all other fields should be detected and
> tokenized the same way.
> 
> How can I achieve points 1 and 2 using UAX29URLEmailTokenizer? How can I add
> UAX29URLEmailTokenizer to my existing custom analyzer without using a
> separate email analyzer (per-field analyzer) for the email field, so that I
> can apply this tokenizer to email terms in all fields?
> 
> 
> 
> -
> Kumaran R

