Re: Is there a list of "special" characters for standard analyzer?

ohaya Fri, 31 Jul 2009 08:00:53 -0700

Hi Ahmet,

Thanks for the clarification and information!  That was exactly what I was 
looking for.


Jim


---- AHMET ARSLAN <[email protected]> wrote: 
> 
> > I guess that the obvious question is "Which characters are
> > considered 'punctuation characters'?".
>  
> Punctuation = ("_"|"-"|"/"|"."|",")
> 
> > In particular, does the analyzer consider "=" (equal) and
> > ":" (colon) to be punctuation characters?
> 
> ":" is a special character in QueryParser (if you are using it). If you want 
> to search for it, you need to escape it first. At index time this character 
> is ignored, like the punctuation characters. The string ahmet:arslan will 
> produce two tokens, ahmet and arslan. The tokenizer also breaks words at the 
> "=" character, at both query and index time.
> 
> If you want to understand the behavior of StandardTokenizer, you need to look 
> at the file StandardTokenizerImpl.jflex. It recognizes each of the following 
> as a single token: {ALPHANUM}, {APOSTROPHE}, {ACRONYM}, {COMPANY}, {EMAIL}, 
> {HOST}, {NUM}, {CJ}, {ACRONYM_DEP}, and ignores the rest. The file defines 
> these token types in a notation similar to regular expressions. You can 
> change the behavior of StandardTokenizer by editing this file and 
> regenerating StandardTokenizerImpl.java from it. There is also another jflex 
> file named WikipediaTokenizerImpl.jflex. By looking at it you can see how new 
> token types can be added. 
> 
> Ahmet
> 
> 
>       
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
> 
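
For anyone reading this in the archive, here is a rough, self-contained Java sketch of the two points above about ":" and "=". It does not use Lucene at all. The class name, both method names, and the list of query syntax characters are assumptions made for illustration, and the split method only models the two characters discussed in this thread, not the real StandardTokenizer grammar.

```java
import java.util.ArrayList;
import java.util.List;

public class ColonEqualsSketch {
    // Assumed list of characters that the query syntax treats as operators.
    // Check the QueryParser documentation for your Lucene version.
    private static final String QUERY_SYNTAX_CHARS = "\\+-!(){}[]^\"~*?:";

    // Hypothetical helper: put a backslash in front of every syntax character
    // so that it is searched for literally instead of being parsed.
    static String escapeQuerySyntax(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (QUERY_SYNTAX_CHARS.indexOf(c) >= 0 || c == '&' || c == '|') {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Toy stand-in for the tokenizer: break the text at ":" and "=" only.
    static List<String> splitOnColonAndEquals(String s) {
        List<String> tokens = new ArrayList<>();
        for (String part : s.split("[:=]")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(escapeQuerySyntax("ahmet:arslan"));     // ahmet\:arslan
        System.out.println(splitOnColonAndEquals("ahmet:arslan")); // [ahmet, arslan]
        System.out.println(splitOnColonAndEquals("key=value"));    // [key, value]
    }
}
```

In real code you would probably not write the escaping by hand: Lucene's QueryParser has a static escape(String) method for this purpose, and the actual token boundaries are whatever StandardTokenizerImpl.jflex defines.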

