Hi guys
Currently, I'm looking into the rules of StandardTokenizer, but I have run 
into some problems.
As the docs say, StandardTokenizer implements the Word Break rules from 
the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex 
#29. Its scanner is generated by JFlex, a lexer/scanner generator. 

In StandardTokenizerImpl.jflex, the regular expressions are written as 
follows:
"
HangulEx            = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] [\p{WB:Format}\p{WB:Extend}]*
HebrewOrALetterEx   = [\p{WB:HebrewLetter}\p{WB:ALetter}]                       [\p{WB:Format}\p{WB:Extend}]*
NumericEx           = [\p{WB:Numeric}[\p{Blk:HalfAndFullForms}&&\p{Nd}]]        [\p{WB:Format}\p{WB:Extend}]*
KatakanaEx          = \p{WB:Katakana}                                           [\p{WB:Format}\p{WB:Extend}]*
MidLetterEx         = [\p{WB:MidLetter}\p{WB:MidNumLet}\p{WB:SingleQuote}]      [\p{WB:Format}\p{WB:Extend}]*
......
"
What do these definitions mean, for example HangulEx or NumericEx?
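To make my current guess concrete: each ...Ex macro looks like one base character class followed by any number of Format/Extend characters, which seems related to rule WB4 in UAX #29 (Extend and Format characters are ignored after a base character). Below is a rough sketch of that shape using only java.util.regex. It is an approximation, not the JFlex grammar: java.util.regex has no \p{WB:...} properties, so \p{Nd}, \p{Cf}, \p{M} and \p{L} are stand-ins I picked, and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class ExMacroSketch {
    // Assumed stand-in for [\p{WB:Format}\p{WB:Extend}]*:
    // format controls (Cf) and combining marks (M), zero or more times.
    static final String FORMAT_OR_EXTEND = "[\\p{Cf}\\p{M}]*";

    // Assumed stand-in for NumericEx: one decimal digit (Nd covers full-width
    // digits too), then trailing Format/Extend characters.
    static final Pattern NUMERIC_EX =
        Pattern.compile("\\p{Nd}" + FORMAT_OR_EXTEND);

    // Assumed stand-in for HangulEx: "&&" is class intersection, so this is
    // "a letter that is also in the Hangul script", then Format/Extend.
    static final Pattern HANGUL_EX =
        Pattern.compile("[\\p{IsHangul}&&\\p{L}]" + FORMAT_OR_EXTEND);

    static boolean matchesNumericEx(String s) {
        return NUMERIC_EX.matcher(s).matches();
    }

    static boolean matchesHangulEx(String s) {
        return HANGUL_EX.matcher(s).matches();
    }

    public static void main(String[] args) {
        // A digit alone, and a digit carrying a combining acute accent (U+0301).
        System.out.println(matchesNumericEx("7"));
        System.out.println(matchesNumericEx("7\u0301"));
        // Two base digits: each ...Ex macro covers a single base character.
        System.out.println(matchesNumericEx("77"));
        // A Hangul syllable (U+D55C) versus a Latin letter.
        System.out.println(matchesHangulEx("\uD55C"));
        System.out.println(matchesHangulEx("a"));
    }
}
```

If this reading is right, the macros only describe single "extended" characters, and the actual token rules further down in the .jflex file combine them into words and numbers.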
In ClassicTokenizerImpl.jflex, numbers are expressed like this:
"
P           = ("_"|"-"|"/"|"."|",")
NUM        = ({ALPHANUM} {P} {HAS_DIGIT}
           | {HAS_DIGIT} {P} {ALPHANUM}
           | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
           | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
           | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
"
This is easy to understand: strings such as '29.3', '29-3', and '29_3' all 
match NUM. (A bare '29' contains no {P} separator, so none of the NUM 
alternatives apply to it.)
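A quick way to check which strings the NUM macro matches on its own is to transliterate it into java.util.regex. This is only a sketch under assumptions: ALPHANUM and HAS_DIGIT are not shown in the excerpt above, so I use ASCII-only stand-ins for them, the class name is made up, and the token type the tokenizer finally assigns also depends on the other rules in the file, which are not modeled here.

```java
import java.util.regex.Pattern;

class ClassicNumSketch {
    // Assumed ASCII-only stand-ins; the real macros cover Unicode letters and digits.
    static final String ALPHANUM = "[A-Za-z0-9]+";
    static final String HAS_DIGIT = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*";
    // P is taken directly from the excerpt: "_", "-", "/", "." or ",".
    static final String P = "[_\\-/.,]";

    // The six alternatives of NUM, in the same order as in the excerpt.
    static final Pattern NUM = Pattern.compile(
        ALPHANUM + P + HAS_DIGIT
        + "|" + HAS_DIGIT + P + ALPHANUM
        + "|" + ALPHANUM + "(?:" + P + HAS_DIGIT + P + ALPHANUM + ")+"
        + "|" + HAS_DIGIT + "(?:" + P + ALPHANUM + P + HAS_DIGIT + ")+"
        + "|" + ALPHANUM + P + HAS_DIGIT + "(?:" + P + ALPHANUM + P + HAS_DIGIT + ")+"
        + "|" + HAS_DIGIT + P + ALPHANUM + "(?:" + P + HAS_DIGIT + P + ALPHANUM + ")+");

    // True if the whole string matches the NUM macro in isolation.
    static boolean matchesNum(String s) {
        return NUM.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"29", "29.3", "29-3", "29_3", "1-2-3", "a.b"}) {
            System.out.println(s + " -> " + matchesNum(s));
        }
    }
}
```

Using matches() on the whole string sidesteps the difference between JFlex's longest-match behaviour and Java's backtracking alternation, since only full-string matches are reported.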



I have read Unicode Standard Annex #29 (Unicode Text Segmentation), Unicode 
Technical Standard #18 (Unicode Regular Expressions), and Unicode Standard 
Annex #44 (Unicode Character Database), but they contain too much information 
and are hard to understand.
Does anyone have a reference for these kinds of regular expressions, or can 
anyone tell me where to find the meaning of these Unicode property expressions?


Thanks.
