Hi guys,

I'm currently looking into the rules of StandardTokenizer, but I've run into a problem. As the docs say, StandardTokenizer implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29. Its scanner is also generated by JFlex, a lexer/scanner generator.

In StandardTokenizerImpl.jflex, the regular expressions are written as follows:

    HangulEx          = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] [\p{WB:Format}\p{WB:Extend}]*
    HebrewOrALetterEx = [\p{WB:HebrewLetter}\p{WB:ALetter}] [\p{WB:Format}\p{WB:Extend}]*
    NumericEx         = [\p{WB:Numeric}[\p{Blk:HalfAndFullForms}&&\p{Nd}]] [\p{WB:Format}\p{WB:Extend}]*
    KatakanaEx        = \p{WB:Katakana} [\p{WB:Format}\p{WB:Extend}]*
    MidLetterEx       = [\p{WB:MidLetter}\p{WB:MidNumLet}\p{WB:SingleQuote}] [\p{WB:Format}\p{WB:Extend}]*
    ......

What do these mean, for example HangulEx or NumericEx?

In ClassicTokenizerImpl.jflex, the rule for numbers is written like this:

    P   = ("_"|"-"|"/"|"."|",")
    NUM = ({ALPHANUM} {P} {HAS_DIGIT}
         | {HAS_DIGIT} {P} {ALPHANUM}
         | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
         | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
         | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
         | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)

This is easy to understand: as I read it, '29.3', '29-3', and '29_3' will all be tokenized as numbers. (A bare '29' contains no {P} separator, so I assume it ends up as ALPHANUM rather than NUM.)

I have read Unicode Standard Annex #29 (Unicode Text Segmentation), Unicode Standard Annex #18 (Unicode Regular Expressions), and Unicode Standard Annex #44 (Unicode Character Database), but they contain too much information and are hard to understand. Does anyone have a reference for these kinds of regular expressions, or can you tell me where to find the meaning of these Unicode regular expressions?

Thanks.
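
P.S. To make the question concrete, here is a rough sketch of my current reading of both snippets, written with plain java.util.regex instead of JFlex. It is only a guess under several assumptions. The ALPHANUM and HAS_DIGIT patterns are my own stand-ins, since their definitions are not in the excerpt above. java.util.regex has no \p{WB:...} properties, so \p{Nd}, \p{Cf}, and \p{M} are only loose approximations of WB:Numeric, WB:Format, and WB:Extend. The class and method names are made up for illustration and are not part of Lucene.

```java
import java.util.regex.Pattern;

final class TokenizerRuleSketch {
    // Assumed stand-ins for the ALPHANUM and HAS_DIGIT macros (not shown in the excerpt).
    private static final String AN = "[\\p{L}\\p{Nd}]+";
    private static final String HD = "[\\p{L}\\p{Nd}]*\\p{Nd}[\\p{L}\\p{Nd}]*";
    // P = ("_"|"-"|"/"|"."|",") from the ClassicTokenizerImpl.jflex excerpt.
    private static final String P = "[_\\-/.,]";

    // The six NUM alternatives, translated one by one from the excerpt.
    private static final Pattern NUM = Pattern.compile(
              AN + P + HD
        + "|" + HD + P + AN
        + "|" + AN + "(?:" + P + HD + P + AN + ")+"
        + "|" + HD + "(?:" + P + AN + P + HD + ")+"
        + "|" + AN + P + HD + "(?:" + P + AN + P + HD + ")+"
        + "|" + HD + P + AN + "(?:" + P + HD + P + AN + ")+");

    // My guess at the shape of NumericEx: one digit-like base character
    // followed by zero or more format characters or combining marks.
    // \p{Nd}, \p{Cf} and \p{M} only approximate the WB:* properties.
    private static final Pattern NUMERIC_EX =
        Pattern.compile("\\p{Nd}[\\p{Cf}\\p{M}]*");

    // True if the whole string matches the translated NUM rule.
    static boolean matchesNum(String s) {
        return NUM.matcher(s).matches();
    }

    // True if the whole string is one approximated NumericEx unit.
    static boolean matchesNumericEx(String s) {
        return NUMERIC_EX.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"29", "29.3", "29-3", "29_3"}) {
            System.out.println(s + " matches NUM: " + matchesNum(s));
        }
        // A digit followed by U+0301 COMBINING ACUTE ACCENT.
        System.out.println("digit + combining mark is one NumericEx unit: "
            + matchesNumericEx("2\u0301"));
    }
}
```

If this reading is right, the Ex suffix just means "a base character followed by any number of Extend or Format characters", and NUM needs at least one {P} separator, so a bare '29' would not match it. Please correct me if I have this wrong.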