Thank you so much, Steve. Your reply is very helpful.
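
To check my understanding of the HangulEx explanation quoted below, here is a rough Java sketch of the same idea using java.util.regex. It is only an approximation: java.util.regex has no Word_Break property, so I am assuming \p{IsAlphabetic} as a stand-in for WB:ALetter/WB:Hebrew_Letter, \p{Cf} for WB:Format, and \p{Mn}/\p{Me} for WB:Extend. The class and method names are made up for illustration and are not part of Lucene or JFlex.

```java
import java.util.regex.Pattern;

class HangulExSketch {
    // Approximation of the JFlex macro
    //   HangulEx = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]]
    //              [\p{WB:Format}\p{WB:Extend}]*
    // ASSUMPTION: the Word_Break classes are replaced by similar (not identical)
    // sets that java.util.regex does support:
    //   WB:ALetter / WB:Hebrew_Letter -> \p{IsAlphabetic}
    //   WB:Format                     -> \p{Cf}
    //   WB:Extend                     -> \p{Mn} and \p{Me}
    static final Pattern HANGUL_EX = Pattern.compile(
        "[\\p{script=Hangul}&&\\p{IsAlphabetic}]"  // one Hangul-script letter
        + "[\\p{Cf}\\p{Mn}\\p{Me}]*");             // zero or more trailing format/extend chars

    // True if the whole string is one Hangul letter plus optional trailing chars.
    public static boolean matchesHangulEx(String s) {
        return HANGUL_EX.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesHangulEx("\uD55C"));        // Hangul syllable U+D55C
        System.out.println(matchesHangulEx("\uD55C\u200D"));  // same, followed by ZERO WIDTH JOINER (Cf)
        System.out.println(matchesHangulEx("a"));             // Latin letter, not Hangul script
    }
}
```

The && inside the brackets is set intersection, as in the JFlex macro, so the first character has to be in the Hangul script and also be a letter. With the stand-in properties above, the first two inputs should match and the Latin letter should not.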
At 2016-06-16 23:01:18, "Steve Rowe" <sar...@gmail.com> wrote:
>Hi dr,
>
>Unicode’s character property model is described here:
><http://unicode.org/reports/tr23/>.
>
>Wikipedia has a description of Unicode character properties:
><https://en.wikipedia.org/wiki/Unicode_character_property>
>
>JFlex allows you to refer to the set of characters that have a given Unicode
>property using the \p{PropertyName} syntax. In the case of the HangulEx macro:
>
>  HangulEx = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]]
>             [\p{WB:Format}\p{WB:Extend}]*
>
>This matches a Hangul script character (\p{Script:Hangul})[1] that also either
>has the Word-Break property “ALetter” or “Hebrew_Letter”, followed by zero or
>more characters that have either the “Format” or “Extend” Word-Break
>properties[2].
>
>Some helpful resources:
>
>* Character code charts organized by Unicode block:
><http://www.unicode.org/charts/>
>* UnicodeSet utility: <http://unicode.org/cldr/utility/list-unicodeset.jsp> -
>note that this utility supports a different regex syntax from JFlex - click on
>the “help” link for more info.
>
>[1] All characters matching \p{Script:Hangul}:
><http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{Script:Hangul}>
>[2] Word-Break properties, which in JFlex can be referred to with the
>abbreviation “WB:” in \p{WB:property-name}, are described in the table at
><http://www.unicode.org/reports/tr29/#Default_Word_Boundaries>.
>
>--
>Steve
>www.lucidworks.com
>
>
>> On Jun 16, 2016, at 7:01 AM, dr <bfore...@126.com> wrote:
>>
>> Hi guys
>> Currently, I'm looking into the rules of StandardTokenizer, but have run
>> into some problems.
>> As the docs say, StandardTokenizer implements the Word Break rules from
>> the Unicode Text Segmentation algorithm, as specified in Unicode Standard
>> Annex #29. Also, it is generated by JFlex, a lexer/scanner generator.
>>
>> In StandardTokenizerImpl.jflex, the regular expressions are expressed as
>> follows:
>> "
>> HangulEx = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]]
>>     [\p{WB:Format}\p{WB:Extend}]*
>> HebrewOrALetterEx = [\p{WB:HebrewLetter}\p{WB:ALetter}]
>>     [\p{WB:Format}\p{WB:Extend}]*
>> NumericEx = [\p{WB:Numeric}[\p{Blk:HalfAndFullForms}&&\p{Nd}]]
>>     [\p{WB:Format}\p{WB:Extend}]*
>> KatakanaEx = \p{WB:Katakana}
>>     [\p{WB:Format}\p{WB:Extend}]*
>> MidLetterEx = [\p{WB:MidLetter}\p{WB:MidNumLet}\p{WB:SingleQuote}]
>>     [\p{WB:Format}\p{WB:Extend}]*
>> ......
>> "
>> What do they mean, for example HangulEx or NumericEx?
>> In ClassicTokenizerImpl.jflex, for num, it is expressed like this:
>> "
>> P = ("_"|"-"|"/"|"."|",")
>> NUM = ({ALPHANUM} {P} {HAS_DIGIT}
>>      | {HAS_DIGIT} {P} {ALPHANUM}
>>      | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
>>      | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>>      | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>>      | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
>> "
>> This is easy to understand. '29', '29.3', '29-3', '29_3' will all be
>> tokenized as NUMBERS.
>>
>> I read Unicode Standard Annex #29 (Unicode Text Segmentation), Unicode
>> Standard Annex #18 (Unicode Regular Expressions), and Unicode Standard
>> Annex #44 (Unicode Character Database), but they include too much
>> information and are hard to understand.
>> Does anyone have a reference for these kinds of regular expressions, or can
>> anyone tell me where to find the meanings of these Unicode regular
>> expressions?
>>
>>
>> Thanks.
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>For additional commands, e-mail: java-user-h...@lucene.apache.org
>