Hi dr,

Unicode’s character property model is described here: <http://unicode.org/reports/tr23/>.
Wikipedia has a description of Unicode character properties: <https://en.wikipedia.org/wiki/Unicode_character_property>

JFlex allows you to refer to the set of characters that have a given Unicode property using the \p{PropertyName} syntax. In the case of the HangulEx macro:

    HangulEx = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] [\p{WB:Format}\p{WB:Extend}]*

This matches a Hangul script character (\p{Script:Hangul})[1] that also has either the “ALetter” or the “Hebrew_Letter” Word-Break property, followed by zero or more characters that have either the “Format” or the “Extend” Word-Break property[2].

Some helpful resources:

* Character code charts organized by Unicode block: <http://www.unicode.org/charts/>
* UnicodeSet utility: <http://unicode.org/cldr/utility/list-unicodeset.jsp> - note that this utility supports a different regex syntax from JFlex - click on the “help” link for more info.

[1] All characters matching \p{Script:Hangul}: <http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{Script:Hangul}>

[2] Word-Break properties, which in JFlex can be referred to with the abbreviation “WB:” in \p{WB:property-name}, are described in the table at <http://www.unicode.org/reports/tr29/#Default_Word_Boundaries>.

--
Steve
www.lucidworks.com

> On Jun 16, 2016, at 7:01 AM, dr <bfore...@126.com> wrote:
>
> Hi guys
> Currently, I'm looking into the rules of StandardTokenizer, but I have run into some problems.
> As the docs say, StandardTokenizer implements the Word Break rules from
> the Unicode Text Segmentation algorithm, as specified in Unicode Standard
> Annex #29. It is also generated by JFlex, a lexer/scanner generator.
>
> In StandardTokenizerImpl.jflex, the regular expressions are expressed as
> follows:
> "
> HangulEx = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] [\p{WB:Format}\p{WB:Extend}]*
> HebrewOrALetterEx = [\p{WB:HebrewLetter}\p{WB:ALetter}] [\p{WB:Format}\p{WB:Extend}]*
> NumericEx = [\p{WB:Numeric}[\p{Blk:HalfAndFullForms}&&\p{Nd}]] [\p{WB:Format}\p{WB:Extend}]*
> KatakanaEx = \p{WB:Katakana} [\p{WB:Format}\p{WB:Extend}]*
> MidLetterEx = [\p{WB:MidLetter}\p{WB:MidNumLet}\p{WB:SingleQuote}] [\p{WB:Format}\p{WB:Extend}]*
> ......
> "
> What do they mean, for example HangulEx or NumericEx?
> In ClassicTokenizerImpl.jflex, NUM is expressed like this:
> "
> P = ("_"|"-"|"/"|"."|",")
> NUM = ({ALPHANUM} {P} {HAS_DIGIT}
>      | {HAS_DIGIT} {P} {ALPHANUM}
>      | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
>      | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>      | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>      | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
> "
> This is easy to understand. '29', '29.3', '29-3', '29_3' will all be
> tokenized as NUMBERS.
>
> I read Unicode Standard Annex #29 (UNICODE TEXT SEGMENTATION), Unicode
> Standard Annex #18 (UNICODE REGULAR EXPRESSIONS), and Unicode Standard
> Annex #44 (UNICODE CHARACTER DATABASE), but they contain too much
> information and are hard to understand.
> Does anyone have a reference for these kinds of regular expressions, or can
> anyone tell me where to find the meanings of these Unicode regular
> expressions?
>
> Thanks.
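
To get a feel for how the HangulEx pattern in the reply reads, here is a rough sketch in plain Java. It is only an approximation: java.util.regex does not expose the Word_Break property, so \p{L} (general category Letter) stands in for WB:ALetter/WB:Hebrew_Letter, and [\p{M}\p{Cf}] (marks and format characters) stands in for WB:Extend/WB:Format. The class and method names are made up for illustration and are not part of Lucene or JFlex.

```java
import java.util.regex.Pattern;

public class HangulExSketch {
    // Approximation of the JFlex HangulEx macro (assumption, not the real rule):
    //   [\p{IsHangul}&&[\p{L}]]  a Hangul-script character that is also a letter
    //                            (\p{L} stands in for WB:ALetter / WB:Hebrew_Letter)
    //   [\p{M}\p{Cf}]*           zero or more marks or format characters
    //                            (stand-ins for WB:Extend / WB:Format)
    static final Pattern HANGUL_EX =
        Pattern.compile("[\\p{IsHangul}&&[\\p{L}]][\\p{M}\\p{Cf}]*");

    // Returns true if the whole string matches the approximated pattern.
    static boolean matchesHangulEx(String s) {
        return HANGUL_EX.matcher(s).matches();
    }

    public static void main(String[] args) {
        // U+D55C is a Hangul syllable; U+0301 is a combining acute accent.
        System.out.println(matchesHangulEx("\uD55C"));        // true
        System.out.println(matchesHangulEx("\uD55C\u0301"));  // true
        System.out.println(matchesHangulEx("A"));             // false: not Hangul script
    }
}
```

The && inside the character class is set intersection, the same operator JFlex uses in the macro. For exact Word-Break behaviour you would still need JFlex itself or a library that ships the Word_Break property data.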