Thank you so much, Steve. Your reply is very helpful.






At 2016-06-16 23:01:18, "Steve Rowe" <sar...@gmail.com> wrote:
>Hi dr,
>
>Unicode’s character property model is described here: 
><http://unicode.org/reports/tr23/>.
>
>Wikipedia has a description of Unicode character properties: 
><https://en.wikipedia.org/wiki/Unicode_character_property>
>
>JFlex allows you to refer to the set of characters that have a given Unicode 
>property using the \p{PropertyName} syntax.  In the case of the HangulEx macro:
>
>  HangulEx = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] [\p{WB:Format}\p{WB:Extend}]*
>
>This matches a Hangul script character (\p{Script:Hangul})[1] that also has 
>either the “ALetter” or the “Hebrew_Letter” Word-Break property, followed by 
>zero or more characters that have either the “Format” or “Extend” Word-Break 
>property[2].  
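>
>If it helps to see the rule in action, here is a minimal sketch (my own 
>illustration, not from the JFlex file itself; it assumes Lucene 6.x, where 
>StandardTokenizer lives in lucene-core, and the sample text is arbitrary). 
>StandardTokenizer reports tokens matched by HangulEx with the token type 
>“<HANGUL>”:
>
>  import java.io.StringReader;
>  import org.apache.lucene.analysis.standard.StandardTokenizer;
>  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>  import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
>
>  public class HangulExDemo {
>    public static void main(String[] args) throws Exception {
>      // Tokenize mixed Hangul/Latin/digit text and print each token's type.
>      try (StandardTokenizer tokenizer = new StandardTokenizer()) {
>        tokenizer.setReader(new StringReader("한국어 hello 123"));
>        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
>        TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
>        tokenizer.reset();
>        while (tokenizer.incrementToken()) {
>          // Expected output (roughly): 한국어 -> <HANGUL>, hello -> <ALPHANUM>, 123 -> <NUM>
>          System.out.println(term.toString() + " -> " + type.type());
>        }
>        tokenizer.end();
>      }
>    }
>  }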
>
>Some helpful resources:
>
>* Character code charts organized by Unicode block: 
><http://www.unicode.org/charts/>
>* UnicodeSet utility: <http://unicode.org/cldr/utility/list-unicodeset.jsp> - 
>note that this utility supports a different regex syntax from JFlex - click on 
>the “help” link for more info.
>
>[1] All characters matching \p{Script:Hangul}: 
><http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{Script:Hangul}>
>[2] Word-Break properties, which in JFlex can be referred to with the 
>abbreviation “WB:” in \p{WB:property-name}, are described in the table at 
><http://www.unicode.org/reports/tr29/#Default_Word_Boundaries>.
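>
>If you prefer to inspect these properties programmatically rather than through 
>the web utilities above, a rough sketch along these lines should work (this 
>assumes ICU4J, i.e. the com.ibm.icu classes, is on your classpath; the sample 
>characters are arbitrary).  It prints the Script and Word_Break property value 
>of each code point:
>
>  import com.ibm.icu.lang.UCharacter;
>  import com.ibm.icu.lang.UProperty;
>  import com.ibm.icu.lang.UScript;
>
>  public class PropertyLookup {
>    public static void main(String[] args) {
>      String sample = "한A1'";  // arbitrary sample characters
>      int i = 0;
>      while (i < sample.length()) {
>        int cp = sample.codePointAt(i);
>        // Script property, e.g. Hangul, Latin, Common
>        String script = UScript.getName(UScript.getScript(cp));
>        // Word_Break property, e.g. ALetter, Numeric, Single_Quote
>        int wb = UCharacter.getIntPropertyValue(cp, UProperty.WORD_BREAK);
>        String wbName = UCharacter.getPropertyValueName(
>            UProperty.WORD_BREAK, wb, UProperty.NameChoice.LONG);
>        System.out.printf("U+%04X  Script=%s  WB=%s%n", cp, script, wbName);
>        i += Character.charCount(cp);
>      }
>    }
>  }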
>
>--
>Steve
>www.lucidworks.com
>
>
>> On Jun 16, 2016, at 7:01 AM, dr <bfore...@126.com> wrote:
>> 
>> Hi guys,
>>   Currently, I'm looking into the rules of StandardTokenizer, but I've run 
>> into some problems.
>>   As the docs say, StandardTokenizer implements the Word Break rules from 
>> the Unicode Text Segmentation algorithm, as specified in Unicode Standard 
>> Annex #29. It is also generated by JFlex, a lexer/scanner generator. 
>> 
>>   In StandardTokenizerImpl.jflex, the regular expressions are expressed as 
>> follows:
>> "
>> HangulEx            = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]]  [\p{WB:Format}\p{WB:Extend}]*
>> HebrewOrALetterEx   = [\p{WB:HebrewLetter}\p{WB:ALetter}]                        [\p{WB:Format}\p{WB:Extend}]*
>> NumericEx           = [\p{WB:Numeric}[\p{Blk:HalfAndFullForms}&&\p{Nd}]]         [\p{WB:Format}\p{WB:Extend}]*
>> KatakanaEx          = \p{WB:Katakana}                                            [\p{WB:Format}\p{WB:Extend}]*
>> MidLetterEx         = [\p{WB:MidLetter}\p{WB:MidNumLet}\p{WB:SingleQuote}]       [\p{WB:Format}\p{WB:Extend}]*
>> ......
>> "
>> What do these mean, e.g. HangulEx or NumericEx?
>> In ClassicTokenizerImpl.jflex, NUM is expressed like this:
>> "
>> P           = ("_"|"-"|"/"|"."|",")
>> NUM        = ({ALPHANUM} {P} {HAS_DIGIT}
>>           | {HAS_DIGIT} {P} {ALPHANUM}
>>           | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
>>           | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>>           | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>>           | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
>> "
>> This is easy to understand: '29', '29.3', '29-3', '29_3' will all be 
>> tokenized as NUMBERS.
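>> 
>> (For what it's worth, a small test program like the following can print the 
>> type constant ClassicTokenizer assigns to each token, e.g. "<ALPHANUM>", 
>> "<NUM>" or "<HOST>". This is just my own sketch and assumes 
>> lucene-analyzers-common is on the classpath.)
>> 
>>   import java.io.StringReader;
>>   import org.apache.lucene.analysis.standard.ClassicTokenizer;
>>   import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
>>   import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
>> 
>>   public class ClassicTypesDemo {
>>     public static void main(String[] args) throws Exception {
>>       try (ClassicTokenizer tokenizer = new ClassicTokenizer()) {
>>         tokenizer.setReader(new StringReader("29 29.3 29-3 29_3"));
>>         CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
>>         TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);
>>         tokenizer.reset();
>>         while (tokenizer.incrementToken()) {
>>           // Print each token together with the type the grammar assigned it
>>           System.out.println(term.toString() + " -> " + type.type());
>>         }
>>         tokenizer.end();
>>       }
>>     }
>>   }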
>> 
>> 
>> 
>> I have read Unicode Standard Annex #29 UNICODE TEXT SEGMENTATION, Unicode 
>> Standard Annex #18 UNICODE REGULAR EXPRESSIONS, and Unicode Standard Annex #44 
>> UNICODE CHARACTER DATABASE, but they contain too much information and are 
>> hard to understand.
>> Does anyone have a reference for these kinds of regular expressions, or can 
>> you tell me where to find the meanings of these Unicode regular expressions?
>> 
>> 
>> Thanks.
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>For additional commands, e-mail: java-user-h...@lucene.apache.org
>
