Re: Some questions about StandardTokenizer and UNICODE Regular Expressions

Steve Rowe Thu, 16 Jun 2016 08:01:58 -0700

Hi dr,

Unicode’s character property model is described here: 
<http://unicode.org/reports/tr23/>.


Wikipedia has a description of Unicode character properties: 
<https://en.wikipedia.org/wiki/Unicode_character_property>

JFlex allows you to refer to the set of characters that have a given Unicode 
property using the \p{PropertyName} syntax.  In the case of the HangulEx macro:

  HangulEx = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] [\p{WB:Format}\p{WB:Extend}]*

This matches a Hangul script character (\p{Script:Hangul})[1] that also has 
either the “ALetter” or the “Hebrew_Letter” Word-Break property (&& is set 
intersection in a JFlex character class), followed by zero or more characters 
that have either the “Format” or the “Extend” Word-Break property[2].
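
For experimenting outside JFlex, here is a rough java.util.regex sketch of the 
same shape: one base character from an intersected set, then zero or more 
trailing characters. It is not the Lucene grammar. java.util.regex has no 
Word-Break property, so as an assumption \p{IsAlphabetic} stands in for 
[\p{WB:ALetter}\p{WB:Hebrew_Letter}] and [\p{Cf}\p{M}] stands in for 
[\p{WB:Format}\p{WB:Extend}]; those sets are similar but not identical. The 
class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class HangulExSketch {
    // Approximation of the JFlex HangulEx macro (ASSUMPTION, see above):
    //   [\p{IsHangul}&&\p{IsAlphabetic}]  Hangul-script AND alphabetic base char
    //   [\p{Cf}\p{M}]*                    zero or more format chars / marks
    static final Pattern HANGUL_EX =
        Pattern.compile("[\\p{IsHangul}&&\\p{IsAlphabetic}][\\p{Cf}\\p{M}]*");

    // True if the whole string is one base character plus its trailing tail.
    static boolean isHangulEx(String s) {
        return HANGUL_EX.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isHangulEx("\uD55C"));       // Hangul syllable
        System.out.println(isHangulEx("\uD55C\u0301")); // syllable + combining mark
        System.out.println(isHangulEx("A"));            // Latin letter
    }
}
```

The first two calls should print true and the third false, since "A" is not in 
the Hangul script.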

Some helpful resources:

* Character code charts organized by Unicode block: 
<http://www.unicode.org/charts/>
* UnicodeSet utility: <http://unicode.org/cldr/utility/list-unicodeset.jsp> - 
note that this utility supports a different regex syntax from JFlex - click on 
the “help” link for more info.

[1] All characters matching \p{Script:Hangul}: 
<http://unicode.org/cldr/utility/list-unicodeset.jsp?a=\p{Script:Hangul}>
[2] Word-Break properties, which in JFlex can be referred to with the 
abbreviation “WB:” in \p{WB:property-name}, are described in the table at 
<http://www.unicode.org/reports/tr29/#Default_Word_Boundaries>.
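
To get a feel for word boundaries without building a JFlex grammar, the JDK's 
java.text.BreakIterator can be used as a sketch. Note the assumptions: the JDK 
applies its own word-break rules, which are not guaranteed to match UAX #29 or 
StandardTokenizer exactly, and the letter-or-digit filter below is only a 
simple way to drop whitespace and punctuation segments. The class and method 
names are made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBreakSketch {
    // Split text at the JDK's word boundaries and keep only the segments
    // that start with a letter or digit (drops spaces and punctuation).
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE;
                start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(words("hello world")); // [hello, world]
    }
}
```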

--
Steve
www.lucidworks.com


> On Jun 16, 2016, at 7:01 AM, dr <bfore...@126.com> wrote:
> 
> Hi guys
>   Currently, I'm looking into the rules of StandardTokenizer, but I have run 
> into some problems.
>    As the docs say, StandardTokenizer implements the Word Break rules from 
> the Unicode Text Segmentation algorithm, as specified in Unicode Standard 
> Annex #29. It is also generated by JFlex, a lexer/scanner generator. 
> 
>   In StandardTokenizerImpl.jflex, the regular expressions are expressed as 
> follows
>     "
> HangulEx            = [\p{Script:Hangul}&&[\p{WB:ALetter}\p{WB:Hebrew_Letter}]] [\p{WB:Format}\p{WB:Extend}]*
> HebrewOrALetterEx   = [\p{WB:HebrewLetter}\p{WB:ALetter}] [\p{WB:Format}\p{WB:Extend}]*
> NumericEx           = [\p{WB:Numeric}[\p{Blk:HalfAndFullForms}&&\p{Nd}]] [\p{WB:Format}\p{WB:Extend}]*
> KatakanaEx          = \p{WB:Katakana} [\p{WB:Format}\p{WB:Extend}]*
> MidLetterEx         = [\p{WB:MidLetter}\p{WB:MidNumLet}\p{WB:SingleQuote}] [\p{WB:Format}\p{WB:Extend}]*
> ......
> "
> What do they mean, for example HangulEx or NumericEx?
> In ClassicTokenizerImpl.jflex, NUM is expressed like this
> "
> P           = ("_"|"-"|"/"|"."|",")
> NUM        = ({ALPHANUM} {P} {HAS_DIGIT}
>           | {HAS_DIGIT} {P} {ALPHANUM}
>           | {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+
>           | {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>           | {ALPHANUM} {P} {HAS_DIGIT} ({P} {ALPHANUM} {P} {HAS_DIGIT})+
>           | {HAS_DIGIT} {P} {ALPHANUM} ({P} {HAS_DIGIT} {P} {ALPHANUM})+)
> "
> This is easy to understand. '29', '29.3', '29-3', '29_3' will all be 
> tokenized as NUMBERS.
> 
> 
> 
> I read Unicode Standard Annex #29 UNICODE TEXT SEGMENTATION, Unicode 
> Technical Standard #18 UNICODE REGULAR EXPRESSIONS, and Unicode Standard 
> Annex #44 UNICODE CHARACTER DATABASE, but they include too much information 
> and are hard to understand.
> Does anyone have a reference for these kinds of regular expressions, or can 
> anyone tell me where to find the meaning of these Unicode regular expressions?
> 
> 
> Thanks.


