Daniel Noll sent the message below addressed to me, and CC'd to java-dev.  I 
guess CC is not good enough for ASF's mailing list software, since I received 
this message, but it never showed up on the mailing list.  Belatedly forwarding 
it to the list now. - Steve

On 01/07/2008 at 5:06 PM, Daniel Noll wrote:
> -----Original Message-----
> From: Daniel Noll [mailto:[EMAIL PROTECTED] 
> Sent: Monday, January 07, 2008 5:07 PM
> To: Steven A Rowe
> Cc: java-dev@lucene.apache.org
> Subject: Re: Fullwidth alphanumeric characters, plus a question on Korean ranges
> 
> On Tuesday 08 January 2008 05:17:28 Steven A Rowe wrote:
> > Hi Daniel,
> > 
> > I think this discussion belongs on java-dev, so I'm replying there.
> > 
> > On 01/06/2008 at 7:47 PM, Daniel Noll wrote:
> > > We discovered [in StandardTokenizer.jj] that fullwidth letters are
> > > not treated as <LETTER> and fullwidth digits are not treated as
> > > <DIGIT>.
> > 
> > IMHO, this should be fixed in the JFlex version of StandardTokenizer -
> > do you have details?
> 
> The following ranges are relevant here:
> 
>   FF10-FF19  Fullwidth digits
>   FF21-FF3A  Fullwidth Latin uppercase
>   FF41-FF5A  Fullwidth Latin lowercase
>   
> > > Line 87:
> > >        "\uffa0"-"\uffdc"
> > > 
> > >   The halfwidth Katakana "letters" (as Unicode calls them) are in <CJ>
> > >   as expected, so I'm wondering if these halfwidth Hangul "letters"
> > >   should actually be in <KOREAN> instead of <LETTER>.
> > 
> > [U+FFA0-U+FFDC] is Hangul Jamo (phonetic symbols), not precomposed
> > Hangul syllables.
> 
> I know.  The Unicode spec just happens to call Jamo "letters".
> 
> > However, I just noticed that [U+1100-U+11FF] is included both in the
> > <LETTER> and <KOREAN> sections - not good.  I think [U+1100-U+11FF]
> > should be removed from the <LETTER> definition, and left as-is in the
> > <KOREAN> section; and [U+FFA0-U+FFDC] should be moved from <LETTER>
> > to <KOREAN>.
> 
> I think so too.  I didn't notice this overlap... makes me wish the
> parser could detect character range overlaps and warn about them.
> 
> I had a bit more of a look through the Unicode blocks and found some
> more ranges which may or may not be worth considering.
> 
> These would seem to be worthy of going in <LETTER>:
>   2C00-2DDF  (multiple blocks which appear to contain more languages)
>   A720-A7FF  Latin Extended-D
>   A800-A82F  Syloti Nagri
>   A840-A87F  Phags-pa
> 
> There are these too, but they seem obscure...
>   2460-24FF  Enclosed Alphanumerics
> 
> Then you have ligatures, which if you use a normalising filter later
> may resolve to perfectly normal alphabetic characters:
>   FB00-FB4F  Alphabetic Presentation Forms
>   FB50-FBFF  Arabic Presentation Forms
> 
> Then we have some high extensions to CJK.  These are particularly
> interesting because they would be represented in UTF-16 as surrogates
> and I have no idea how to even add them to the grammar for that reason.
>   20000-2A6DF  CJK Unified Ideographs Extension B
>   2F800-2FA1F  CJK Compatibility Ideographs Supplement
> 
> There may be more hidden in the blocks which don't seem immediately
> obvious.
> 
> I wish the tokeniser could just use Character.isLetter and
> Character.isDigit instead of having to know all the ranges itself, since
> the JRE already has all this information.  Character.isLetter does
> return true for CJK characters though, so the ranges would still come in
> handy for determining what kind of letter they are.  I don't suppose
> JFlex has a way to do this...
> 
> Daniel
>
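
For anyone who wants to try the Character-based idea from the end of
Daniel's message, here is a minimal sketch (not anything that exists in
Lucene or JFlex).  It uses only java.lang.Character and
Character.UnicodeBlock from the JRE.  The class name CharKinds, the Kind
enum and the particular block-to-kind mapping are assumptions made for
illustration; the mapping is deliberately incomplete and only loosely
mirrors the <LETTER>, <DIGIT>, <KOREAN> and <CJ> names in the grammar.

```java
import java.lang.Character.UnicodeBlock;

// Hypothetical helper, not part of Lucene: classifies a code point using
// the JRE's own Unicode tables instead of hand-maintained ranges.
final class CharKinds {
    enum Kind { DIGIT, KOREAN, CJ, LETTER, OTHER }

    static Kind classify(int cp) {
        if (Character.isDigit(cp)) return Kind.DIGIT;      // includes FF10-FF19
        if (!Character.isLetter(cp)) return Kind.OTHER;

        // Halfwidth Katakana and halfwidth Hangul share one Unicode block
        // with the fullwidth Latin letters, so test their ranges directly.
        if (cp >= 0xFF66 && cp <= 0xFF9F) return Kind.CJ;
        if (cp >= 0xFFA0 && cp <= 0xFFDC) return Kind.KOREAN;

        UnicodeBlock b = UnicodeBlock.of(cp);
        if (b == UnicodeBlock.HANGUL_JAMO
                || b == UnicodeBlock.HANGUL_COMPATIBILITY_JAMO
                || b == UnicodeBlock.HANGUL_SYLLABLES) {
            return Kind.KOREAN;
        }
        if (b == UnicodeBlock.HIRAGANA
                || b == UnicodeBlock.KATAKANA
                || b == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
                || b == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B
                || b == UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS_SUPPLEMENT) {
            return Kind.CJ;
        }
        return Kind.LETTER;                                 // includes FF21-FF3A, FF41-FF5A
    }

    public static void main(String[] args) {
        int[] samples = {0xFF21, 0xFF10, 0xFFA1, 0x1100, 0x20000};
        for (int cp : samples) {
            // charCount is 2 for code points that need a surrogate pair in UTF-16
            System.out.printf("U+%04X -> %s (UTF-16 units: %d)%n",
                    cp, classify(cp), Character.charCount(cp));
        }
    }
}
```

Note that the int-based Character methods take whole code points, so
the Extension B range is handled without touching surrogates directly;
whether a generated scanner can be fed code points this way is a
separate question.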

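On the wish for overlap warnings: neither JavaCC nor JFlex is assumed
to offer this, but a standalone check is easy to sketch.  The class
RangeOverlaps and its Range type below are hypothetical names, and the
ranges would have to be copied out of the grammar by hand.  Two
inclusive ranges overlap when each one starts no later than the other
ends.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone check, not a feature of any parser generator.
final class RangeOverlaps {
    static final class Range {
        final String token;
        final int start;
        final int end;   // inclusive

        Range(String token, int start, int end) {
            this.token = token;
            this.start = start;
            this.end = end;
        }
    }

    /** Returns one warning per pair of ranges that share a code point. */
    static List<String> findOverlaps(List<Range> ranges) {
        List<String> warnings = new ArrayList<>();
        for (int i = 0; i < ranges.size(); i++) {
            for (int j = i + 1; j < ranges.size(); j++) {
                Range a = ranges.get(i);
                Range b = ranges.get(j);
                if (a.start <= b.end && b.start <= a.end) {
                    warnings.add(String.format(
                            "%s [U+%04X-U+%04X] overlaps %s [U+%04X-U+%04X]",
                            a.token, a.start, a.end, b.token, b.start, b.end));
                }
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        List<Range> ranges = new ArrayList<>();
        // The duplicated Hangul Jamo range discussed in the thread
        ranges.add(new Range("LETTER", 0x1100, 0x11FF));
        ranges.add(new Range("KOREAN", 0x1100, 0x11FF));
        ranges.add(new Range("LETTER", 0xFFA0, 0xFFDC));
        for (String warning : findOverlaps(ranges)) {
            System.out.println(warning);
        }
    }
}
```

The pairwise comparison is quadratic, which should be fine for the few
dozen ranges in a tokenizer grammar.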
 

