Re: [jira] Updated: (LUCENE-692) Hangul Jamo (Korean) support in StandardTokenizer.jj

Steven Rowe Fri, 20 Oct 2006 11:09:47 -0700

Otis Gospodnetic wrote:
> I see it only in 1 place (Korean):

U+1100-U+11FF is included in the U+0100-U+1FFF range in LETTER:


| < #LETTER:       // unicode letters
      [
       "\u0041"-"\u005a",
       "\u0061"-"\u007a",
       "\u00c0"-"\u00d6",
       "\u00d8"-"\u00f6",
       "\u00f8"-"\u00ff",
---------------------------
       "\u0100"-"\u1fff",
---------------------------
       "\uffa0"-"\uffdc"
      ]
  >
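
The overlap can be confirmed with a few lines of Java. This is only an illustrative sketch: the class and method names (JamoRangeCheck, contains) are made up for this message and are not part of Lucene, and the range endpoints are copied from the grammar excerpts quoted in this thread.

```java
class JamoRangeCheck {
    // True when the closed range [innerLo, innerHi] lies entirely
    // inside the closed range [outerLo, outerHi].
    static boolean contains(int outerLo, int outerHi, int innerLo, int innerHi) {
        return outerLo <= innerLo && innerHi <= outerHi;
    }

    public static void main(String[] args) {
        // "\u0100"-"\u1fff" from LETTER vs. "\u1100"-"\u11ff" (Hangul Jamo)
        System.out.println(contains(0x0100, 0x1fff, 0x1100, 0x11ff)); // prints true

        // "\u0100"-"\u1fff" from LETTER vs. "\uac00"-"\ud7af" (Hangul Syllables)
        System.out.println(contains(0x0100, 0x1fff, 0xac00, 0xd7af)); // prints false
    }
}
```

So a grep for the literal "11ff" finds only the KOREAN entry, because LETTER covers the same code points through a wider range whose endpoints are written differently.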


> $ grep 11ff src/java/org/apache/lucene/analysis/standard/StandardTokenizer.jj
>        "\u1100"-"\u11ff"      // Hangul Jamo
> 
> Maybe I'm not seeing something...
> 
> Otis
> 
> ----- Original Message ----
> From: Steven Rowe <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, October 20, 2006 1:22:48 PM
> Subject: Re: [jira] Updated: (LUCENE-692) Hangul Jamo (Korean) support in StandardTokenizer.jj
> 
> Joe Shaw (JIRA) wrote:
>>      [ http://issues.apache.org/jira/browse/LUCENE-692?page=all ]
> [snip]
>> One of our users reported their inability to search some Korean
>> strings.  This is because the Hangul Jamo Unicode block is not
>> included in the StandardTokenizer.jj file.
>> I'm attaching a patch which fixes this, from Young-Ho Cha.
> 
> This has already been addressed by a patch committed by Otis to fix the
> following issue (in August 2006, after the 2.0.0 release):
> 
>    https://issues.apache.org/jira/browse/LUCENE-478
> 
> Here is the Korean section from the trunk version of StandardTokenizer.jj:
> 
> | < KOREAN:                                          // Korean
>       [
>        "\uac00"-"\ud7af",     // Hangul Syllables
>        "\u1100"-"\u11ff"      // Hangul Jamo
>        // "\uac00"-"\ud7a3"
>       ]
>   >
> 
> Actually, there is an oddity here -- Otis, you last committed a change
> to this file -- do you know why the Hangul Jamo range is included twice,
> once in the KOREAN section and again in the LETTER section?
> 
> | < #LETTER:       // unicode letters
>       [
>        "\u0041"-"\u005a",
>        "\u0061"-"\u007a",
>        "\u00c0"-"\u00d6",
>        "\u00d8"-"\u00f6",
>        "\u00f8"-"\u00ff",
>        "\u0100"-"\u1fff",
>        "\uffa0"-"\uffdc"
>       ]
>   >
> 
> Steve
> 
>> Joe Shaw updated LUCENE-692:
>> ----------------------------
>>
>>     Attachment: lucene-hangul-jamo.patch
>>
>> Patch to StandardTokenizer.jj which fixes this.
>>
>>> Hangul Jamo (Korean) support in StandardTokenizer.jj
>>> ----------------------------------------------------
>>>
>>>                 Key: LUCENE-692
>>>                 URL: http://issues.apache.org/jira/browse/LUCENE-692
>>>             Project: Lucene - Java
>>>          Issue Type: Improvement
>>>          Components: Analysis
>>>    Affects Versions: 1.9, 2.0.0, 2.1, 2.0.1
>>>            Reporter: Joe Shaw
>>>            Priority: Minor
>>>         Attachments: lucene-hangul-jamo.patch


