[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

DM Smith (JIRA) Sun, 29 Mar 2009 09:14:11 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12693579#action_12693579
 ]


DM Smith commented on LUCENE-1581:
----------------------------------

bq.Why do this?
Lucene has a bias toward English texts and does not have a fundamental 
architecture focused on internationalization and localization. IMHO, it should.

Java does not implement Unicode well and does not keep abreast with it's 
changes. It's not that ICU is the right solution. It is *a* robust solution.

bq. What prevents you in your application from creating such a filter?
Nothing at all. But I think that proper behavior regarding Unicode and locales 
is something that many want. Especially for non-English indexes. As such it 
belongs with Lucene not individual projects.

With that in mind, I think it would be great if Lucene were fully 
internationalized and localized, at least from a fundamental architecture 
perspective. (There is a separate issue on what core and contrib should be. I'm 
not clear where "analyzers" fall wrt that.)

As an implementation, if ICU is present it is used, with potential performance 
impacts, if not behavior degrades predictably and gracefully. This would create 
a quasi dependency not a hard one.

> LowerCaseFilter should be able to be configured to use a specific locale.
> -------------------------------------------------------------------------
>
>                 Key: LUCENE-1581
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1581
>             Project: Lucene - Java
>          Issue Type: Improvement
>            Reporter: Digy
>
> //Since I am a .Net programmer, Sample codes will be in c# but I don't think 
> that it would be a problem to understand them.
> //
> Assume an input text like "İ" and and analyzer like below
> {code}
>       public class SomeAnalyzer : Analyzer
>       {
>               public override TokenStream TokenStream(string fieldName, 
> System.IO.TextReader reader)
>               {
>                       TokenStream t = new SomeTokenizer(reader);
>                       t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
>                       t = new LowerCaseFilter(t);
>                       return t;
>               }
>         
>       }
> {code}
>       
> ASCIIFoldingFilter will return "I" and after, LowerCaseFilter will return
>       "i" (if locale is "en-US") 
>       or 
>       "ı' if(locale is "tr-TR") (that means,this token should be input to 
> another instance of ASCIIFoldingFilter)
> So, calling LowerCaseFilter before ASCIIFoldingFilter would be a solution, 
> but a better approach can be adding
> a new constructor to LowerCaseFilter and forcing it to use a specific locale.
> {code}
>     public sealed class LowerCaseFilter : TokenFilter
>     {
>         /* +++ */System.Globalization.CultureInfo CultureInfo = 
> System.Globalization.CultureInfo.CurrentCulture;
>         public LowerCaseFilter(TokenStream in) : base(in)
>         {
>         }
>         /* +++ */  public LowerCaseFilter(TokenStream in, 
> System.Globalization.CultureInfo CultureInfo) : base(in)
>         /* +++ */  {
>         /* +++ */      this.CultureInfo = CultureInfo;
>         /* +++ */  }
>               
>         public override Token Next(Token result)
>         {
>             result = Input.Next(result);
>             if (result != null)
>             {
>                 char[] buffer = result.TermBuffer();
>                 int length = result.termLength;
>                 for (int i = 0; i < length; i++)
>                     /* +++ */ buffer[i] = 
> System.Char.ToLower(buffer[i],CultureInfo);
>                 return result;
>             }
>             else
>                 return null;
>         }
>     }
> {code}
> DIGY

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org

[jira] Commented: (LUCENE-1581) LowerCaseFilter should be able to be configured to use a specific locale.

Reply via email to