Re: Removing diacritics with ISOLatin1AccentFilter

Simon Willnauer Fri, 24 Jul 2009 03:15:50 -0700

On Fri, Jul 24, 2009 at 11:41 AM, luther blisset<sabri.br...@gmail.com> wrote:
>
> Hi folks,
> I just upgrading Hibernate Search library of my app and so I had to upgrade
> Lucene too and pass from 2.2 to 2.4 version.
> In Lucene 2.4 the ISOLatin1AccentFilter class has changed and I can't figure
> how it works.
> I use a TwoWayFieldBridge to index the data and this is my set method:
>
> public void set(String s, Object o, Document document, Field.Store store,
> Field.Index index, Float aFloat){
>
>        //MyObject has a field name
>        MyObject objectToIndex;
>
>        //casting from Object to MyObject
>        try{
>            objectToIndex = MyObject.class.cast(o);
>        }catch(ClassCastException cEx ){}
>
>
>
>        if (objectToIndex.getName() != null) {
>
>            ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(new
> StandardTokenizer(new StringReader(objectToIndex.getName())));
>            filter.removeAccents(objectToIndex.getName().toCharArray(),
> objectToIndex.getName().length());
>            Field name = new Field( "name",
> String.valueOf(objectToIndex.getName()).toLowerCase() , Field.Store.YES,
> Field.Index.UN_TOKENIZED );
>
>            document.add(name);
>        }
> }
>
I do not really understand what you are trying to do. do you just
wanna remove the accents from the string and index it without passing
it through an analyzer?! (Field.Index.UN_TOKENIZED will not pass the
field value to an analyzer).
do you wanna index this without an analyzer?!


If you pass an array to ISOLantin1AccentFilter#removeAccents() the
processed chars will be written to an private internal char array
inside the ISOLantin1AccentFilter. You can not use the removeAccents
method just removing the accents. what you could do as a dirty
workaround is the following:
String foo = "HÄllo HÄllo HÄllo HÄllo HÄllo";
  ISOLatin1AccentFilter filter = new ISOLatin1AccentFilter(
      new Tokenizer(new StringReader(foo)){
        private boolean isRead = false;
        public Token next(final Token reusableToken) throws IOException {
          if(isRead){
            return null;
          }
          BufferedReader reader = new BufferedReader(this.input);
          StringBuilder builder = new StringBuilder();

         char[] buffer = new char[1024];
         int read = -1;
         while((read = reader.read(buffer)) > 0){
           builder.append(buffer, 0, read);
         }
         reusableToken.setTermText(builder.toString());
         isRead = true;
         return reusableToken;
        }
  });
    Token t = filter.next();
    String foo_without_accents = t.term();
    System.out.println(foo_without_accents);
yields: HAllo HAllo HAllo HAllo HAllo


simon
>
> but it doesn't work. And if pass an accented word for the property
> objectToIndex.getName(), it remains with accent :(
> I think there is something wrong in my code when I create the new instance
> of ISOLatin1AccentFilter  but I can' t get it works properly.
> Could someone help me?
> thanks a lot
> --
> View this message in context: 
> http://www.nabble.com/Removing-diacritics-with-ISOLatin1AccentFilter-tp24641618p24641618.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

Re: Removing diacritics with ISOLatin1AccentFilter

Reply via email to