[jira] Commented: (LANG-285) Wish : method unaccent

JIRA Tue, 24 Oct 2006 19:26:18 -0700

    [ http://issues.apache.org/jira/browse/LANG-285?page=comments#action_12444555 ]

Guillaume Coté commented on LANG-285:
-------------------------------------


I had a look at the Normalizer class.  If you can submit sample code that 
unaccents a String, I am willing to try it, but all I have been able to do with 
it is change the way accents are expressed.  It doesn't remove any accents.  
For example, you can transform the String:

"L'\u00e9t\u00e9 o\u00f9 j'ai d\u00fb aller \u00e0 l'\u00eele d'Anticosti 
commenca t\u00f4t"

into

"L'e\u0301te\u0301 ou\u0300 j'ai du\u0302 aller a\u0300 l'i\u0302le d'Anticosti 
commenca to\u0302t"

and the other way around, but there is no way to obtain the String 

"L'ete ou j'ai du aller a l'ile d'Anticosti commenca tot"

It is useful if you wish to search for a String with accents in a text (accents 
have to be expressed the same way in both the search String and the searched 
text).
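
To make the two forms above concrete, here is a minimal sketch.  It assumes 
java.text.Normalizer, the public API added in Java 6, not the 
sun.text.Normalizer class discussed in this comment.  The class name 
NormalizerForms and its helper methods are made up for illustration.  It only 
converts between the composed and decomposed forms and uses that to make an 
accent-sensitive search consistent.

```java
import java.text.Normalizer;

final class NormalizerForms {

    // Decompose: each accented letter becomes base letter + combining mark.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    // Compose: base letter + combining mark becomes one precomposed character.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // Search that does not depend on how accents are expressed in either
    // String, because both sides are brought to the same form first.
    static boolean containsNormalized(String text, String search) {
        return decompose(text).contains(decompose(search));
    }

    public static void main(String[] args) {
        String composed = "L'\u00e9t\u00e9 o\u00f9 j'ai d\u00fb";
        String decomposed = decompose(composed);

        // The accents are still present, only expressed differently.
        System.out.println(decomposed.equals("L'e\u0301te\u0301 ou\u0300 j'ai du\u0302"));
        System.out.println(compose(decomposed).equals(composed));

        // A composed search String is found in a decomposed text.
        System.out.println(containsNormalized(decomposed, "\u00e9t\u00e9"));
    }
}
```

A plain String.contains call on the raw Strings would miss the match in the 
last line, since the search String is composed and the text is decomposed.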


As I understand it, we cannot use sun.text.Normalizer in Commons Lang, only 
classes from the java.* packages.

I wrote the following method:

        public static String unaccent(String s) {
            StringBuffer result = new StringBuffer(s.length());

            // Look up each character in UNACCENT_MAP; characters without
            // a mapping are copied to the result unchanged.
            for (int i = 0; i < s.length(); i++) {
                String sub = s.substring(i, i + 1);
                Object o = UNACCENT_MAP.get(sub);

                if (o == null) {
                    result.append(sub);
                } else {
                    result.append(o);
                }
            }

            return result.toString();
        }

UNACCENT_MAP is a map of all accented characters in the following blocks:

Latin-1 (http://unicode.org/charts/PDF/U0080.pdf)
Latin Extended A (http://unicode.org/charts/PDF/U0100.pdf)
Latin Extended B (http://unicode.org/charts/PDF/U0180.pdf)
Latin Extended C (http://unicode.org/charts/PDF/U2C60.pdf)
Latin Extended D (http://unicode.org/charts/PDF/UA720.pdf)
Latin Extended Additional (http://unicode.org/charts/PDF/U1E00.pdf)

with the corresponding unaccented character, and of the characters from the 
block:

Combining Diacritical Marks (http://unicode.org/charts/PDF/U0300.pdf)

(List of all blocks : http://unicode.org/charts/)
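
As a rough sketch of the idea, the class below builds a very small map and 
applies it.  The handful of entries is a hypothetical excerpt chosen to cover 
the example sentence; it is not the content of the attached file.  Mapping the 
combining marks to an empty String is an assumption about how those characters 
would be handled, and the class name UnaccentSketch is made up.

```java
import java.util.HashMap;
import java.util.Map;

final class UnaccentSketch {

    // Hypothetical excerpt only; a full map would cover every block listed above.
    private static final Map<Character, String> UNACCENT_MAP = new HashMap<>();

    static {
        // Latin-1: precomposed accented letters map to their base letter.
        UNACCENT_MAP.put('\u00e0', "a"); // a with grave
        UNACCENT_MAP.put('\u00e9', "e"); // e with acute
        UNACCENT_MAP.put('\u00ee', "i"); // i with circumflex
        UNACCENT_MAP.put('\u00f4', "o"); // o with circumflex
        UNACCENT_MAP.put('\u00f9', "u"); // u with grave
        UNACCENT_MAP.put('\u00fb', "u"); // u with circumflex

        // Combining Diacritical Marks: assumed to be dropped entirely.
        UNACCENT_MAP.put('\u0300', ""); // combining grave
        UNACCENT_MAP.put('\u0301', ""); // combining acute
        UNACCENT_MAP.put('\u0302', ""); // combining circumflex
    }

    static String unaccent(String s) {
        StringBuilder result = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            String replacement = UNACCENT_MAP.get(c);
            if (replacement == null) {
                result.append(c);           // no mapping: keep the character
            } else {
                result.append(replacement); // mapped: base letter or nothing
            }
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Composed and decomposed input both lose their accents.
        System.out.println(unaccent("L'\u00e9t\u00e9 o\u00f9 j'ai d\u00fb aller \u00e0 l'\u00eele"));
        System.out.println(unaccent("L'e\u0301te\u0301 ou\u0300 j'ai du\u0302"));
    }
}
```

Because the map is keyed on single chars, this sketch only handles characters 
in the Basic Multilingual Plane, which includes all the blocks listed above.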

I attached the source code of unaccent in a separate file.  That list should 
not be too hard to maintain, since Unicode is quite stable and changes to 
Unicode are very well documented.  However, the javadoc should clearly describe 
which version of Unicode and which character charts we are covering.

> Wish : method unaccent
> ----------------------
>
>                 Key: LANG-285
>                 URL: http://issues.apache.org/jira/browse/LANG-285
>             Project: Commons Lang
>          Issue Type: New Feature
>            Reporter: Guillaume Coté
>            Priority: Minor
>         Attachments: UnnacentMap.java
>
>
> I would like to add a method that replaces accented characters with 
> unaccented ones.  For example, with the input String "L'été où j'ai dû aller 
> à l'île d'Anticosti commenca tôt", the method would return "L'ete ou j'ai du 
> aller a l'ile d'Anticosti commenca tot".
> I suggest calling that method unaccent and adding it to StringUtils.
> If we cannot cover all cases, the first version could cover only ISO-8859-1.
> If you are willing to go forward with that idea, I am willing to contribute a 
> patch.
