I want to index text in UTF-8 format. I use Latin characters.  
Here are some examples of characters (encoded in ISO-8859-1): ó, é, á, etc.

I used the iconv function, iconv('ISO-8859-1', 'ASCII//TRANSLIT', 'Animación'),
and got Animaci'on, which contains a break character for the tokenizer.
Likewise, for characters like é and á I got 'e and 'a.

One solution is to replace the `'` character with some alphanumeric pattern.
But what if I get other break characters for other Latin characters? Or maybe
I will use other UTF-8 characters, for example from German, which also produce
distinct break characters (not alphanumeric characters).
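
To make clear what output I am after, here is a minimal sketch in Python
(not the PHP I am using). It assumes that decomposing the text to Unicode NFD
and dropping the combining marks is acceptable; `fold_accents` is a
hypothetical helper name, not part of any library.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Strip diacritics: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks (accents, diaereses) have a non-zero combining class.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("Animación"))  # Animacion
print(fold_accents("é á ó ü"))    # e a o u
```

Letters that have no decomposition, such as the German ß, pass through
unchanged, so they would still need separate handling.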

I saw that some people use htmlentities, which produces only 2 break
characters ('&' and ';'). In this case I can find 2 alphanumeric patterns to
stand in for them more easily. As far as I can tell, htmlentities also encodes
all UTF-8 characters. Is this the best solution? Maybe there are some Analyzers
that I can use and that do not break on the '&' and ';' characters.
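
The entity idea could look roughly like the following sketch, again in Python
rather than PHP. The placeholders `zzamp` and `zzsemi` and the function
`encode_for_index` are hypothetical names chosen for illustration; the entity
names come from Python's standard `html.entities` table, which I assume is
close to what htmlentities uses.

```python
from html.entities import codepoint2name

# Hypothetical alphanumeric stand-ins for '&' and ';'.
AMP, SEMI = "zzamp", "zzsemi"


def encode_for_index(text: str) -> str:
    """Replace non-ASCII characters that have a named entity with an
    alphanumeric pattern, so the tokenizer sees no break characters."""
    out = []
    for ch in text:
        name = codepoint2name.get(ord(ch))
        if ord(ch) > 127 and name:
            out.append(f"{AMP}{name}{SEMI}")
        else:
            # ASCII, or no named entity: left unchanged.
            out.append(ch)
    return "".join(out)


print(encode_for_index("Animación"))  # Animacizzampoacutezzsemin
```

The same encoding would have to be applied to query terms at search time, and
characters without a named entity are left as they are.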

Maybe someone has a better solution or some opinions on this problem. 

Thank you.



