> '&' is used as a part of query syntax.
> But Analyzer is used after query recognition to process lexemes or
> phrases. So htmlentities() may be used.

I will try to replace it with some alpha-digit pattern.

> On the other hand, it doesn't help with a problem we have with
> full UTF-8 support.
> The index manipulation engine can work with UTF-8 characters, but we
> can't recognize whether a character is alpha, digit or any other type.
> Thus input text can't be tokenized correctly.
> It doesn't depend on the format (UTF-8, HTML encoded, URL encoded and
> so on).
> The current solution is based on iconv translation intelligence. It,
> in principle, should translate white spaces to ASCII white space and
> letters to ASCII letters.

In my case iconv failed to convert UTF-8 letters to ASCII letters.
I think I will create a list with all non-ASCII Latin characters,
together with some non-ASCII patterns. This is the most elegant way to
implement this. After this I will call iconv() on the replaced text.
In this way I will help the iconv intelligence. :)

> I don't expect that we will have UTF-8 compatible ctype_alpha() and
> ctype_digit() functions.
> Thus the only way I see now is to treat all non-ASCII characters as
> letters and use ctype_...() for ASCII characters.
> I have seen a lot of UTF-8 support requests, so I plan to implement
> this soon.
> But it's an open question for me whether this behavior should be the
> default or not. From one point of view it's a solution. From the
> other, it should be used with care (non-letters may be treated as
> part of words).
>
> > I want to index text in UTF-8 format. I use Latin characters.
> > Here are some examples of characters (encoded in ISO-8859-1):
> > ó, é, á, etc.
> >
> > I used the iconv function
> > iconv('ISO-8859-1', 'ASCII//TRANSLIT', 'Animación')
> > and I got Animaci'on, which also contains some break
> > characters for the tokenizer. Also, for characters like é, á I got
> > 'e, 'a.
> >
> > The solution is to replace the `'` character with some alpha-digit
> > pattern. But what if I get other break characters for other Latin
> > characters?
> > Or maybe I will use other UTF-8 characters, from the German
> > language, which also produce some distinct break characters (not
> > alpha-digit characters).
> >
> > I saw that some people used htmlentities, which produces only 2
> > break characters ('&' and ';'). In this case I can find 2
> > alpha-digit patterns to match them more easily. And htmlentities
> > encodes all UTF-8 characters. Is this the best solution? Maybe
> > there are some Analyzers which I can use and which do not break on
> > '&' and ';' characters.
> >
> > Maybe someone has a better solution or some opinions on this problem.
> >
> > Thank you.
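
For reference, here is a rough PHP sketch of the two ideas discussed in this thread. It is only an illustration under stated assumptions, not the framework's implementation: the function names foldLatinToAscii() and tokenizeUtf8() are hypothetical, the replacement list is a small illustrative subset rather than a complete table, and the source file is assumed to be saved as UTF-8. The first function applies a hand-made replacement list before calling iconv(); the second treats every non-ASCII byte as a letter and uses ctype_alnum() only for ASCII bytes.

```php
<?php
// Hypothetical helper: fold known non-ASCII Latin letters to ASCII first,
// then let iconv() try to transliterate whatever is left.
function foldLatinToAscii(string $text): string
{
    // Illustrative subset only; a real list would need many more entries.
    $map = [
        'á' => 'a', 'é' => 'e', 'í' => 'i', 'ó' => 'o', 'ú' => 'u',
        'ñ' => 'n', 'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 'ss',
    ];
    $mapped = strtr($text, $map);

    // //TRANSLIT output depends on the platform's iconv implementation,
    // so fall back to the mapped text if the conversion fails.
    $ascii = @iconv('UTF-8', 'ASCII//TRANSLIT', $mapped);

    return $ascii === false ? $mapped : $ascii;
}

// Hypothetical helper: split UTF-8 text into tokens, treating every byte
// >= 0x80 as a letter and asking ctype_alnum() only about ASCII bytes.
function tokenizeUtf8(string $text): array
{
    $tokens  = [];
    $current = '';
    $length  = strlen($text);

    for ($i = 0; $i < $length; $i++) {
        $byte = $text[$i];
        if (ord($byte) >= 0x80 || ctype_alnum($byte)) {
            $current .= $byte;          // part of a word
        } elseif ($current !== '') {
            $tokens[] = $current;       // an ASCII non-alphanumeric byte ends the token
            $current  = '';
        }
    }
    if ($current !== '') {
        $tokens[] = $current;
    }

    return $tokens;
}

print_r(tokenizeUtf8('Animación & diseño; 2 más'));
print_r(tokenizeUtf8('«hola»'));
echo foldLatinToAscii('Animación'), "\n";
```

The second print_r() call shows the caveat mentioned above: the guillemets are non-ASCII, so they are kept as part of the word instead of acting as break characters.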