Hi Dominic:
On Thu, 07 Aug 2008, Dominic Lukas Wyler wrote:
> The edits in search_engine for accent stripping involved adding
> support for iso-8859-2 and iso-8859-15 characters.
Your changes look good.
> Removing that character from the list fixed the issue. Thank you very
> much for your help.
Good then.
> But now, if I want to keep this character as a separator (many of our
> submitted documents contain such quotes), I assume I have to proceed
> as was done with the accent stripping: have the current phrase in
> bibindex_engine.get_words_from_phrase() in unicode, as well as all the
> regexps ?
Yes, that would be the safest approach.
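For illustration, here is a minimal sketch of that approach. It is not
the actual bibindex_engine code: split_phrase() and SEPARATORS are
hypothetical names, and U+2019 only stands in for the quote character,
which is not named in this thread. The point is that a character class
applied to UTF-8 bytes matches single bytes, so a multi-byte separator
can cut through other multi-byte characters, whereas a Unicode pattern
applied to a Unicode string matches whole characters.

```python
import re

# Hypothetical separator set. U+2019 (right single quotation mark) is an
# assumption standing in for the quote character discussed above.
SEPARATORS = re.compile(u"[\\s,;.\u2019]+", re.UNICODE)

def split_phrase(phrase_utf8):
    """Split a UTF-8 encoded phrase into UTF-8 encoded words.

    The phrase is decoded first, so the pattern sees whole characters
    rather than the individual bytes of a multi-byte sequence.
    """
    phrase = phrase_utf8.decode("utf-8")
    words = [word for word in SEPARATORS.split(phrase) if word]
    # Re-encode so that callers expecting UTF-8 byte strings still work.
    return [word.encode("utf-8") for word in words]
```

Calling split_phrase() on the UTF-8 bytes of a phrase containing U+2019
yields the words on either side of the quote, with accented letters
left intact.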
P.S. Instead of doing several "binary UTF-8 -> Unicode -> binary UTF-8"
conversions, we should probably move one day towards using Unicode
strings everywhere internally, right from the run_sql() output, and
encode to UTF-8 only just before sending the results back to the
browser...
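A rough sketch of what I mean, under stated assumptions: decode_rows()
and encode_for_browser() are hypothetical helpers, not existing
functions, and the rows tuple merely imitates the kind of result set
run_sql() might hand back. Everything between the two helpers works on
Unicode strings only.

```python
def decode_rows(rows, encoding="utf-8"):
    """Decode every byte-string cell of a result set, once, at the input edge."""
    return [
        tuple(cell.decode(encoding) if isinstance(cell, bytes) else cell
              for cell in row)
        for row in rows
    ]

def encode_for_browser(text, encoding="utf-8"):
    """Encode once, at the output edge, just before sending the page."""
    return text.encode(encoding)

# Stand-in for a database result: one row with an id and a UTF-8 name.
rows = ((1, b"Erd\xc5\x91s"),)

decoded = decode_rows(rows)
name = decoded[0][1].upper()     # case mapping acts on characters, not bytes
page = encode_for_browser(name)
```

With this layout, string operations such as upper() or regexp matching
never see partial multi-byte sequences, because encoding and decoding
happen only at the two edges.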
Best regards
--
Tibor Simko ** CERN Document Server ** <http://cds.cern.ch/>