Re: Regular expression for non-ascii chars, advanced search

Georg Baum Sun, 31 Mar 2013 12:36:02 -0700

Helge Hafting wrote:

> On 29 March 2013 13:38, Kornel Benko wrote:
>> I seem unable to find strings using non ascii chars (e.g. latin2)
>>
>> (Please try to use UTF-8 encoding to read this mail)
>>
>> The regex search string may be "pou.i.", so I was expecting to find
>> e.g. "použiť". I have to use '..' to match each of these single chars
>> ("pou..i..").
>>
> 
> I believe this happens because the "ž" is encoded as two bytes when
> using UTF-8. And I guess the regexp matching software in use works on
> "bytes", not "characters". So you are forced to use two periods to
> match the two bytes in "ž", and more if you want to match Chinese
> characters.
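
For reference, the byte-versus-character effect described above can be
reproduced outside LyX. The following is a minimal Python sketch, not LyX
code; it only shows what a byte-oriented matcher would do with the word from
the report, and says nothing about what LyX actually does internally.

```python
import re

# The word from the report, written with escapes so that the
# precomposed code points U+017E and U+0165 are unambiguous.
word = "pou\u017ei\u0165"
pattern = "pou.i."                   # one '.' per accented letter

# Matching on characters (str): each '.' consumes one code point.
char_match = re.fullmatch(pattern, word) is not None

# Matching on UTF-8 bytes: the two accented letters take two bytes each,
# so the same pattern fails and two dots per letter are needed.
data = word.encode("utf-8")
byte_match = re.fullmatch(pattern.encode("ascii"), data) is not None
byte_match_doubled = re.fullmatch(b"pou..i..", data) is not None

print(len(word), len(data))          # 6 8
print(char_match, byte_match, byte_match_doubled)  # True False True
```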


I am not sure this has anything to do with UTF-8. The expanded string looks
like it was expanded for LaTeX, which seems quite wrong to me in the context
of searching. Why is this done?

> The solution would be regexp matching software that is unicode-aware.
> A link to such software:
> http://abies.nmsu.edu/pkgsrc/boost/libs/regex/doc/icu_strings.html

Indeed. A long time ago I made a similar comment in LaTeXFeatures.cpp.
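
As an illustration of what "unicode-aware" matching means in practice, here
is a small Python sketch. It is not the Boost/ICU interface from the link
above, and the helper name find_words is made up for this example. It
assumes the input is valid UTF-8, and it normalizes to NFC because even a
character-based matcher sees a decomposed letter as two code points.

```python
import re
import unicodedata

def find_words(pattern, raw):
    """Decode UTF-8 input and match on code points rather than bytes."""
    text = raw.decode("utf-8")
    # NFC composes a base letter plus combining caron into one code point,
    # so that a single '.' covers one accented letter.
    text = unicodedata.normalize("NFC", text)
    return re.findall(pattern, text)

# Hypothetical input: the word from the report, stored in decomposed form
# (z + U+030C, t + U+030C) and encoded as UTF-8.
decomposed = "pouz\u030cit\u030c"
raw = decomposed.encode("utf-8")

# Without normalization the single-dot pattern finds nothing.
unnormalized_hits = re.findall(r"pou.i.", decomposed)

# With decoding and NFC normalization the word is found.
hits = find_words(r"pou.i.", raw)
print(unnormalized_hits, hits)
```

The normalization step is a design choice of this sketch: matching on code
points alone is not enough when the text may contain combining sequences.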


Georg
