Re: Re: Regular expression for non-ascii chars, advanced search

2013-04-07 Thread Kornel Benko
On Sunday, 7 April 2013 at 01:36:52, Tommaso Cucinotta wrote:

> So,
> 
> I came up with this trivial patch for the kind of scenario you proposed.
> Simply, export a regexp inset using the text, rather than math, "encoding"
> rules. AFAICS, one might usefully want to write text (and special chars) in
> a regexp context.

For normal text it is wonderful :).

> However, it's not conclusive, nor can it be.
> 
> Imagine I write the word you were mentioning (použiť) both as regular text in
> a document, AND within a math inset.

When the word is written as textrm inside math, the new algorithm finds it.
Previously this was not the case.

If I uncheck 'ignore format', then I can no longer find any string, regular
expression or not, with or without non-ASCII :(.
But the behaviour seems undefined. The next time I searched with a string
(copy & paste), it was found (iff it was all ASCII).
Further tests confused me even more: I could not see any pattern in when
a string is found and when it is not.

And searching without ignoring format is still slow.

> Then, I search for it through Advanced Find.
> 
> If I enter the word as simple text in the Find box, then it finds only the
> text counterpart in the document, but it cannot match the math one. If I
> enter the word in math mode, then it's the other way round. If I enter the
> word in regexp mode, then I match one or the other depending on whether you
> applied my attached patch :-).
> 
> Now, such a behaviour might have been OK for Ignore Format unchecked, but it
> happens when it's checked as well, and it shouldn't happen.
> 
> This is probably one of the many other Advanced Find scenarios that can be
> addressed by modifying the export/write/latex logic introducing a special
> export mode that carries along the matching options, and lets insets export
> what makes sense and is appropriate considering them, rather than trying to
> fix the situation through impossible regexp post-processing after the export
> (the current implementation is very fragile: if one tries to search for
> "{{{", or "\regexp", or a combination of them, or similar, I don't know what
> can happen). Such a focused export for advanced F&R should also speed up
> the operation tremendously.
> 
> Comments?
> 
> T.

I think a dedicated export format would be best.

Kornel



Re: Regular expression for non-ascii chars, advanced search

2013-04-07 Thread Tommaso Cucinotta
On 07/04/13 09:34, Kornel Benko wrote:
> On Sunday, 7 April 2013 at 01:36:52, Tommaso Cucinotta wrote:
> 
>> I came up with this trivial patch for the kind of scenario you proposed.
>> Simply, export a regexp inset using the text, rather than math, "encoding"
>> rules. AFAICS, one might usefully want to write text (and special chars) in
>> a regexp context.

> For normal text it is wonderful :).

So, it's in: [6a3792bd/lyxgit].

In the end, finding non-ASCII in regular text seems more needed than finding
it in maths.

If you spot other issues after this commit, please let me know (except for
finding non-ASCII in maths through regexps, which we know no longer works;
perhaps one day I'll find some time to rework the engine).

Thanks,

T.


Re: Regular expression for non-ascii chars, advanced search

2013-04-06 Thread Tommaso Cucinotta
On 03/04/13 22:40, Kornel Benko wrote:
> I want to find (as regular expression) the string "použiť". In TeX, it looks
> like "použi\v{t}".
> But the search string (as it is displayed while filling in the search form)
> is "\regexp{pou\check{z} it\mkern-5mu\mathchar19\endregexp{}}".

Ok, I could reproduce thanks to the file you sent me. From a first impression,
it seems to me that the problem is NOT the regexp matching engine. Indeed, when
searching with regexp but with no non-ASCII char in the regexp, it works fine
and it finds for example

  použ\regexp{i\endregexp{}}ť

However, when the regexp contains non-ASCII chars, then it's misinterpreted
in the text conversion. I suspect it's due to the fact that the regexp inset
has been essentially derived from a math inset, so it's not expecting any
non-ASCII stuff therein, and it's not applying the
regular non-ASCII chars mangling that is instead done correctly for text.
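
To illustrate the suspicion above, here is a minimal C++ sketch. It is not
LyX code: Mode, Spelling, exportChar and exportWord are hypothetical names,
and the table holds only the two spellings of "ť" quoted in this thread. It
shows how the same character can be exported with different LaTeX spellings
depending on the mode, so that a search string exported under math rules no
longer matches document text exported under text rules.

```cpp
#include <map>
#include <string>

// Hypothetical export modes, standing in for text and math encoding rules.
enum class Mode { Text, Math };

// One LaTeX spelling per mode for a single character.
struct Spelling {
    std::string text;
    std::string math;
};

// Export one code point. Only "ť" (U+0165) is in the table; its two
// spellings are the ones quoted in this thread. The sketch assumes that
// every other input character is plain ASCII and passes it through.
std::string exportChar(char32_t c, Mode mode)
{
    static const std::map<char32_t, Spelling> table = {
        {U'\u0165', {"\\v{t}", "t\\mkern-5mu\\mathchar19"}},
    };
    const auto it = table.find(c);
    if (it != table.end())
        return mode == Mode::Text ? it->second.text : it->second.math;
    return std::string(1, static_cast<char>(c));
}

// Export a whole word, one code point at a time.
std::string exportWord(const std::u32string& word, Mode mode)
{
    std::string out;
    for (char32_t c : word)
        out += exportChar(c, mode);
    return out;
}
```

With this table, exportWord(U"i\u0165", Mode::Text) and
exportWord(U"i\u0165", Mode::Math) give different strings for the same two
letters, which is the kind of mismatch described above.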

I'll try to look into it.

T.



Re: Re: Regular expression for non-ascii chars, advanced search

2013-04-06 Thread Kornel Benko
On Saturday, 6 April 2013 at 19:32:58, Tommaso Cucinotta wrote:

> On 03/04/13 22:40, Kornel Benko wrote:
> > I want to find (as regular expression) the string "použiť". In TeX, it
> > looks like "použi\v{t}".
> > But the search string (as it is displayed while filling in the search
> > form) is "\regexp{pou\check{z} it\mkern-5mu\mathchar19\endregexp{}}".
> 
> Ok, I could reproduce thanks to the file you sent me. From a first impression,
> it seems to me that the problem is NOT the regexp matching engine. Indeed,
> when searching with regexp but with no non-ASCII char in the regexp, it works
> fine and it finds for example
> 
>   použ\regexp{i\endregexp{}}ť
> 
> However, when the regexp contains non-ASCII chars, then it's misinterpreted
> in the text conversion. I suspect it's due to the fact that the regexp inset
> has been essentially derived from a math inset, so it's not expecting any
> non-ASCII stuff therein, and it's not applying the
> regular non-ASCII chars mangling that is instead done correctly for text.
> 
> I'll try to look into it.
> 
>   T.

Yes, all non-ascii characters are treated as being part of math. Therefore
they are replaced according to our unicode-file.

In most cases we respect the type of InsetMathHull (for regex it is hullRegexp).
Somewhere we split the search string at non-ascii, but I failed to find where.

Kornel



Re: Regular expression for non-ascii chars, advanced search

2013-04-06 Thread Tommaso Cucinotta
On 06/04/13 20:04, Kornel Benko wrote:
>> However, when the regexp contains non-ASCII chars, then it's misinterpreted
>> in the text conversion. I suspect it's due to the fact that the regexp inset
>> has been essentially derived from a math inset, so it's not expecting any
>> non-ASCII stuff therein, and it's not applying the
>> regular non-ASCII chars mangling that is instead done correctly for text.
>> I'll try to look into it.

So,

I came up with this trivial patch for the kind of scenario you proposed.
Simply, export a regexp inset using the text, rather than math, "encoding"
rules. AFAICS, one might usefully want to write text (and special chars) in
a regexp context.

However, it's not conclusive, nor can it be.

Imagine I write the word you were mentioning (použiť) both as regular text in
a document, AND within a math inset.

Then, I search for it through Advanced Find.

If I enter the word as simple text in the Find box, then it finds only the text
counter-part in the document, but it cannot match the math one. If I enter the
word in math mode, then it's the other way round. If I enter the word in regexp
mode, then I match one or the other depending on whether you applied my attached
patch :-).

Now, such a behaviour might have been OK for Ignore Format unchecked, but it
happens when it's checked as well, and it shouldn't happen.

This is probably one of the many other Advanced Find scenarios that can be
addressed by modifying the export/write/latex logic introducing a special
export mode that carries along the matching options, and lets insets export
what makes sense and is appropriate considering them, rather than trying to
fix the situation through impossible regexp post-processing after the export
(the current implementation is very fragile: if one tries to search for
"{{{", or "\regexp", or a combination of them, or similar, I don't know what
can happen). Such a focused export for advanced F&R should also speed up
the operation tremendously.
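
A minimal C++ sketch of that idea, under stated assumptions: FindOptions,
WordInset and exportForSearch are hypothetical names made up for this
illustration and do not exist in LyX. The point is only that the exporter
receives the matching options and decides what to emit, so nothing has to
be repaired by regexp post-processing afterwards.

```cpp
#include <string>

// Hypothetical matching options that the search dialog would hand to the
// exporter.
struct FindOptions {
    bool ignoreFormat = true;
};

// Hypothetical inset: a word plus a flag telling whether it sits in math.
struct WordInset {
    std::string word;
    bool inMath;
};

// Search-oriented export. With ignoreFormat set, text and math export the
// same plain characters, so one search string can match both. Without it,
// the math context is kept as markup (a "$...$" wrapper in this sketch).
std::string exportForSearch(const WordInset& inset, const FindOptions& opt)
{
    if (opt.ignoreFormat)
        return inset.word;
    return inset.inMath ? "$" + inset.word + "$" : inset.word;
}
```

Exporting the same word once as text and once as math then yields identical
strings when ignoreFormat is true and different strings when it is false,
which is the behaviour asked for above.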

Comments?

T.

diff --git a/src/mathed/InsetMathHull.cpp b/src/mathed/InsetMathHull.cpp
index 7002a9b0..4cfdbca8 100644
--- a/src/mathed/InsetMathHull.cpp
+++ b/src/mathed/InsetMathHull.cpp
@@ -1240,7 +1240,8 @@ docstring InsetMathHull::eolString(row_type row, bool fragile, bool latex,
 
 void InsetMathHull::write(WriteStream & os) const
 {
-	ModeSpecifier specifier(os, MATH_MODE);
+	ModeSpecifier specifier(os,
+		type_ == hullRegexp ? TEXT_MODE : MATH_MODE);
 	header_write(os);
 	InsetMathGrid::write(os);
 	footer_write(os);
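
The mode selection in this patch can be tried in isolation with the
following self-contained C++ sketch. It assumes that ModeSpecifier acts as
a scope guard that sets the stream mode for the duration of write(); the
thread does not show its definition, so Stream, ScopedMode, HullType and
writeHull below are hypothetical stand-ins, not the LyX classes.

```cpp
#include <string>

enum class Mode { Text, Math };
enum class HullType { Equation, Regexp };

// Stand-in for the output stream: it remembers the current mode and
// collects the output.
struct Stream {
    Mode mode = Mode::Text;
    std::string out;
};

// Scope guard: sets the mode on construction and restores the previous
// mode on destruction.
class ScopedMode {
public:
    ScopedMode(Stream& os, Mode mode) : os_(os), saved_(os.mode)
    {
        os_.mode = mode;
    }
    ~ScopedMode() { os_.mode = saved_; }
    ScopedMode(const ScopedMode&) = delete;
    ScopedMode& operator=(const ScopedMode&) = delete;

private:
    Stream& os_;
    Mode saved_;
};

// Same decision as in the patch: a regexp hull is written in text mode,
// every other hull in math mode. The body only records the active mode.
void writeHull(Stream& os, HullType type)
{
    ScopedMode guard(os, type == HullType::Regexp ? Mode::Text : Mode::Math);
    os.out += (os.mode == Mode::Text) ? "[text]" : "[math]";
}
```

Calling writeHull first with HullType::Regexp and then with
HullType::Equation on the same Stream appends "[text]" and then "[math]",
and the stream is back in its original mode after each call.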


Re: Regular expression for non-ascii chars, advanced search

2013-04-03 Thread Tommaso Cucinotta
On 02/04/13 10:28, Jean-Marc Lasgouttes wrote:
>> I am not sure if it has anything to do with utf8. The expanded string looks
>> like it is expanded for LaTeX. This looks quite wrong to me in context of
>> searching. Why is this done?
> 
> This is how advanced search works.

Confirmed. This is how the current implementation works. Nonetheless,
alternative (and more efficient) implementations of the idea might be possible.

T.



Re: Re: Regular expression for non-ascii chars, advanced search

2013-04-03 Thread Kornel Benko
On Wednesday, 3 April 2013 at 20:28:59, Tommaso Cucinotta wrote:

> On 02/04/13 10:28, Jean-Marc Lasgouttes wrote:
> >> I am not sure if it has anything to do with utf8. The expanded string looks
> >> like it is expanded for LaTeX. This looks quite wrong to me in context of
> >> searching. Why is this done?
> > 
> > This is how advanced search works.
> 
> Confirmed. This is how the current implementation works. Nonetheless,
> alternative (and more efficient) implementations of the idea might be
> possible.
> 
>   T.

I disagree (not confirming).

I want to find (as regular expression) the string "použiť". In TeX, it looks
like "použi\v{t}".
But the search string (as it is displayed while filling in the search form) is
"\regexp{pou\check{z} it\mkern-5mu\mathchar19\endregexp{}}".

It looks more as if it were interpreting mathematical input.

Kornel



Re: Regular expression for non-ascii chars, advanced search

2013-04-03 Thread Tommaso Cucinotta
I cannot enter those chars on my keyboard. Can you provide a sample LyX file
where I can search and see the problem?

T.

On 03/04/13 22:40, Kornel Benko wrote:
> On Wednesday, 3 April 2013 at 20:28:59, Tommaso Cucinotta wrote:
>
>> On 02/04/13 10:28, Jean-Marc Lasgouttes wrote:
>> >> I am not sure if it has anything to do with utf8. The expanded string
>> >> looks like it is expanded for LaTeX. This looks quite wrong to me in
>> >> context of searching. Why is this done?
>> >
>> > This is how advanced search works.
>>
>> Confirmed. This is how the current implementation works. Nonetheless,
>> alternative (and more efficient) implementations of the idea might be
>> possible.
>>
>> T.
>
> I disagree (not confirming).
>
> I want to find (as regular expression) the string "použiť". In TeX, it looks
> like "použi\v{t}".
> But the search string (as it is displayed while filling in the search form)
> is "\regexp{pou\check{z} it\mkern-5mu\mathchar19\endregexp{}}".
>
> It looks more as if it were interpreting mathematical input.
>
> Kornel



Re: Regular expression for non-ascii chars, advanced search

2013-04-02 Thread Jean-Marc Lasgouttes

31/03/2013 21:35, Georg Baum:

>> I believe this happens because the "ž" is encoded as two bytes when
>> using UTF-8. And I guess the regexp matching software in use works on
>> "bytes", not "characters". So, you are forced to use two periods to
>> match the two bytes in "ž". And more, if you want to match Chinese
>> characters.
>
> I am not sure if it has anything to do with utf8. The expanded string looks
> like it is expanded for LaTeX. This looks quite wrong to me in context of
> searching. Why is this done?


This is how advanced search works. The argument is that implementing
regexes would otherwise have been difficult. It would be nice to have
a template-based regex engine that can be applied to a LyX Paragraph.

Concerning the difficulty of using a wildcard for accented characters, I
do think that utf8 is to blame.


JMarc


Re: Regular expression for non-ascii chars, advanced search

2013-03-31 Thread Helge Hafting

On 29 March 2013 13:38, Kornel Benko wrote:

> I seem unable to find strings using non-ASCII chars (e.g. latin2).
>
> (Please try to use UTF-8 encoding to read this mail)
>
> The regex search string may be "pou.i.", so I was expecting to find
> e.g. "použiť". I have to use '..' to find each of these single chars
> ("pou..i..").



I believe this happens because the "ž" is encoded as two bytes when 
using UTF-8. And I guess the regexp matching software in use works on 
"bytes", not "characters". So, you are forced to use two periods to 
match the two bytes in "ž". And more, if you want to match Chinese 
characters.
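
The byte-versus-character effect described here can be reproduced with the
short C++ sketch below. It only illustrates this hypothesis with std::regex
on a UTF-8 std::string; it makes no claim about which regex engine or string
type LyX uses, and other replies in this thread attribute the problem to the
LaTeX-style export rather than to UTF-8. The names countCodePoints,
bytewiseMatch and kWord are made up for the example.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Count Unicode code points in a UTF-8 string by skipping continuation
// bytes, which always have the bit pattern 10xxxxxx.
std::size_t countCodePoints(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80)
            ++n;
    return n;
}

// Match a whole UTF-8 string against a pattern with std::regex on char,
// which treats every char (that is, every byte) as one character.
bool bytewiseMatch(const std::string& utf8, const std::string& pattern)
{
    return std::regex_match(utf8, std::regex(pattern));
}

// "použiť" spelled as explicit UTF-8 bytes, so that the example does not
// depend on the encoding of the source file:
// ž = U+017E = C5 BE, ť = U+0165 = C5 A5.
const std::string kWord = "pou\xC5\xBEi\xC5\xA5";
```

The word has 6 code points but 8 bytes, so a byte-oriented engine rejects
"pou.i." and accepts "pou..i..", which matches the observation in the
original report.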


The solution would be regexp matching software that is unicode-aware.
A link to such software:
http://abies.nmsu.edu/pkgsrc/boost/libs/regex/doc/icu_strings.html

Helge Hafting


Re: Regular expression for non-ascii chars, advanced search

2013-03-31 Thread Georg Baum
Helge Hafting wrote:

> On 29 March 2013 13:38, Kornel Benko wrote:
>> I seem unable to find strings using non-ASCII chars (e.g. latin2).
>>
>> (Please try to use UTF-8 encoding to read this mail)
>>
>> The regex search string may be "pou.i.", so I was expecting to find
>> e.g. "použiť". I have to use '..' to find each of these single chars
>> ("pou..i..").
>>
> 
> I believe this happens because the "ž" is encoded as two bytes when
> using UTF-8. And I guess the regexp matching software in use works on
> "bytes", not "characters". So, you are forced to use two periods to
> match the two bytes in "ž". And more, if you want to match Chinese
> characters.

I am not sure if it has anything to do with utf8. The expanded string looks
like it is expanded for LaTeX. This looks quite wrong to me in context of
searching. Why is this done?

> The solution would be regexp matching software that is unicode-aware.
> A link to such software:
> http://abies.nmsu.edu/pkgsrc/boost/libs/regex/doc/icu_strings.html

Indeed. A long time ago I made a similar comment in LaTeXFeatures.cpp.


Georg