Re: Re: Regular expression for non-ascii chars, advanced search
On Sunday, 7 April 2013 at 01:36:52, Tommaso Cucinotta tomm...@lyx.org wrote:

> So, I came up with this trivial patch for the kind of scenario you proposed.
> Simply, export a regexp inset using the text, rather than math, "encoding" rules.
> AFAICS, one might usefully be willing to write text (and special chars) in a
> regexp context.

For normal text it is wonderful :).

> However, it's not conclusive, nor can it be.
>
> Imagine I write the word you were mentioning (použiť) both as regular text in
> a document, AND within a math inset.

When the word is written as textrm inside math, the new algorithm finds the text. Previously that was not the case.

If I uncheck 'ignore format', then I cannot find any string now (regular or not, with or without non-ASCII) :(. But the behaviour seems undefined. The next time I tried to search with a string (copy & paste), it could find the string (iff it was all ASCII). More tests on this confused me even more: I could not see any regularity in when the string is found and when it is not. And not ignoring format is still slow.

> Then, I search for it through Advanced Find.
>
> If I enter the word as simple text in the Find box, then it finds only the text
> counterpart in the document, but it cannot match the math one. If I enter the
> word in math mode, then it's the other way round. If I enter the word in regexp
> mode, then I match one or the other depending on whether you applied my attached
> patch :-).
>
> Now, such a behaviour might have been OK for Ignore Format unchecked, but it
> happens when it's checked as well, and it shouldn't happen.
>
> This is probably one of the many other Advanced Find scenarios that can be
> addressed by modifying the export/write/latex logic, introducing a special
> export mode that carries along the matching options and lets insets export
> what makes sense and is appropriate considering them, rather than trying to
> fix the situation through impossible regexp post-processing after the export
> (the current implementation is very fragile: if one tries to search for
> "{{{", or "\regexp", or a combination of them, or similar, I don't know what
> can happen). Such a focused export for advanced FR should also speed up
> the operation tremendously.
>
> comments ?
>
> T.

I think a dedicated export format would be best.

Kornel
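Editorial illustration of the fragility mentioned above for inputs such as "{{{" or "\regexp": when user text ends up inside a regular expression, one general precaution is to escape every metacharacter first. The sketch below is not LyX code. It assumes std::regex with its default ECMAScript grammar, and the helper names escape_regex_literal and contains_literal are made up for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not part of LyX): prefix every ECMAScript regex
// metacharacter with a backslash so the input is matched literally.
std::string escape_regex_literal(const std::string& literal) {
    static const std::string meta = R"(\^$.|?*+()[]{})";
    std::string escaped;
    escaped.reserve(literal.size() * 2);
    for (char c : literal) {
        if (meta.find(c) != std::string::npos)
            escaped += '\\';
        escaped += c;
    }
    return escaped;
}

// Hypothetical helper: true if `needle` occurs verbatim in `haystack`,
// even when the needle is made of braces, backslashes or dots.
bool contains_literal(const std::string& haystack, const std::string& needle) {
    assert(!needle.empty());
    const std::regex re(escape_regex_literal(needle));
    return std::regex_search(haystack, re);
}
```

This only covers the regex side. It does not address the clash between user input and the markers that the LaTeX export itself emits, which is the part a dedicated export mode would have to solve.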
Re: Regular expression for non-ascii chars, advanced search
On 07/04/13 09:34, Kornel Benko wrote:
> On Sunday, 7 April 2013 at 01:36:52, Tommaso Cucinotta tomm...@lyx.org wrote:
>> I came up with this trivial patch for the kind of scenario you proposed.
>> Simply, export a regexp inset using the text, rather than math, "encoding" rules.
>> AFAICS, one might usefully be willing to write text (and special chars) in a
>> regexp context.
>
> For normal text it is wonderful :).

So, it's in: [6a3792bd/lyxgit]. In the end, using non-ASCII for finding in regular text seems more needed than in maths.

If you can spot other issues after this commit, please let me know (except for finding non-ASCII in maths through regexps, which we know doesn't work now, but perhaps one day I'll find some time to rework the engine).

Thanks,

T.
Re: Regular expression for non-ascii chars, advanced search
On 03/04/13 22:40, Kornel Benko wrote:
> I want to find (as regular expression) the string "použiť". In tex, it looks
> "použi\v{t}".
> But the searched string (as it is displayed while filling the search form) is
> "\regexp{pou\check{z} it\mkern-5mu\mathchar19\endregexp{}}".

Ok, I could reproduce thanks to the file you sent me. From a first impression, it seems to me that the problem is NOT the regexp matching engine. Indeed, when searching with a regexp that contains no non-ASCII char, it works fine and it finds for example

    použ\regexp{i\endregexp{}}ť

However, when the regexp contains non-ASCII chars, it's misinterpreted in the text conversion. I suspect this is because the regexp inset has essentially been derived from a math inset, so it's not expecting any non-ASCII stuff therein, and it's not applying the regular non-ASCII char mangling that is instead done correctly for text.

I'll try to look into it.

T.
Re: Re: Regular expression for non-ascii chars, advanced search
On Saturday, 6 April 2013 at 19:32:58, Tommaso Cucinotta tomm...@lyx.org wrote:

> On 03/04/13 22:40, Kornel Benko wrote:
>> I want to find (as regular expression) the string "použiť". In tex, it looks
>> "použi\v{t}".
>> But the searched string (as it is displayed while filling the search form) is
>> "\regexp{pou\check{z} it\mkern-5mu\mathchar19\endregexp{}}".
>
> Ok, I could reproduce thanks to the file you sent me. From a first impression,
> it seems to me that the problem is NOT the regexp matching engine. Indeed, when
> searching with a regexp that contains no non-ASCII char, it works fine
> and it finds for example
>
>     použ\regexp{i\endregexp{}}ť
>
> However, when the regexp contains non-ASCII chars, it's misinterpreted
> in the text conversion. I suspect this is because the regexp inset
> has essentially been derived from a math inset, so it's not expecting any
> non-ASCII stuff therein, and it's not applying the
> regular non-ASCII char mangling that is instead done correctly for text.
>
> I'll try to look into it.
>
> T.

Yes, all non-ASCII characters are treated as being part of math. Therefore they are replaced according to our unicode file. In most cases we respect the type of InsetMathHull (for regex it is hullRegexp).

Somewhere we split the search string at non-ASCII characters, but I failed to find where.

Kornel
Re: Regular expression for non-ascii chars, advanced search
On 06/04/13 20:04, Kornel Benko wrote:
>> However, when the regexp contains non-ASCII chars, it's misinterpreted
>> in the text conversion. I suspect this is because the regexp inset
>> has essentially been derived from a math inset, so it's not expecting any
>> non-ASCII stuff therein, and it's not applying the
>> regular non-ASCII char mangling that is instead done correctly for text.
>> I'll try to look into it.

So, I came up with this trivial patch for the kind of scenario you proposed. Simply, export a regexp inset using the text, rather than math, "encoding" rules. AFAICS, one might usefully be willing to write text (and special chars) in a regexp context.

However, it's not conclusive, nor can it be.

Imagine I write the word you were mentioning (použiť) both as regular text in a document, AND within a math inset. Then, I search for it through Advanced Find.

If I enter the word as simple text in the Find box, then it finds only the text counterpart in the document, but it cannot match the math one. If I enter the word in math mode, then it's the other way round. If I enter the word in regexp mode, then I match one or the other depending on whether you applied my attached patch :-).

Now, such a behaviour might have been OK for Ignore Format unchecked, but it happens when it's checked as well, and it shouldn't happen.

This is probably one of the many other Advanced Find scenarios that can be addressed by modifying the export/write/latex logic, introducing a special export mode that carries along the matching options and lets insets export what makes sense and is appropriate considering them, rather than trying to fix the situation through impossible regexp post-processing after the export (the current implementation is very fragile: if one tries to search for "{{{", or "\regexp", or a combination of them, or similar, I don't know what can happen). Such a focused export for advanced FR should also speed up the operation tremendously.

comments ?

T.
diff --git a/src/mathed/InsetMathHull.cpp b/src/mathed/InsetMathHull.cpp
index 7002a9b0..4cfdbca8 100644
--- a/src/mathed/InsetMathHull.cpp
+++ b/src/mathed/InsetMathHull.cpp
@@ -1240,7 +1240,8 @@ docstring InsetMathHull::eolString(row_type row, bool fragile, bool latex,
 
 void InsetMathHull::write(WriteStream & os) const
 {
-	ModeSpecifier specifier(os, MATH_MODE);
+	ModeSpecifier specifier(os,
+		type_ == hullRegexp ? TEXT_MODE : MATH_MODE);
 	header_write(os);
 	InsetMathGrid::write(os);
 	footer_write(os);
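Editorial illustration of the idea behind the patch: a scoped guard selects the output mode, and the mode decides how an accented letter is encoded. This is a toy model under stated assumptions, not the LyX implementation. Stream, ModeGuard, HullType, write_zcaron and export_hull are hypothetical stand-ins for WriteStream, ModeSpecifier and the hull type, and the only encodings modelled are the LaTeX text accent \v{z} and the math accent \check{z} quoted in this thread.

```cpp
#include <cassert>
#include <string>

// All names below are hypothetical stand-ins, not the LyX classes.
enum class Mode { Text, Math };
enum class HullType { Equation, Regexp };

struct Stream {
    Mode mode = Mode::Text;
    std::string out;
};

// RAII guard: switch the stream's mode now, restore the old one on scope exit.
class ModeGuard {
public:
    ModeGuard(Stream& s, Mode m) : s_(s), saved_(s.mode) { s_.mode = m; }
    ~ModeGuard() { s_.mode = saved_; }
    ModeGuard(const ModeGuard&) = delete;
    ModeGuard& operator=(const ModeGuard&) = delete;
private:
    Stream& s_;
    Mode saved_;
};

// Toy encoder for one letter, U+017E (z with caron): text mode emits the
// text accent \v{z}, math mode emits the math accent \check{z}.
void write_zcaron(Stream& s) {
    s.out += (s.mode == Mode::Text) ? "\\v{z}" : "\\check{z}";
}

// Mirrors the choice made in the patch: a regexp hull is written with
// text rules, every other hull with math rules.
std::string export_hull(HullType type) {
    Stream s;
    {
        ModeGuard guard(s, type == HullType::Regexp ? Mode::Text : Mode::Math);
        write_zcaron(s);
    }
    assert(s.mode == Mode::Text);  // the guard restored the previous mode
    return s.out;
}
```

In this model, export_hull(HullType::Regexp) yields the same string that plain text would produce, so a literal comparison against exported text can succeed. The same letter inside an ordinary math hull still exports differently, which matches the limitation described in the message.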
Re: Regular expression for non-ascii chars, advanced search
On 02/04/13 10:28, Jean-Marc Lasgouttes wrote:
>> I am not sure if it has anything to do with utf8. The expanded string looks
>> like it is expanded for LaTeX. This looks quite wrong to me in context of
>> searching. Why is this done?
>
> This is how advanced search works.

Confirm. This is how the current implementation works. Nonetheless, alternative (and more efficient) implementations of the idea might be possible.

T.
Re: Re: Regular expression for non-ascii chars, advanced search
On Wednesday, 3 April 2013 at 20:28:59, Tommaso Cucinotta tomm...@lyx.org wrote:

> On 02/04/13 10:28, Jean-Marc Lasgouttes wrote:
>>> I am not sure if it has anything to do with utf8. The expanded string looks
>>> like it is expanded for LaTeX. This looks quite wrong to me in context of
>>> searching. Why is this done?
>>
>> This is how advanced search works.
>
> Confirm. This is how the current implementation works. Nonetheless,
> alternative (and more efficient) implementations of the idea might be
> possible.
>
> T.

I disagree (not confirming).

I want to find (as regular expression) the string "použiť". In tex, it looks "použi\v{t}".
But the searched string (as it is displayed while filling the search form) is
"\regexp{pou\check{z} it\mkern-5mu\mathchar19\endregexp{}}".

It looks more as if it were interpreting mathematical input.

Kornel
Re: Regular expression for non-ascii chars, advanced search
I cannot enter those chars on my keyboard. Can you provide a sample LyX file where I can search and see the problem?

T.

On 03/04/13 22:40, Kornel Benko wrote:
> On Wednesday, 3 April 2013 at 20:28:59, Tommaso Cucinotta tomm...@lyx.org wrote:
>> On 02/04/13 10:28, Jean-Marc Lasgouttes wrote:
>>>> I am not sure if it has anything to do with utf8. The expanded string looks
>>>> like it is expanded for LaTeX. This looks quite wrong to me in context of
>>>> searching. Why is this done?
>>>
>>> This is how advanced search works.
>>
>> Confirm. This is how the current implementation works. Nonetheless,
>> alternative (and more efficient) implementations of the idea might be
>> possible.
>>
>> T.
>
> I disagree (not confirming).
>
> I want to find (as regular expression) the string "použiť". In tex, it looks
> "použi\v{t}".
> But the searched string (as it is displayed while filling the search form) is
> "\regexp{pou\check{z} it\mkern-5mu\mathchar19\endregexp{}}".
>
> It looks more as if it were interpreting mathematical input.
>
> Kornel
Re: Regular expression for non-ascii chars, advanced search
31/03/2013 21:35, Georg Baum:
>> I believe this happens because the "ž" is encoded as two bytes when
>> using UTF-8. And I guess the regexp matching software in use works on
>> "bytes", not "characters". So, you are forced to use two periods to
>> match the two bytes in "ž". And more, if you want to match Chinese
>> characters.
>
> I am not sure if it has anything to do with utf8. The expanded string looks
> like it is expanded for LaTeX. This looks quite wrong to me in context of
> searching. Why is this done?

This is how advanced search works. The argument is that otherwise implementing regexes would have been difficult. It would be nice to have a template-based regex engine that can be applied to a LyX Paragraph.

Concerning the difficulty of using a wildcard for accented characters, I think indeed that utf8 is to blame.

JMarc
Re: Regular expression for non-ascii chars, advanced search
On 29 March 2013 13:38, Kornel Benko wrote:
> I seem unable to find strings using non-ASCII chars (e.g. latin2)
>
> (Please try to use UTF-8 encoding to read this mail)
>
> The regex search string may be "pou.i.", so I was expecting to find
> e.g. "použiť". I have to use '..' to find each of these single chars. ("pou..i..")

I believe this happens because the "ž" is encoded as two bytes when using UTF-8. And I guess the regexp matching software in use works on "bytes", not "characters". So, you are forced to use two periods to match the two bytes in "ž". And more, if you want to match Chinese characters.

The solution would be regexp matching software that is Unicode-aware.
A link to such software:
http://abies.nmsu.edu/pkgsrc/boost/libs/regex/doc/icu_strings.html

Helge Hafting
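Editorial illustration of the byte-versus-character effect described above. It uses the C++ standard library's std::regex and std::wregex purely as an example of a byte-oriented and a character-oriented matcher. Whether LyX's search engine behaves the same way is the guess made in this message, not something the sketch establishes. The names matches_bytes, matches_chars, kWordUtf8 and kWordWide are made up for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Byte-oriented match: std::regex over std::string treats every UTF-8 byte
// as one "character", so '.' consumes a single byte.
bool matches_bytes(const std::string& utf8_text, const std::string& pattern) {
    assert(!pattern.empty());
    return std::regex_match(utf8_text, std::regex(pattern));
}

// Character-oriented match: in a wide string each of these letters is one
// wchar_t, so '.' consumes the whole letter.
bool matches_chars(const std::wstring& text, const std::wstring& pattern) {
    assert(!pattern.empty());
    return std::regex_match(text, std::wregex(pattern));
}

// The word from the thread, spelled with explicit escapes so the example
// does not depend on the encoding of the source file.
const std::string kWordUtf8 = "pou\xC5\xBEi\xC5\xA5";  // 8 bytes in UTF-8
const std::wstring kWordWide = L"pou\u017Ei\u0165";    // 6 characters
```

With these definitions, matches_bytes(kWordUtf8, "pou.i.") is false while matches_bytes(kWordUtf8, "pou..i..") is true, which reproduces the workaround with doubled periods. matches_chars(kWordWide, L"pou.i.") is true, which is the behaviour a Unicode-aware engine would give.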
Re: Regular expression for non-ascii chars, advanced search
Helge Hafting wrote:
> On 29 March 2013 13:38, Kornel Benko wrote:
>> I seem unable to find strings using non-ASCII chars (e.g. latin2)
>>
>> (Please try to use UTF-8 encoding to read this mail)
>>
>> The regex search string may be "pou.i.", so I was expecting to find
>> e.g. "použiť". I have to use '..' to find each of these single chars. ("pou..i..")
>
> I believe this happens because the "ž" is encoded as two bytes when
> using UTF-8. And I guess the regexp matching software in use works on
> "bytes", not "characters". So, you are forced to use two periods to
> match the two bytes in "ž". And more, if you want to match Chinese
> characters.

I am not sure if it has anything to do with utf8. The expanded string looks like it is expanded for LaTeX. This looks quite wrong to me in the context of searching. Why is this done?

> The solution would be regexp matching software that is Unicode-aware.
> A link to such software:
> http://abies.nmsu.edu/pkgsrc/boost/libs/regex/doc/icu_strings.html

Indeed. A long time ago I made a similar comment in LaTeXFeatures.cpp.

Georg