Re: [sword-devel] diatheke search type regex and the dot ?
So, I did a little experimenting this weekend and found that the ICU RegEx engine is actually really capable.

o It's fast.
o It counts {n,m} in characters instead of bytes.
o It even works (though a little slowly) with lookaheads and lookbehinds, e.g., for words in any order:

    (?=.*God)(?=.*world)(?=.*love)

  whereas that fails to compile or simply doesn't work in our other regex engine options.

So, I've added it as an option, --with-icuregex, and actually made it the default in usrinst.sh. You can check it out from trunk or else wait for the next RC. I'm planning to look at the issues Peter mentioned and then push out another RC.

Troy

On 03/06/2017 06:17 PM, Troy A. Griffitts wrote:
> [...]
> We could add an option to use ICU RegexMatcher, but I'm still holding
> out for our compiler.

___
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
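As a rough illustration of the "words in any order" pattern mentioned above, here is a minimal sketch. It uses Python's re module rather than ICU (an assumption made purely so the example is self-contained); the pattern is the one from the message, while the helper name contains_all and the sample strings are made up for illustration.

```python
import re

# Three lookaheads tried at the same position: each one must succeed,
# so all three words have to occur somewhere in the text, in any order.
ANY_ORDER = re.compile(r"(?=.*God)(?=.*world)(?=.*love)")


def contains_all(text):
    """Return True if text contains God, world and love in any order."""
    return ANY_ORDER.search(text) is not None


if __name__ == "__main__":
    print(contains_all("For God so loved the world"))  # all three present
    print(contains_all("God made the world"))          # "love" is missing
```

Note that the lookaheads match substrings, so "loved" satisfies the "love" condition; word boundaries would be needed for whole-word matching.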
Re: [sword-devel] diatheke search type regex and the dot ?
Another possibility is to use Boost.Xpressive [1], which I think supports Perl-style regular expressions at runtime, and also static regular expressions using C++ syntax:

    #include <boost/xpressive/xpressive.hpp>
    using namespace boost::xpressive;
    // sregex rex = sregex::compile( "(\\w+) (\\w+)!" );
    sregex rex = (s1= +_w) >> ' ' >> (s2= +_w) >> '!';

But I suppose you don't want to introduce Boost as a dependency.

J

[1]: http://www.boost.org/doc/libs/1_63_0/doc/html/xpressive.html

On 07.03.2017 03:17, Troy A. Griffitts wrote:
> [...]
> We could add an option to use ICU RegexMatcher, but I'm still holding
> out for our compiler.
Re: [sword-devel] diatheke search type regex and the dot ?
Thanks, Karl,

Xiphos 4.0.4 in Windows 7 x64 gave this:

S:\>xiphos\diatheke -b KJV -s regex -k Abed...nego
Verses containing "Abed...nego"--
Daniel 1:7 ; Daniel 2:49 ; Daniel 3:12 ; Daniel 3:13 ; Daniel 3:14 ; Daniel 3:16 ; Daniel 3:19 ; Daniel 3:20 ; Daniel 3:22 ; Daniel 3:23 ; Daniel 3:26 ; Daniel 3:28 ; Daniel 3:29 ; Daniel 3:30
-- 14 matches total (KJV)

It's evident that in Windows it behaves like it did in Linux after you recompiled without cxx11regex.

Question: Does *regex* mean the same to diatheke search as it does for Xiphos advanced search?

Best regards,

David

PS. I'm sure we can all forgive Greg for the mistaken "off by 2" claim.

--
View this message in context: http://sword-dev.350566.n4.nabble.com/diatheke-search-type-regex-and-the-dot-tp4656879p4656920.html
Sent from the SWORD Dev mailing list archive at Nabble.com.
Re: [sword-devel] diatheke search type regex and the dot ?
On 03/06/2017 09:06 PM, DM Smith wrote:
> Does setting CLANG (or whatever it is) in the env help? In unix you
> have to tell the program what charset you are using.

Those settings already come along for the ride for free as a result of logging in, per the default specification when the system was installed.

$ env|grep -i utf
LC_ALL=en_US.utf8
LANG=en_US.utf8
Re: [sword-devel] diatheke search type regex and the dot ?
Does setting CLANG (or whatever it is) in the env help? In unix you have to tell the program what charset you are using.

Cent from my fone so theer mite be tipos. ;)

> On Mar 6, 2017, at 7:52 PM, Karl Kleinpaste wrote:
> [...]
> The interesting questions are why C++11 regex can't find en dash, and why
> non-C++11 regex doesn't understand multibyte.
Re: [sword-devel] diatheke search type regex and the dot ?
Yeah, so this page shows that C++11 regex is still mostly unsupported in gcc:

http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.tr1

(see section 7)

And the old school GNU regex we use otherwise, I don't think it knows anything about wide chars. It simply compares bytes and doesn't have a clue if some should be considered part of the same character. I suspect that's because nowhere do we tell it that we're giving it UTF-8.

Ultimately my hope is that gcc will improve eventually and solve our problem for us. We could use

We could add an option to use ICU RegexMatcher, but I'm still holding out for our compiler.

Troy

On 03/06/2017 05:52 PM, Karl Kleinpaste wrote:
> [...]
> The interesting questions are why C++11 regex can't find /en dash/,
> and why non-C++11 regex doesn't understand multibyte.
Re: [sword-devel] diatheke search type regex and the dot ?
On 03/06/2017 05:25 PM, Greg Hellings wrote:
> being off by 2 would seem strange to me

I don't understand this question at all.

0xE2 = 226 = 0342
0x80 = 128 = 0200
0x93 = 147 = 0223

There's no off-by error at all.

"od" is the "octal dump" tool; given -c, it tries to dump characters, but outside 7-bit ASCII, it still dumps octal.

For those familiar with dc(1), this will make sense:

$ dc
8o
226p
342
128p
200
147p
223
16i
0XE2p
342
0X80p
200
0X93p
223

The interesting questions are why C++11 regex can't find /en dash/, and why non-C++11 regex doesn't understand multibyte.
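The hex, decimal and octal figures in this message can be double-checked without dc. The following is a small sketch in Python (chosen only for convenience; the helper name byte_table is made up) that encodes the en dash, U+2013, as UTF-8 and prints each byte in all three bases.

```python
EN_DASH = "\u2013"  # the dash used in "Abed-nego" style names in the KJV module


def byte_table(ch):
    """Return (hex, decimal, octal) for each UTF-8 byte of ch."""
    return [(f"0x{b:02X}", b, f"0{b:o}") for b in ch.encode("utf-8")]


if __name__ == "__main__":
    for hex_value, decimal, octal in byte_table(EN_DASH):
        print(f"{hex_value} = {decimal} = {octal}")
    # prints:
    # 0xE2 = 226 = 0342
    # 0x80 = 128 = 0200
    # 0x93 = 147 = 0223
```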
Re: [sword-devel] diatheke search type regex and the dot ?
On Mon, Mar 6, 2017 at 4:15 PM, David Haslam wrote:
> Are we sure it's an "off by 2" error and not just an email typo?

I'm not sure of that at all. It was my first guess, but being off by 2 would seem strange to me, as I would expect a "fat finger" error to produce an off-by-1 or a spurious extra digit added. But Karl would need to verify that.

> I wasn't expecting decimal, I just didn't parse it as octal.

In the context of octal, the values make the most sense as a typo on one side or the other, to me.

--Greg
Re: [sword-devel] diatheke search type regex and the dot ?
Are we sure it's an "off by 2" error and not just an email typo?

I wasn't expecting decimal, I just didn't parse it as octal.

David
Re: [sword-devel] diatheke search type regex and the dot ?
147 = 0223 (octal)
128 = 0200 (octal)
226 = 0340 (octal)

So it's off by 2 in the top order byte. Not sure why, but it seems you're expecting decimal but the tool is obviously giving out octal.

--Greg

On Mon, Mar 6, 2017 at 3:02 PM, David Haslam wrote:
> [...]
> In decimal, these are 226 128 147 so we might well wonder how your tool
> gave 342 200 223 ?
Re: [sword-devel] diatheke search type regex and the dot ?
Thanks Karl,

All the "hyphenated" names in the KJV OT use the *en dash* character U+2013 which has 3 UTF-8 bytes E2 80 93.

In decimal, these are 226 128 147 so we might well wonder how your tool gave 342 200 223 ?

Best regards,

David
Re: [sword-devel] diatheke search type regex and the dot ?
On 03/03/2017 09:16 PM, Troy A. Griffitts wrote:
> SWORD supports compiling with a variety of regex engines

I have an interesting result.

My previous build of sword used --with-cxx11regex, and that failed to find Abednego in any circumstance. Reconfiguring without that option and rebuilding, I now get this result:

$ diatheke -b KJV -s regex -k Abednego
Entries containing "Abednego"-- none (KJV)

$ diatheke -b KJV -s regex -k Abed...nego
Entries containing "Abed...nego"--
Daniel 1:7 ; Daniel 2:49 ; Daniel 3:12 ; Daniel 3:13 ; Daniel 3:14 ; Daniel 3:16 ; Daniel 3:19 ; Daniel 3:20 ; Daniel 3:22 ; Daniel 3:23 ; Daniel 3:26 ; Daniel 3:28 ; Daniel 3:29 ; Daniel 3:30 ;
-- 14 matches total (KJV)

$ diatheke -b KJV -s regex -k Abed..nego
Entries containing "Abed..nego"-- none (KJV)

$ diatheke -b KJV -s regex -k Abed.nego
Entries containing "Abed.nego"-- none (KJV)

What's important here is that the dash in the middle of "Abed-nego" in KJV appears as (from Dan.3.30, passed through "od -c"):

360 d A b e d 342 200 223 n e g o < / w

So diatheke with C++11 regex fails entirely, and diatheke without C++11 regex finds it only when the 3 component bytes of the dash character are specified individually, which is to say, it is unaware of multibyte encoding at all.
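The byte-versus-character behaviour described above can be reproduced outside SWORD. This is only a sketch using Python's re module, not the GNU or C++11 engines diatheke is built with: matching against the raw UTF-8 bytes stands in for a byte-oriented engine, and matching against the decoded string stands in for a Unicode-aware one. The sample text and the helper name matching_dot_counts are made up for illustration.

```python
import re

text = "Shadrach, Meshach, and Abed\u2013nego"  # en dash, U+2013
raw = text.encode("utf-8")                       # what a byte-oriented engine sees


def matching_dot_counts(haystack):
    """Return the dot counts (1 to 4) for which Abed<dots>nego matches."""
    hits = []
    for n in range(1, 5):
        pattern = "Abed" + "." * n + "nego"
        if isinstance(haystack, bytes):
            pattern = pattern.encode("ascii")  # bytes pattern: dot means one byte
        if re.search(pattern, haystack):
            hits.append(n)
    return hits


if __name__ == "__main__":
    print(matching_dot_counts(text))  # character-aware: one dot is enough
    print(matching_dot_counts(raw))   # byte-wise: exactly three dots needed
```

On the decoded string only the single-dot pattern matches, while on the raw bytes only the three-dot pattern does, which mirrors the diatheke results above.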
Re: [sword-devel] diatheke search type regex and the dot ?
Corrigendum: "everything outside ASCII"
Re: [sword-devel] diatheke search type regex and the dot ?
Thanks Troy,

The precise /flavour/ of *regex* supported by diatheke search really needs to be properly documented.

Expecting the *dot* to be a byte when we're handling Unicode is just not on at all.

I'm struggling more because I'm on Windows, where the UTF-16 versus UTF-8 disparity affects everything outside ANSI, but even the friends using diatheke in Linux are having no success with the dot.

The character class *[.,;:]* treats it as just a full-stop punctuation mark.

cf. I'm so used to having to escape the full-stop in most other contexts (e.g. Notepad++ search, TextPipe replace filters, etc).

If *regex* is to be of any real use, we shouldn't leave users to resort to trial and error to see what works.

David
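The point about the dot inside a character class can be illustrated briefly. This sketch uses Python's re module as a stand-in (an assumption; diatheke's engines are different, though bracket expressions treat the dot as a literal in the common regex flavours). The helper name matches is hypothetical.

```python
import re


def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    print(matches(r".", "x"))       # bare dot: any single character
    print(matches(r"[.,;:]", "x"))  # inside brackets the dot is literal
    print(matches(r"[.,;:]", "."))  # so only punctuation marks match
    print(matches(r"\.", "."))      # escaped dot: literal full stop
```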
Re: [sword-devel] diatheke search type regex and the dot ?
SWORD supports compiling with a variety of regex engines, typically GNU regex on most Linux systems. We include an 'internal regex' copy of this, as well. We also will compile against the C++ standard regex engine included in the language spec. Each handles Unicode characters differently.

. is certainly recognized, but I would guess that in whatever regex library you are using during compile, it represents a byte and not a literal character. Try .{1,6}

On 03/03/2017 07:36 AM, David Haslam wrote:
> Created http://tracker.crosswire.org/browse/MODTOOLS-101
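To see why a bounded repeat of the dot works regardless of whether the engine counts bytes or characters, here is a minimal sketch. It uses Python's re module in place of the engines SWORD can be built with (an assumption), writes the interval in the standard {min,max} syntax, and uses a made-up sample string.

```python
import re

WORKAROUND = "Abed.{1,6}nego"  # one to six "units", whatever a unit is

text = "Abed\u2013nego"        # en dash: 1 character, 3 UTF-8 bytes
raw = text.encode("utf-8")

# Character-aware matching: the dash is consumed as one character.
char_hit = re.search(WORKAROUND, text) is not None

# Byte-wise matching: the same dash is consumed as three bytes.
byte_hit = re.search(WORKAROUND.encode("ascii"), raw) is not None

if __name__ == "__main__":
    print(char_hit, byte_hit)
```

The cost of the workaround is looseness: the pattern also accepts up to six arbitrary characters between the two halves of the name.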
Re: [sword-devel] diatheke search type regex and the dot ?
Created http://tracker.crosswire.org/browse/MODTOOLS-101

David
Re: [sword-devel] diatheke search type regex and the dot ?
So what flavour of regex does diatheke actually use under Linux?

Why is it that the *dot metacharacter* is not recognized?

David
Re: [sword-devel] diatheke search type regex and the dot ?
On 03/02/2017 02:14 PM, Greg Hellings wrote:
> I also get no results.

On the other hand...

$ mod2imp KJV | grep -B1 -i abed.nego | fgrep '$$'
$$$Daniel 1:7
$$$Daniel 2:49
$$$Daniel 3:12
$$$Daniel 3:13
$$$Daniel 3:14
$$$Daniel 3:16
$$$Daniel 3:19
$$$Daniel 3:20
$$$Daniel 3:22
$$$Daniel 3:23
$$$Daniel 3:26
$$$Daniel 3:28
$$$Daniel 3:29
$$$Daniel 3:30

Plain old regular expression search ("grep" origin is g/re/p, the ancient syntax in UNIX' original line editor for "global regular expression print") finds them. grep is locale-sensitive, and I have LC_ALL=en_US.utf8.
Re: [sword-devel] diatheke search type regex and the dot ?
Typo was only in the message, sorry!

The actual test in Windows shell with the -k there didn't give any matches.

David
Re: [sword-devel] diatheke search type regex and the dot ?
$ diatheke -b KJV -s regex -k Abed.nego
Verses containing "Abed.nego"-- none (KJV)

Once I correct the command to include the -k parameter, I also get no results.

--Greg

On Thu, Mar 2, 2017 at 12:58 PM, David Haslam wrote:
> I was under the impression that the metacharacter *dot* in a regex means
> "any single character".
>
> It would seem that for diatheke with *-s regex* this is not the case at
> all.
>
> Example:
>
> diatheke -b KJV -s regex Abed.nego
>
> In Windows command shell, that command line does not find the 15 instances
> of the name *Abed–nego* where the *en dash* (U+2013) is the punctuation
> mark in all such names.
>
> What happens in Linux?
>
> David
Re: [sword-devel] diatheke search type regex and the dot ?
I suspect this may be a further symptom of what Greg suggested as the explanation in my other thread, i.e. that SWORD expects to search in UTF-8 encoded text, whereas Windows uses UTF-16 internally.

Still can't quite make out why the dot isn't treated the way regular expressions normally use it.

David