So, I did a little experimenting this weekend and found that the ICU
RegEx engine is actually really capable.
o It's fast.
o Its {n,m} quantifiers count characters instead of bytes.
o It even works (though a little slowly) with lookaheads and lookbehinds,
e.g., for words in any order:
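To illustrate the "words in any order" idea, here is a minimal sketch that chains one lookahead per word. It uses Python's `re` module as a stand-in rather than ICU itself, and the helper name `contains_all_words` is made up for this example.

```python
import re

def contains_all_words(text, words):
    """Return True if every word occurs somewhere in text, in any order."""
    # One zero-width lookahead per word; each one scans from the start of text.
    lookaheads = "".join(r"(?=.*\b" + re.escape(w) + r"\b)" for w in words)
    flags = re.IGNORECASE | re.DOTALL
    return re.search("^" + lookaheads, text, flags) is not None

print(contains_all_words("Shadrach, Meshach, and Abednego", ["meshach", "shadrach"]))
print(contains_all_words("Shadrach, Meshach, and Abednego", ["meshach", "daniel"]))
```

Because every lookahead rescans the text from the start, the cost grows with the number of words, which may be related to the slowness mentioned above.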
Another possibility is to use Boost.Xpressive [1], which I think
supports the Perl regular expressions at runtime, and also static
regular expressions using C++ syntax:
using namespace boost::xpressive;
// sregex rex = sregex::compile( "(\\w+) (\\w+)!" );
sregex rex = (s1= +_w) >> ' '
Thanks, Karl,
Xiphos 4.0.4 in Windows 7 x64 gave this:
S:\>xiphos\diatheke -b KJV -s regex -k Abed...nego
Verses containing "Abed...nego"-- Daniel 1:7 ; Daniel 2:49 ; Daniel 3:12 ;
Daniel 3:13 ; Daniel 3:14 ; Daniel 3:16 ; Daniel 3:19 ; Daniel 3:20 ; Daniel
3:22 ; Daniel 3:23 ; Daniel 3:26 ;
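The three-dot result is what one would expect if the engine matches byte by byte against UTF-8 text. The sketch below reproduces that behaviour with Python's `re` module (not the engine diatheke uses); the sample string is illustrative only.

```python
import re

# Illustrative text containing U+2013 EN DASH, as in the KJV names.
verse = "Shadrach, Meshach, and Abed\u2013nego"
raw = verse.encode("utf-8")  # what a byte-oriented engine sees

# Character-oriented matching: one '.' consumes the whole en dash.
char_one_dot = re.search(r"Abed.nego", verse) is not None

# Byte-oriented matching: the en dash is three bytes (E2 80 93),
# so one '.' is not enough, but three are.
byte_one_dot = re.search(rb"Abed.nego", raw) is not None
byte_three_dots = re.search(rb"Abed...nego", raw) is not None

print(char_one_dot, byte_one_dot, byte_three_dots)
```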
On 03/06/2017 09:06 PM, DM Smith wrote:
> Does setting CLANG (or whatever it is) in the env help? In unix you
> have to tell the program what charset you are using.
They already come along for the ride for free as a result of logging in,
per the default specification when the system was installed.
$
Does setting CLANG (or whatever it is) in the env help? In unix you have to
tell the program what charset you are using.
Cent from my fone so theer mite be tipos. ;)
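The variable meant here is presumably LANG (an assumption; the message itself is unsure of the name). As a rough sketch, a program could work out the charset from the POSIX locale variables like this; `charset_from_env` is a hypothetical helper, not part of SWORD.

```python
import os

def charset_from_env(env):
    """Return the codeset named by the POSIX locale variables, or None."""
    # POSIX precedence for character handling: LC_ALL, then LC_CTYPE, then LANG.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            # A locale name looks like language[_territory][.codeset][@modifier].
            locale_name = value.split("@", 1)[0]
            if "." in locale_name:
                return locale_name.split(".", 1)[1]
            return None
    return None

print(charset_from_env(dict(os.environ)))
print(charset_from_env({"LANG": "en_US.UTF-8"}))
```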
> On Mar 6, 2017, at 7:52 PM, Karl Kleinpaste wrote:
>
>> On 03/06/2017 05:25 PM, Greg Hellings wrote:
>>
Yeah, so this page shows that C++11 regex is still mostly unsupported in GCC:
http://gcc.gnu.org/onlinedocs/libstdc++/manual/status.html#status.iso.tr1
(see section 7)
And I don't think the old-school GNU regex we otherwise use knows
anything about wide chars. It simply compares bytes and
On 03/06/2017 05:25 PM, Greg Hellings wrote:
> being off by 2 would seem strange to me
I don't understand this question at all.
0xE2 = 226 = 0342
0x80 = 128 = 0200
0x93 = 147 = 0223
There's no off-by error at all.
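The conversions above are easy to check mechanically. This small sketch prints each byte in hex, decimal and octal; `od_style` is a made-up helper that only loosely imitates `od -c`.

```python
def od_style(data):
    """Render every byte as three-digit octal.

    Note: `od -c` does this only for bytes it cannot show as characters.
    """
    return " ".join(format(b, "03o") for b in data)

en_dash = bytes([0xE2, 0x80, 0x93])
for b in en_dash:
    print(f"0x{b:02X} = {b:3d} = 0{b:03o}")

print(od_style(en_dash))
# Reading the octal digits back gives the decimal values again.
print([int(field, 8) for field in od_style(en_dash).split()])
```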
"od" is the "octal dump" tool; given -c, it tries to dump characters,
but outside
On Mon, Mar 6, 2017 at 4:15 PM, David Haslam wrote:
> Are we sure it's an "off by 2" error and not just an email typo?
>
I'm not sure of that at all. It was my first guess, but being off by 2
would seem strange to me, as I would expect a "fat finger" error to produce
an
Are we sure it's an "off by 2" error and not just an email typo?
I wasn't expecting decimal, I just didn't parse it as octal.
David
--
View this message in context:
http://sword-dev.350566.n4.nabble.com/diatheke-search-type-regex-and-the-dot-tp4656879p4656914.html
Sent from the SWORD Dev
147 = 0223 (octal)
128 = 0200 (octal)
226 = 0340 (octal)
So it's off by 2 in the top order byte. Not sure why, but it seems you're
expecting decimal but the tool is obviously giving out octal.
--Greg
On Mon, Mar 6, 2017 at 3:02 PM, David Haslam wrote:
> Thanks Karl,
>
>
Thanks Karl,
All the "hyphenated" names in the KJV OT use the *en dash* character U+2013
which has 3 UTF-8 bytes E2 80 93.
In decimal, these are 226 128 147, so we might well wonder how your tool gave
342 200 223?
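For reference, the three bytes follow directly from the UTF-8 bit layout for code points in the range U+0800..U+FFFF. A minimal sketch, with `utf8_three_byte` as a hypothetical helper name:

```python
def utf8_three_byte(code_point):
    """Encode a code point in U+0800..U+FFFF as its three UTF-8 bytes.

    This sketch does not reject surrogates (U+D800..U+DFFF).
    """
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("not a three-byte UTF-8 code point")
    return bytes([
        0xE0 | (code_point >> 12),          # 1110xxxx: top 4 bits
        0x80 | ((code_point >> 6) & 0x3F),  # 10xxxxxx: middle 6 bits
        0x80 | (code_point & 0x3F),         # 10xxxxxx: low 6 bits
    ])

encoded = utf8_three_byte(0x2013)
print([f"{b:02X}" for b in encoded])   # hexadecimal
print(list(encoded))                   # decimal
print([f"{b:o}" for b in encoded])     # octal
```

The same three bytes read as 226 128 147 in decimal and as 342 200 223 in octal, so the two sets of numbers describe the same data in different radices.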
Best regards,
David
On 03/03/2017 09:16 PM, Troy A. Griffitts wrote:
> SWORD supports compiling with a variety of regex engines
I have an interesting result. My previous build of sword used
--with-cxx11regex, and that failed to find Abednego in any circumstance.
Reconfiguring without that option and rebuilding, I
Corrigendum: "everything outside ASCII"
Thanks Troy,
The precise /flavour/ of *regex* supported by diatheke search really needs
to be properly documented.
Expecting the *dot* to be a byte when we're handling Unicode is just not on
at all.
I'm struggling more because I'm on Windows, where the UTF-16 versus UTF-8
disparity affects
SWORD supports compiling with a variety of regex engines-- typically GNU
regex on most Linux systems. We include an 'internal regex' copy of this,
as well. We will also compile against the C++ standard regex engine
included in the language spec. Each handles Unicode characters differently.
. is
Created http://tracker.crosswire.org/browse/MODTOOLS-101
David
So what flavour of regex does diatheke actually use under Linux?
Why is it that the *dot metacharacter* is not recognized?
David
On 03/02/2017 02:14 PM, Greg Hellings wrote:
> I also get no results.
On the other hand...
$ mod2imp KJV | grep -B1 -i abed.nego | fgrep '$$'
$$$Daniel 1:7
$$$Daniel 2:49
$$$Daniel 3:12
$$$Daniel 3:13
$$$Daniel 3:14
$$$Daniel 3:16
$$$Daniel 3:19
$$$Daniel 3:20
$$$Daniel 3:22
$$$Daniel 3:23
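The pipeline above can be imitated in a few lines, which makes it easier to experiment with patterns. This is only a sketch: `keys_matching` is a hypothetical helper, and the sample text is shortened and made up rather than taken from the module. It assumes the imp layout of a `$$$key` line followed by the entry text.

```python
import re

def keys_matching(imp_text, pattern):
    """Return the $$$ keys whose entry text matches pattern (case-insensitive)."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    key = None
    for line in imp_text.splitlines():
        if line.startswith("$$$"):
            key = line[3:]                      # remember the current entry key
        elif key is not None and regex.search(line) and key not in hits:
            hits.append(key)
    return hits

sample = (
    "$$$Daniel 1:6\n"
    "Now among these were of the children of Judah, Daniel\n"
    "$$$Daniel 1:7\n"
    "and to Azariah, of Abed\u2013nego.\n"
)
print(keys_matching(sample, r"abed.nego"))
```

Here the single dot matches because the text is handled as characters; grep presumably does the same when run in a UTF-8 locale.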
Typo was only in the message, sorry!
The actual test in Windows shell with the -k there didn't give any matches.
David
$ diatheke -b KJV -s regex -k Abed.nego
Verses containing "Abed.nego"-- none (KJV)
Once I correct the command to include the -k parameter, I also get no
results.
--Greg
On Thu, Mar 2, 2017 at 12:58 PM, David Haslam wrote:
> I was under the impression that the
I suspect this may be a further symptom of what Greg suggested as the
explanation in my other thread.
i.e., that SWORD expects to search in UTF-8 encoded text, whereas Windows
uses UTF-16 internally.
I still can't quite make out why the dot isn't treated the way regular
expressions normally use it.
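To make the UTF-8 versus UTF-16 difference concrete, the sketch below encodes the same name both ways and counts the units. It illustrates the encodings only and does not show what diatheke or Windows actually do with the command line.

```python
name = "Abed\u2013nego"          # en dash U+2013 between the two halves
en_dash = "\u2013"

utf8 = en_dash.encode("utf-8")
utf16 = en_dash.encode("utf-16-le")
print(utf8.hex(" "), len(utf8))      # bytes of the dash in UTF-8
print(utf16.hex(" "), len(utf16))    # bytes of the dash in UTF-16 (little endian)

# The same nine characters occupy a different number of bytes per encoding.
sizes = {
    "characters": len(name),
    "utf-8 bytes": len(name.encode("utf-8")),
    "utf-16 bytes": len(name.encode("utf-16-le")),
}
print(sizes)
```

A dot that matches one byte therefore cannot span the dash in either encoding, while a dot that matches one character can.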
David
--
I was under the impression that the metacharacter *dot* in a regex means "any
single character".
It would seem that for diatheke with *-s regex* this is not the case at all.
Example:
diatheke -b KJV -s regex Abed.nego
In Windows command shell, that command line does not find the 15 instances