Re: [MarkLogic Dev General] Regular expression bug

Danny Sokolsky Thu, 27 May 2010 19:23:35 -0700

Hi Chris,



Sorry this one slipped through the cracks for a while...better late than never.



We think this is not actually a bug, even though it looks like one at first 
glance. The specification is vague here and leaves certain details to the 
implementation. It says that ungreedy/reluctant quantifiers are required to 
match the shortest possible substring, but it does not give rules for the 
priority of sub-expressions/capturing groups, so that part is up to the 
implementation. If you want to read some gory details, check out 
http://www.w3.org/TR/xpath-functions/#string.match.



POSIX does define such rules, but it doesn't have the notion of ungreedy 
quantifiers. Perl doesn't try to define the rules; there is no such thing as a 
Perl specification. The closest thing is a description of the implementation, 
which tries one candidate match at a time (i.e. backtracking). That is an 
inherently low-performance approach, since on some patterns and inputs the 
number of candidates to try grows very quickly.



So the MarkLogic implementation chose the more performant approach.
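
To make the two readings concrete, here is a rough sketch in Python. Python's 
re module is a backtracking engine in the Perl style, so it stands in for Perl 
here. The helper shortest_overall_replace is hypothetical: it is not how 
MarkLogic works internally, just a brute-force way of picking the shortest 
substring that the whole expression can match from the start of the string. 
Under that assumption it happens to reproduce the "xbcc" result from the 
original report.

```python
import re

s = "aabbcc"
pattern = r"^a*?b+"

# Perl-style backtracking: a*? takes as few "a"s as it can while still
# letting the rest succeed, then b+ greedily takes every "b" -> "aabb".
perl_like = re.sub(pattern, "x", s)


def shortest_overall_replace(pattern, text, replacement):
    """Replace the shortest prefix of text that the whole pattern matches.

    Illustrative only: tries every end position in increasing order and
    stops at the first prefix the pattern matches in full.
    """
    compiled = re.compile(pattern)
    for end in range(len(text) + 1):
        if compiled.fullmatch(text, 0, end):
            return replacement + text[end:]
    return text  # no prefix matched, leave the input unchanged


shortest_overall = shortest_overall_replace(pattern, s, "x")

print(perl_like)         # xcc
print(shortest_overall)  # xbcc
```

Both answers involve a "shortest" match for the a*? part; they differ in 
whether the b+ that follows still gets to be greedy.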



In the 1.0-ml dialect, however, the functions that take a regex accept an 
undocumented “p” flag that does the Perl-like matching (it is an extension to 
the spec, so it is not available in the 1.0 dialect). I think your workaround 
is a better approach, however.



Hope that helps,

-Danny


From: general-boun...@developer.marklogic.com 
[mailto:general-boun...@developer.marklogic.com] On Behalf Of Chris Maloney
Sent: Friday, May 21, 2010 2:54 PM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Regular expression bug

Work-around:
I discovered that if I enclose the first part of the expression in parens, it 
works.
E.g.  replace($s, "^(a*?)b+", "x"): "xcc"

On Fri, May 21, 2010 at 2:56 PM, Maloney, Christopher (NIH/NLM/NCBI) [C] 
<malon...@ncbi.nlm.nih.gov<mailto:malon...@ncbi.nlm.nih.gov>> wrote:
xquery version "1.0";

let $s := "aabbcc"
return
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
      <title>regex-anomaly.xqy</title>
    </head>
    <body>
      <h1>regex-anomaly.xqy</h1>
      <p>
        Demonstrate a MarkLogic regular expression bug.
      </p>
      <p>
        Test string is "{$s}"
      </p>
      <p>
        <b>Works correctly:</b>  match as many "a"s as you can
        (greedily),
        then one or more "b", and replace with an "x", we expect
        "xcc":
      </p>
      <blockquote>
        replace($s, "^a*b+", "x"):
        "{replace($s, "^a*b+", "x")}"
      </blockquote>
      <p>
        <b>Fails:</b>  note how the question mark turns off
        greedy matching for the "*" sign, the result should
        be the same "xcc", but I am seeing "xbcc":
      </p>
      <blockquote>
        replace($s, "^a*?b+", "x"):
        "{replace($s, "^a*?b+", "x")}"
      </blockquote>
      <p>
        I tested this same expression using Perl and using
        Saxon, and they both give the expected results "xcc"
        in both cases.
      </p>
    </body>
  </html>




Chris Maloney
NIH/NLM/NCBI (Contractor)
Building 45, 5AN.36D-17
301-443-6461


_______________________________________________
General mailing list
General@developer.marklogic.com<mailto:General@developer.marklogic.com>
http://developer.marklogic.com/mailman/listinfo/general
