Hi Chris,
Sorry this one slipped through the cracks for a while... better late than never.

We think this is not actually a bug, even though it appears so at first glance. The reason is that the specification is vague and leaves certain details to the implementation. It says that ungreedy/reluctant quantifiers are required to match the shortest possible substring, but it does not give rules for the priority of sub-expressions/capturing groups, so that is up to the implementation. If you want to read some gory details, check out http://www.w3.org/TR/xpath-functions/#string.match.

POSIX does define such rules, but it doesn't have the notion of ungreedy quantifiers. Perl doesn't try to define the rules; there is no such thing as a Perl specification. The closest thing is a description of its implementation, which tries one candidate match at a time (i.e. backtracking), an inherently low-performance approach. The MarkLogic implementation chose the more performant approach instead.

In the 1.0-ml dialect, however, there is an undocumented "p" flag to the functions that take a regex, which does the Perl-like matching (it is an extension to the spec, so it is not available in the 1.0 dialect). I think your workaround is a better approach, however.

Hope that helps,
-Danny

From: general-boun...@developer.marklogic.com [mailto:general-boun...@developer.marklogic.com] On Behalf Of Chris Maloney
Sent: Friday, May 21, 2010 2:54 PM
To: General Mark Logic Developer Discussion
Subject: Re: [MarkLogic Dev General] Regular expression bug

Work-around: I discovered that if I enclose the first part of the expression in parens, it works. E.g.
replace($s, "^(a*?)b+", "x"): "xcc"

On Fri, May 21, 2010 at 2:56 PM, Maloney, Christopher (NIH/NLM/NCBI) [C] <malon...@ncbi.nlm.nih.gov> wrote:

xquery version "1.0";

(: Compare the greedy and reluctant forms of the same pattern on one test string :)
let $s := "aabbcc"
return
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>regex-anomaly.xqy</title>
  </head>
  <body>
    <h1>regex-anomaly.xqy</h1>
    <p>
      Demonstrate a MarkLogic regular expression bug.
    </p>
    <p>
      Test string is "{$s}"
    </p>
    <p>
      <b>Works correctly:</b> match as many "a"s as you can (greedily),
      then one or more "b", and replace with an "x"; we expect "xcc":
    </p>
    <blockquote>
      replace($s, "^a*b+", "x"): "{replace($s, "^a*b+", "x")}"
    </blockquote>
    <p>
      <b>Fails:</b> note how the question mark turns off greedy matching
      for the "*" quantifier. The result should be the same "xcc", but I am
      seeing "xbcc":
    </p>
    <blockquote>
      replace($s, "^a*?b+", "x"): "{replace($s, "^a*?b+", "x")}"
    </blockquote>
    <p>
      I tested this same expression using Perl and using Saxon, and they
      both give the expected result "xcc" in both cases.
    </p>
  </body>
</html>

Chris Maloney
NIH/NLM/NCBI (Contractor)
Building 45, 5AN.36D-17
301-443-6461

_______________________________________________
General mailing list
General@developer.marklogic.com
http://developer.marklogic.com/mailman/listinfo/general