Re: match point

2015-12-22 Thread Chris Angelico
On Tue, Dec 22, 2015 at 9:56 PM, Thierry  wrote:
> Maybe re.match has an implementation that makes it more efficient? But
> then why would I ever use r'\A', since that anchor makes a pattern match
> in only a single position, and is therefore useless in functions like
> re.findall, re.finditer or re.split?

Much of the value of regular expressions is that they need NOT be string
literals in the code; they are just strings. Effectively, someone who has
no authority to change the code of the program can make it behave like
re.match instead of re.search, simply by putting \A at the beginning of
the search string.
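
A minimal sketch of that point, assuming a hypothetical helper find_in()
whose pattern comes from the user rather than from the source code. The
code always calls re.search(); only the pattern decides whether the match
is anchored:

```python
import re

def find_in(text, user_pattern):
    """Hypothetical helper: the pattern is supplied by the user at run time."""
    m = re.search(user_pattern, text)
    return m.group(0) if m else None

text = "spam eggs spam"

unanchored = find_in(text, r"eggs")       # found in the middle of the text
anchored_miss = find_in(text, r"\Aeggs")  # \A restricts the match to position 0
anchored_hit = find_in(text, r"\Aspam")   # "spam" does start the text
```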

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: match point

2015-12-22 Thread Thomas 'PointedEars' Lahn
Thierry wrote:

> Reading the docs about regular expressions, I am under the impression
> that calling
> re.match(pattern, string)
> is exactly the same as
> re.search(r'\A'+pattern, string)

Correct.
 
> Same for fullmatch, that amounts to
> re.search(r'\A'+pattern+r'\Z', string)

Correct.
 
> The docs devote a chapter to "6.2.5.3. search() vs. match()", but they
> only discuss how match() is different from search() with '^', completely
> omitting the case of search() with r'\A'.
> 
> At first I thought those functions could have been introduced at a time
> when r'\A' and r'\Z' did not exist, but then I noticed that re.fullmatch
> is a recent addition (Python 3.4).
> 
> Maybe re.match has an implementation that makes it more efficient? But
> then why would I ever use r'\A', since that anchor makes a pattern match
> in only a single position, and is therefore useless in functions like
> re.findall, re.finditer or re.split?

(Thank you for pointing out “\A” and “\Z”; this strongly suggests that even 
in raw mode you should always match literal “\” with the regular expression 
“\\”, or IOW that you should always use re.escape() when constructing 
regular expressions from arbitrary strings for matching WinDOS/UNC paths, 
for example.)

If you used

  re.search(r'\Afoo.*^bar$.*baz\Z', string, flags=re.DOTALL | re.MULTILINE)

you could match only strings that start with “foo”, have a line following 
that which contains only “bar”, and end with “baz”.  (In multi-line mode, 
the meanings of “^” and “$” change to start-of-line and end-of-line, 
respectively.)
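
A quick check of that expression against a few sample strings. The sample
strings and the names PATTERN and FLAGS are made up for illustration:

```python
import re

PATTERN = r'\Afoo.*^bar$.*baz\Z'
FLAGS = re.DOTALL | re.MULTILINE

good = "foo\nbar\nbaz"              # starts with foo, has a "bar" line, ends with baz
no_bar_line = "foo\nxbar\nbaz"      # no line consists of "bar" alone
wrong_start = "x\nfoo\nbar\nbaz"    # \A still means start of string, not start of line
trailing_newline = "foo\nbar\nbaz\n"  # \Z only matches at the very end of the string

results = {
    name: re.search(PATTERN, value, flags=FLAGS) is not None
    for name, value in [
        ("good", good),
        ("no_bar_line", no_bar_line),
        ("wrong_start", wrong_start),
        ("trailing_newline", trailing_newline),
    ]
}
```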

Presumably, re.fullmatch() was introduced in Python 3.4 so that you can 
write

  re.fullmatch(r'foo.*^bar$.*baz', string, flags=re.DOTALL | re.MULTILINE)

instead when you are not actually searching. That call makes sure that you 
*always* match against the whole string, regardless of the expression.

The documentation says:

| Note that even in MULTILINE mode, re.match() will only match at the 
| beginning of the string and not at the beginning of each line.

and that

| re.search(pattern, string, flags=0)
|   Scan through string looking for the first location where the regular 
|   expression pattern produces a match […]
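
The difference described in these two excerpts can be seen in a short
sketch. The sample text is made up for illustration:

```python
import re

text = "first line\nsecond line"

# re.match() is tied to the start of the string, even in MULTILINE mode.
via_match = re.match(r'second', text, flags=re.MULTILINE)
via_match_caret = re.match(r'^second', text, flags=re.MULTILINE)

# re.search() scans forward, so '^' can match at the start of the second line.
via_search = re.search(r'^second', text, flags=re.MULTILINE)
```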

So having both re.search() and re.fullmatch() gives you more flexibility 
when the expression is constructed dynamically: you can always fall back 
to re.search().



Please add your last name, Thierry #1701.

-- 
PointedEars

Twitter: @PointedEars2
Please do not cc me. / Bitte keine Kopien per E-Mail.


Re: match point

2015-12-22 Thread Thierry Closen

I found the story behind the creation of re.fullmatch(). 

I had no luck before because I was searching under "www.python.org/dev",
while in reality it sprang out of a bug report:
https://bugs.python.org/issue16203

In summary, there were repeated bugs where during maintenance of code
the $ symbol disappeared from patterns, hence the decision to create a
function that anchors the pattern to the end of the string independently
of the presence of that symbol.
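
A minimal sketch of the kind of bug described there, using two hypothetical
validators. The first one stands for code where the trailing $ was lost
during maintenance; the second cannot lose its end anchor because the
function itself supplies it:

```python
import re

def is_number_match(s):
    # Hypothetical validator: the pattern was meant to be r'\d+$',
    # but the '$' was dropped, so trailing garbage is accepted.
    return re.match(r'\d+', s) is not None

def is_number_fullmatch(s):
    # Same check with re.fullmatch(): the whole string must match.
    return re.fullmatch(r'\d+', s) is not None

accepted_by_match = is_number_match('123abc')
accepted_by_fullmatch = is_number_fullmatch('123abc')
plain_number_ok = is_number_fullmatch('123')
```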

I am perplexed by what I discovered, as I would never have thought that
such a prominent function would be created to scratch such a minor itch.
The creation of fullmatch() might address this very specific issue, but 
I would tend to think that if certain symbols really do disappear from
patterns inside a code base, this should be seen as the sign of more
profound problems in the code maintenance processes.

Anyway, the discussion around that bug suggested another argument to me,
one that I find more satisfying:

When I was saying that
re.fullmatch(pattern, string)
is exactly the same as
re.search(r'\A'+pattern+r'\Z', string)
I was wrong.

For example, if the pattern starts with an inline flag like (?i), we
cannot simply stick \A in front of it.
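
A sketch of the inline-flag problem, with a hypothetical naive_fullmatch()
helper. As far as I know, Python 3.11 and later reject a global flag that
is not at the start of the expression with re.error, and 3.6 to 3.10 emit
a DeprecationWarning; older versions accepted it silently. The code below
turns warnings into errors so that both newer behaviours are reported the
same way:

```python
import re
import warnings

pattern = r'(?i)foo'

def naive_fullmatch(pattern, string):
    # Hypothetical helper: plain string concatenation around the pattern.
    return re.search(r'\A' + pattern + r'\Z', string)

# re.fullmatch() handles the inline flag without any fuss.
fullmatch_ok = re.fullmatch(pattern, 'FOO') is not None

try:
    with warnings.catch_warnings():
        warnings.simplefilter('error')   # treat DeprecationWarning as an error
        naive_fullmatch(pattern, 'FOO')
    outcome = 'accepted'                 # expected only on old Python versions
except (re.error, DeprecationWarning):
    outcome = 'rejected'
```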

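Another example: consider the pattern 'a|b'. We end up with:
re.search(r'\Aa|b\Z', string)
which is not what we want, because alternation binds more loosely than
the anchors: the expression is parsed as "\Aa" or "b\Z".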
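
A short sketch of what goes wrong with the concatenated pattern. The test
strings are made up for illustration:

```python
import re

naive = r'\A' + 'a|b' + r'\Z'   # i.e. r'\Aa|b\Z'

# The naive pattern accepts strings that merely start with "a" ...
starts_with_a = re.search(naive, 'abc')
# ... or merely end with "b".
ends_with_b = re.search(naive, 'cab')

# re.fullmatch() requires the whole string to be "a" or "b".
full_abc = re.fullmatch('a|b', 'abc')
full_cab = re.fullmatch('a|b', 'cab')
full_a = re.fullmatch('a|b', 'a')
```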

To avoid that problem we need to add parentheses:
re.search(r'\A('+pattern+r')\Z', string)
But now we have created a group, and if the pattern already contained
groups and backreferences we may just have broken it.
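
Two ways in which the extra group can break an existing pattern, sketched
with a hypothetical wrap_capturing() helper: group numbers shift by one,
and a numbered backreference ends up pointing at the new outer group:

```python
import re

def wrap_capturing(pattern):
    # Hypothetical helper: anchors the pattern using a *capturing* group.
    return r'\A(' + pattern + r')\Z'

# Group numbers shift: group 1 is now the whole match, not the first number.
m = re.search(wrap_capturing(r'(\d+)-(\d+)'), '12-34')
shifted_group_1 = m.group(1)
shifted_group_2 = m.group(2)

# In r'\A((\w)\1)\Z' the backreference \1 refers to the outer group, which
# is still open at that point, so the pattern no longer compiles.
try:
    re.compile(wrap_capturing(r'(\w)\1'))
    backref_ok = True
except re.error:
    backref_ok = False

# The unwrapped pattern works as intended with re.fullmatch().
original = re.fullmatch(r'(\w)\1', 'aa')
```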

So we need to use a non-capturing group:
re.search(r'\A(?:'+pattern+r')\Z', string)
...and now I think we can say we are at a level of complexity where we
cannot reasonably expect the average user to always remember to write
exactly this, so it makes sense to add an easy-to-use fullmatch function
to the re namespace.
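
For comparison, a sketch of the non-capturing wrapper next to
re.fullmatch(). The helper name anchored_search() and the sample cases are
made up, and the wrapper still has the inline-flag limitation mentioned
earlier in this message:

```python
import re

def anchored_search(pattern, string, flags=0):
    # Hypothetical helper: emulates re.fullmatch() with a non-capturing group.
    return re.search(r'\A(?:' + pattern + r')\Z', string, flags)

cases = [
    ('a|b', 'a'),
    ('a|b', 'ab'),
    (r'(\w)\1', 'aa'),
    (r'(\d+)-(\d+)', '12-34'),
]

def summarize(m):
    # Reduce a match object to something comparable: None or its groups.
    return None if m is None else (m.group(0), m.groups())

comparison = [
    (summarize(anchored_search(p, s)), summarize(re.fullmatch(p, s)))
    for p, s in cases
]
```

On these cases both approaches agree, including group numbering, but the
wrapper has to be written out exactly right every time.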

It may not be the real historical reason behind re.fullmatch, but
personally I will stick with that one :)

Cheers,

Thierry

