Re: [Tutor] RE module is working ?

Steven D'Aprano Thu, 03 Feb 2011 17:38:31 -0800

Karim wrote:

*Indeed what's the matter with RE module!?*
You should really fix the problem with your email program first;
Thunderbird issue with bold type (appears as stars) but I don't know how to fix it yet.

A man went to a doctor and said, "Doctor, every time I do this, it hurts. What should I do?"


The doctor replied, "Then stop doing that!"

:)

Don't add bold or any other formatting to things which should be program code. Even if it looks okay in *your* program, you don't know how it will look in other people's programs. If you need to draw attention to something in a line of code, add a comment, or talk about it in the surrounding text.



[...]

That is not the thing I want. I want to escape any " which are not already escaped. The sed regex '/\([^\\]\)\?"/\1\\"/g' is exactly what I need (I have made regex on unix since 15 years).

Which regex? Perl regexes? sed or awk regexes? Extended regexes? GNU posix compliant regexes? grep or egrep regexes? They're all different.

In any case, I am sorry, I don't think your regex does what you say. When I try it, it doesn't work for me.


[steve@sylar ~]$ echo 'Some \"text"' | sed -e 's/\([^\\]\)\?"/\1\\"/g'
Some \\"text\"

I wouldn't expect it to work. See below.

By the way, you don't need to escape the brackets or the question mark:

[steve@sylar ~]$ echo 'Some \"text"' | sed -re 's/([^\\])?"/\1\\"/g'
Some \\"text\"

For me the equivalent python regex is buggy: r'([^\\])?"', r'\1\\"'


No it is not.

The pattern you are matching does not do what you think it does. "Zero or one of not-backslash, followed by a quote" will match a single quote *regardless* of what is before it. This is true even in sed: as you can see above, your sed regex matches both quotes.

\" will match, because the regular expression will match zero characters, followed by a quote. So the regex is behaving correctly.


>>> import re
>>> match = r'[^\\]?"'  # zero or one not-backslash followed by quote
>>> re.search(match, r'aaa\"aaa').group()
'"'
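To see this for every match rather than just the first, here is a small sketch using re.finditer on the same input as the sed example above. The variable names are only for illustration.

```python
import re

# The pattern under discussion: an optional non-backslash, then a quote.
pattern = re.compile(r'([^\\])?"')
text = r'Some \"text"'

# Walk over every match and show what group 1 captured, if anything.
for m in pattern.finditer(text):
    print(m.start(), repr(m.group(0)), repr(m.group(1)))

# Collected for inspection: (whole match, group 1) pairs.
found = [(m.group(0), m.group(1)) for m in pattern.finditer(text)]
```

The already-escaped quote is matched with group 1 equal to None, because the optional group matched zero characters. The final quote is matched together with the 't' before it.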

Now watch what happens when you call re.sub:


>>> match = r'([^\\])?"'  # group 1 equals a single non-backslash
>>> replace = r'\1\\"'  # group 1 followed by \ followed by "
>>> re.sub(match, replace, 'aaaa')  # no matches
'aaaa'
>>> re.sub(match, replace, 'aa"aa')  # one match
'aa\\"aa'
>>> re.sub(match, replace, '"aaaa')  # one match, but there's no group 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.1/re.py", line 166, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/local/lib/python3.1/re.py", line 303, in filter
    return sre_parse.expand_template(template, match)
  File "/usr/local/lib/python3.1/sre_parse.py", line 807, in expand_template
    raise error("unmatched group")
sre_constants.error: unmatched group

Because group 1 was never matched, Python's re.sub raised an error. It is not a very informative error, but it is valid behaviour.


If I try the same thing in sed, I get something different:

[steve@sylar ~]$ echo '"Some text' | sed -re 's/([^\\])?"/\1\\"/g'
\"Some text

It looks like this version of sed defines backreferences on the right-hand side to be the empty string, in the case that they don't match at all. But this is not standard behaviour. The sed FAQs say that this behaviour will depend on the version of sed you are using:

"Seds differ in how they treat invalid backreferences where no corresponding group occurs."


http://sed.sourceforge.net/sedfaq3.html

So you can't rely on this feature. If it works for you, great, but it may not work for other people.
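If you want that empty-string behaviour in Python without depending on how the replacement template treats an unmatched group, one option is to pass a function to re.sub instead of a template. This is only a sketch; sed_like_sub is a name I made up for illustration, not something from the re module.

```python
import re

def sed_like_sub(pattern, text):
    """Replace each match with group 1 (or '' if unmatched) plus \\"."""
    def repl(m):
        # m.group(1) is None when the optional group matched nothing.
        prefix = m.group(1) or ''
        return prefix + '\\"'
    return re.sub(pattern, repl, text)

pattern = r'([^\\])?"'
print(sed_like_sub(pattern, '"Some text'))
print(sed_like_sub(pattern, r'Some \"text"'))
```

This reproduces the two sed results shown above, including the doubled backslash on the quote that was already escaped, so it removes the exception but not the underlying mistake in the pattern.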

When you delete the ? from the Python regex, group 1 is always valid, and you don't get an exception. Or if you ensure the input always matches group 1, no exception:

>>> match = r'([^\\])?"'
>>> replace = r'\1\\"'
>>> re.sub(match, replace, 'a"a"a"a') # group 1 always matches
'a\\"a\\"a\\"a'

(It still won't do what you want, but that's a *different* problem.)
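As a rough illustration of that different problem (this is my reading of it, and the test strings are made up), here is the pattern with the ? removed. Group 1 now always takes part in a match, so the template is safe, but the pattern needs a character in front of every quote and consumes it.

```python
import re

match = r'([^\\])"'   # no ?, so group 1 always participates
replace = r'\1\\"'    # group 1 followed by \ followed by "

# Works when every bare quote has an ordinary character before it.
ok = re.sub(match, replace, r'Some \"text"')

# A quote at the very start has nothing before it, so it is missed.
leading = re.sub(match, replace, '"aaaa')

# In a pair of quotes, the first quote cannot serve as the
# "character before" the second one after it has been matched.
pair = re.sub(match, replace, 'a""a')

print(ok)
print(leading)
print(pair)
```

The first string comes out as wanted, but the leading quote and the second quote of the pair are left unescaped.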

Jamie Zawinski wrote:

  Some people, when confronted with a problem, think "I know,
  I'll use regular expressions." Now they have two problems.

How many hours have you spent trying to solve this problem using regexes? This is a *tiny* problem that requires an easy solution, not wrestling with a programming language that looks like line-noise.


This should do what you ask for:

def escape(text):
    """Escape any double-quote characters if and only if they
    aren't already escaped."""
    output = []
    escaped = False
    for c in text:
        if c == '"' and not escaped:
            output.append('\\')
        elif c == '\\':
            output.append('\\')
            escaped = True
            continue
        output.append(c)
        escaped = False
    return ''.join(output)

Armed with this helper function, which took me two minutes to write, I can do this:

>>> text = 'Some text with backslash-quotes \\" and plain quotes " together.'
>>> print(escape(text))
Some text with backslash-quotes \" and plain quotes \" together.
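A few extra checks are cheap to write for a function like this. The sketch below repeats escape() from above so that it runs on its own; the sample strings are just ones I picked. It also checks that running escape() twice gives the same result as running it once, which is what "only if they aren't already escaped" should imply.

```python
def escape(text):
    """Escape any double-quote characters if and only if they
    aren't already escaped. (Copied from the function above.)"""
    output = []
    escaped = False
    for c in text:
        if c == '"' and not escaped:
            output.append('\\')
        elif c == '\\':
            output.append('\\')
            escaped = True
            continue
        output.append(c)
        escaped = False
    return ''.join(output)

samples = ['"leading quote', 'pair "" of quotes', r'already \"escaped\"']

for s in samples:
    once = escape(s)
    # Escaping a second time must not add any more backslashes.
    assert escape(once) == once
    print(once)
```

Note that the leading quote and the pair of quotes, which trip up the regex approach, are both handled.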

Most problems that people turn to regexes for are best solved without regexes. Even Larry Wall, inventor of Perl, is dissatisfied with regex culture and syntax:


http://dev.perl.org/perl6/doc/design/apo/A05.html



--
Steven
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
