Karim wrote:

*Indeed what's the matter with RE module!?*
You should really fix the problem with your email program first;
Thunderbird issue with bold type (appears as stars) but I don't know how to fix it yet.

A man went to a doctor and said, "Doctor, every time I do this, it hurts. What should I do?"

The doctor replied, "Then stop doing that!"

:)

Don't add bold or any other formatting to things which should be program code. Even if it looks okay in *your* program, you don't know how it will look in other people's programs. If you need to draw attention to something in a line of code, add a comment, or talk about it in the surrounding text.


[...]
That is not the thing I want. I want to escape any " which are not already escaped. The sed regex '/\([^\\]\)\?"/\1\\"/g' is exactly what I need (I have made regex on unix since 15 years).

Which regex? Perl regexes? sed or awk regexes? Extended regexes? GNU or POSIX-compliant regexes? grep or egrep regexes? They're all different.

In any case, I am sorry, I don't think your regex does what you say. When I try it, it doesn't work for me.

[steve@sylar ~]$ echo 'Some \"text"' | sed -e 's/\([^\\]\)\?"/\1\\"/g'
Some \\"text\"

I wouldn't expect it to work. See below.

By the way, you don't need to escape the brackets or the question mark:

[steve@sylar ~]$ echo 'Some \"text"' | sed -re 's/([^\\])?"/\1\\"/g'
Some \\"text\"


For me the equivalent python regex is buggy: r'([^\\])?"', r'\1\\"'

No it is not.

The pattern you are matching does not do what you think it does. "Zero or one of not-backslash, followed by a quote" will match a single quote *regardless* of what is before it. This is true even in sed: as you can see above, your sed regex matches both quotes.

The quote in \" still matches, because the optional part of the regular expression matches zero characters and the quote then matches on its own. So the re module is behaving correctly.

>>> match = r'[^\\]?"'  # zero or one not-backslash followed by quote
>>> re.search(match, r'aaa\"aaa').group()
'"'

Now watch what happens when you call re.sub:


>>> match = r'([^\\])?"'  # group 1 equals a single non-backslash
>>> replace = r'\1\\"'  # group 1 followed by \ followed by "
>>> re.sub(match, replace, 'aaaa')  # no matches
'aaaa'
>>> re.sub(match, replace, 'aa"aa')  # one match
'aa\\"aa'
>>> re.sub(match, replace, '"aaaa')  # one match, but there's no group 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.1/re.py", line 166, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/local/lib/python3.1/re.py", line 303, in filter
    return sre_parse.expand_template(template, match)
  File "/usr/local/lib/python3.1/sre_parse.py", line 807, in expand_template
    raise error("unmatched group")
sre_constants.error: unmatched group

Because group 1 was never matched, Python's re.sub raised an error. It is not a very informative error, but it is valid behaviour.

If I try the same thing in sed, I get something different:

[steve@sylar ~]$ echo '"Some text' | sed -re 's/([^\\])?"/\1\\"/g'
\"Some text

It looks like this version of sed defines backreferences on the right-hand side to be the empty string, in the case that they don't match at all. But this is not standard behaviour. The sed FAQs say that this behaviour will depend on the version of sed you are using:

"Seds differ in how they treat invalid backreferences where no corresponding group occurs."

http://sed.sourceforge.net/sedfaq3.html

So you can't rely on this feature. If it works for you, great, but it may not work for other people.
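
If you do want the sed-like "unmatched group becomes an empty string" result in Python without depending on the version, one option is to pass a function as the replacement instead of a template string. This is only a sketch: the names PATTERN, _repl and sed_like_sub are made up for illustration. (As far as I know, Python 3.5 and later already substitute an empty string for an unmatched group in a template, so the traceback above is specific to older versions.)

```python
import re

# Same pattern as above: an optional non-backslash (group 1), then a quote.
PATTERN = re.compile(r'([^\\])?"')

def _repl(m):
    # group(1) is None when the optional group took no part in the match,
    # so fall back to an empty string, as this version of sed appears to do.
    return (m.group(1) or '') + '\\"'

def sed_like_sub(text):
    """Hypothetical helper: mimic the sed output shown above."""
    return PATTERN.sub(_repl, text)

print(sed_like_sub('"Some text'))      # \"Some text
print(sed_like_sub('Some \\"text"'))   # Some \\"text\"
```

Note that this reproduces the sed output exactly, including the already-escaped quote being escaped a second time.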


When you delete the ? from the Python regex, group 1 is always valid, and you don't get an exception. Or if you ensure the input always matches group 1, no exception:

>>> match = r'([^\\])?"'
>>> replace = r'\1\\"'
>>> re.sub(match, replace, 'a"a"a"a') # group 1 always matches
'a\\"a\\"a\\"a'

(It still won't do what you want, but that's a *different* problem.)
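
The different problem is that the optional group lets the pattern match a quote even when a backslash comes right before it. If you insist on a regex, a negative lookbehind is one way to say "a quote not preceded by a backslash". A minimal sketch, with a made-up function name, and assuming a single preceding backslash always means "already escaped":

```python
import re

def escape_unescaped_quotes(text):
    """Hypothetical helper: put a backslash before each double quote
    that is not immediately preceded by a backslash."""
    # (?<!\\) is a zero-width negative lookbehind: it consumes nothing,
    # so there is no group that can go unmatched.
    return re.sub(r'(?<!\\)"', r'\\"', text)

print(escape_unescaped_quotes('"aaaa'))          # \"aaaa
print(escape_unescaped_quotes('Some \\"text"'))  # Some \"text\"
```

Like the original pattern, this only looks at the one character before the quote, so it treats an escaped backslash followed by a quote as already escaped.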



Jamie Zawinski wrote:

  Some people, when confronted with a problem, think "I know,
  I'll use regular expressions." Now they have two problems.

How many hours have you spent trying to solve this problem using regexes? This is a *tiny* problem that requires an easy solution, not wrestling with a programming language that looks like line-noise.

This should do what you ask for:

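def escape(text):
    """Escape any double-quote characters if and only if they
    aren't already escaped."""
    output = []
    escaped = False  # True if the previous character was a backslash
    for c in text:
        if c == '"' and not escaped:
            # Unescaped quote: put a backslash in front of it.
            output.append('\\')
        elif c == '\\':
            # Copy the backslash through and remember that we saw it.
            output.append('\\')
            escaped = True
            continue
        output.append(c)
        escaped = False
    return ''.join(output)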


Armed with this helper function, which took me two minutes to write, I can do this:

>>> text = 'Some text with backslash-quotes \\" and plain quotes " together.'
>>> print(escape(text))
Some text with backslash-quotes \" and plain quotes \" together.
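
One edge case is worth knowing about: the helper treats any quote that directly follows a backslash as already escaped, even if that backslash is itself escaped (two backslashes, then a quote). If that matters for your data, here is a variant sketch. The name escape_strict is made up, and it assumes a backslash escapes only the single character after it:

```python
def escape_strict(text):
    """Hypothetical variant: a backslash escapes only the one character
    that follows it, so a doubled backslash does not protect a quote."""
    output = []
    escaped = False  # True if the previous character was an unpaired backslash
    for c in text:
        if c == '"' and not escaped:
            output.append('\\')
        output.append(c)
        # A backslash starts an escape unless it is itself being escaped.
        escaped = (c == '\\' and not escaped)
    return ''.join(output)

print(escape_strict('plain " quote'))       # plain \" quote
print(escape_strict('escaped \\" quote'))   # escaped \" quote
print(escape_strict('two \\\\" slashes'))   # two \\\" slashes
```

The only difference from the helper above is the last line of the loop, which toggles the flag instead of always setting it after a backslash.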


Most problems that people turn to regexes for are best solved without regexes. Even Larry Wall, inventor of Perl, is dissatisfied with regex culture and syntax:

http://dev.perl.org/perl6/doc/design/apo/A05.html



--
Steven
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor