On 02/04/2011 02:36 AM, Steven D'Aprano wrote:
Karim wrote:

*Indeed what's the matter with RE module!?*
You should really fix the problem with your email program first;
It is a Thunderbird issue with bold type (it shows up as asterisks), but I don't know how to fix it yet.

A man went to a doctor and said, "Doctor, every time I do this, it hurts. What should I do?"

The doctor replied, "Then stop doing that!"

:)

Yes, these words made me laugh. I will keep them in my funny box.



Don't add bold or any other formatting to things which should be program code. Even if it looks okay in *your* program, you don't know how it will look in other people's programs. If you need to draw attention to something in a line of code, add a comment, or talk about it in the surrounding text.


[...]
That is not what I want. I want to escape any " that is not already escaped. The sed regex '/\([^\\]\)\?"/\1\\"/g' is exactly what I need (I have been writing regexes on Unix for 15 years).

Mainly sed, awk and perl, sometimes grep and egrep. I know this is a jungle.

Which regex? Perl regexes? sed or awk regexes? Extended regexes? GNU posix compliant regexes? grep or egrep regexes? They're all different.

In any case, I am sorry, I don't think your regex does what you say. When I try it, it doesn't work for me.

[steve@sylar ~]$ echo 'Some \"text"' | sed -e 's/\([^\\]\)\?"/\1\\"/g'
Some \\"text\"

I give you my word on this. Here is the exact output; I ran it again:

#MY OS VERSION
karim@Requiem4Dream:~$ uname -a
Linux Requiem4Dream 2.6.32-28-generic #55-Ubuntu SMP Mon Jan 10 23:42:43 UTC 2011 x86_64 GNU/Linux
#MY SED VERSION
karim@Requiem4Dream:~$ sed --version
GNU sed version 4.2.1
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE,
to the extent permitted by law.

GNU sed home page: <http://www.gnu.org/software/sed/>.
General help using GNU software: <http://www.gnu.org/gethelp/>.
E-mail bug reports to: <bug-gnu-ut...@gnu.org>.
Be sure to include the word ``sed'' somewhere in the ``Subject:'' field.
#MY SED OUTPUT COMMAND:
karim@Requiem4Dream:~$  echo 'Some ""' | sed -e 's/\([^\\]\)\?"/\1\\"/g'
Some \"\"
# THIS IS WHAT I WANT: WITH 2 CONSECUTIVE QUOTES, IF THE FIRST ONE IS ALREADY ESCAPED I DON'T WANT TO ESCAPE IT TWICE.
karim@Requiem4Dream:~$ echo 'Some \""' | sed -e 's/\([^\\]\)\?"/\1\\"/g'
Some \"\"
# BY THE WAY THIS ONE WORKS:
karim@Requiem4Dream:~$ echo 'Some "text"' | sed -e 's/\([^\\]\)\?"/\1\\"/g'
Some \"text\"
# BUT NOT THIS ONE, WHICH MY REGEX DOES NOT COVER (I KNOW, AND I ORIGINALLY WANTED TO COVER IT):
karim@Requiem4Dream:~$ echo 'Some \"text"' | sed -e 's/\([^\\]\)\?"/\1\\"/g'
Some \\"text\"

By the way, in every sed version I have worked with, the '?' (zero or one match) has to be escaped, which is why I wrote '\?'. The same goes for the grouping parentheses '\(' and '\)' that save a value. In perl or grep you don't need to escape them.

# SAMPLE FROM http://www.gnu.org/software/sed/manual/sed.html

|\+|
   same As |*|, but matches one or more. It is a GNU extension.
|\?|
   same As |*|, but only matches zero or one. It is a GNU extension

I wouldn't expect it to work. See below.

By the way, you don't need to escape the brackets or the question mark:

[steve@sylar ~]$ echo 'Some \"text"' | sed -re 's/([^\\])?"/\1\\"/g'
Some \\"text\"


For me the equivalent python regex is buggy: r'([^\\])?"', r'\1\\"'

No it is not.


Yes, I know; see my latest post for details. I already found the solution, and I repeat it below:

#Found the solution: '?' needs to be inside the parentheses (the saved pattern), because outside them we don't know
#whether the saved match, namely '\1', will exist or not.

>>> re.subn(r'([^\\]?)"', r'\1\\"', expression)

(' \\"\\" ', 2)
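A self-contained way to try this is sketched below. The value of `expression` is not shown in the thread, so the input here is an assumption, chosen because it reproduces the result above. With the '?' inside the parentheses, group 1 always takes part in the match (possibly as an empty string), so \1 is always defined.

```python
import re

# Assumed input: the thread does not show what `expression` held.
expression = ' "" '

# The '?' is inside the group, so group 1 always participates in the match
# (it may be empty), and \1 in the replacement is always defined.
result = re.subn(r'([^\\]?)"', r'\1\\"', expression)
print(result)
```

Note that this pattern still re-escapes the first quote in input such as 'Some \"text"', because the quote after the backslash can be matched with an empty group 1.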


The pattern you are matching does not do what you think it does. "Zero or one of not-backslash, followed by a quote" will match a double-quote character *regardless* of what is before it. This is true even in sed: as you can see above, your sed regex matches both quotes.

\" will match, because the regular expression will match zero characters, followed by a quote. So the regex is correct.

>>> match = r'[^\\]?"'  # zero or one not-backslash followed by quote
>>> re.search(match, r'aaa\"aaa').group()
'"'

Now watch what happens when you call re.sub:


>>> match = r'([^\\])?"'  # group 1 equals a single non-backslash
>>> replace = r'\1\\"'  # group 1 followed by \ followed by "
>>> re.sub(match, replace, 'aaaa')  # no matches
'aaaa'
>>> re.sub(match, replace, 'aa"aa')  # one match
'aa\\"aa'
>>> re.sub(match, replace, '"aaaa')  # one match, but there's no group 1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.1/re.py", line 166, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/usr/local/lib/python3.1/re.py", line 303, in filter
    return sre_parse.expand_template(template, match)
  File "/usr/local/lib/python3.1/sre_parse.py", line 807, in expand_template
    raise error("unmatched group")
sre_constants.error: unmatched group

Because group 1 was never matched, Python's re.sub raised an error. It is not a very informative error, but it is valid behaviour.
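The traceback above is version-specific: from Python 3.5 onwards, re.sub substitutes an empty string for an unmatched group instead of raising. A minimal sketch that behaves the same on old and new versions is to pass a function as the replacement and handle the missing group explicitly. The helper name `escape_optional` is hypothetical, and the pattern is the same flawed one discussed here, so it mirrors the sed behaviour, including the double escaping of \".

```python
import re

# Same pattern as above: optional group 1, then a double quote.
PATTERN = re.compile(r'([^\\])?"')

def escape_optional(text):
    """Hypothetical helper: emulate sed's 'empty string for an unmatched group'."""
    def repl(m):
        # m.group(1) is None when the optional group did not take part in the match.
        prefix = m.group(1) or ''
        return prefix + '\\"'
    return PATTERN.sub(repl, text)

print(escape_optional('"aaaa'))
print(escape_optional('aa"aa'))
```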

If I try the same thing in sed, I get something different:

[steve@sylar ~]$ echo '"Some text' | sed -re 's/([^\\])?"/\1\\"/g'
\"Some text

It looks like this version of sed defines backreferences on the right-hand side to be the empty string, in the case that they don't match at all. But this is not standard behaviour. The sed FAQs say that this behaviour will depend on the version of sed you are using:

"Seds differ in how they treat invalid backreferences where no corresponding group occurs."

http://sed.sourceforge.net/sedfaq3.html

So you can't rely on this feature. If it works for you, great, but it may not work for other people.


When you delete the ? from the Python regex, group 1 is always valid, and you don't get an exception. Or if you ensure the input always matches group 1, no exception:

>>> match = r'([^\\])?"'
>>> replace = r'\1\\"'
>>> re.sub(match, replace, 'a"a"a"a') # group 1 always matches
'a\\"a\\"a\\"a'

(It still won't do what you want, but that's a *different* problem.)
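For comparison, one way a regex could express "a quote that is not preceded by a backslash" is a negative lookbehind, which checks the previous character without consuming it. This is only a sketch, not something proposed in the thread; the names `UNESCAPED_QUOTE` and `escape_quotes` are made up for illustration. It treats any backslash directly before a quote as an escape, so an escaped backslash followed by a quote is left untouched.

```python
import re

# Match a double quote only when the character before it is not a backslash.
# The lookbehind (?<!\\) inspects the previous character without consuming it.
UNESCAPED_QUOTE = re.compile(r'(?<!\\)"')

def escape_quotes(text):
    """Hypothetical helper: add a backslash before every unescaped double quote."""
    return UNESCAPED_QUOTE.sub(r'\\"', text)

print(escape_quotes('Some \\"text"'))
```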



Jamie Zawinski wrote:

  Some people, when confronted with a problem, think "I know,
  I'll use regular expressions." Now they have two problems.

How many hours have you spent trying to solve this problem using regexes? This is a *tiny* problem that requires an easy solution, not wrestling with a programming language that looks like line-noise.

This should do what you ask for:

def escape(text):
    """Escape any double-quote characters if and only if they
    aren't already escaped."""
    output = []
    escaped = False
    for c in text:
        if c == '"' and not escaped:
            output.append('\\')
        elif c == '\\':
            output.append('\\')
            escaped = True
            continue
        output.append(c)
        escaped = False
    return ''.join(output)


Thank you for this one! This gives me some inspiration for other more complicated parsing. :-)



Armed with this helper function, which took me two minutes to write, I can do this:

>>> text = 'Some text with backslash-quotes \\" and plain quotes " together.'
>>> print(escape(text))
Some text with backslash-quotes \" and plain quotes \" together.
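As a quick check, the four inputs from the sed session earlier in this message can be run through the same logic. The function is restated here only so the snippet runs on its own; the `cases` table is an illustration built from those inputs, with the expected values being what the original poster said he wanted.

```python
def escape(text):
    """Same logic as the helper above, restated so this snippet is self-contained."""
    output = []
    escaped = False
    for c in text:
        if c == '"' and not escaped:
            output.append('\\')
        elif c == '\\':
            output.append('\\')
            escaped = True
            continue
        output.append(c)
        escaped = False
    return ''.join(output)

# Inputs taken from the sed session; expected values are the desired results.
cases = {
    'Some ""': 'Some \\"\\"',
    'Some \\""': 'Some \\"\\"',
    'Some "text"': 'Some \\"text\\"',
    'Some \\"text"': 'Some \\"text\\"',  # the case the sed regex got wrong
}
for raw, expected in cases.items():
    assert escape(raw) == expected, (raw, escape(raw))
```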


Most problems that people turn to regexes for are best solved without regexes. Even Larry Wall, inventor of Perl, is dissatisfied with regex culture and syntax:

http://dev.perl.org/perl6/doc/design/apo/A05.html

OK, but if I have to give up my one-liner sed regexes, which are among the utilities I use most, it is like refusing to use my car to go to work
and walking 20 km instead.
I can understand the point about overuse, though I have already written 30 lines of pure sed script using all its features,
which would have taken many more lines in awk or perl.

Anyway, I am leaning towards Python now, so if a re module exists that handles my small regex, it is no big deal to become familiar with it.

Thanks for all your efforts.

Regards
Karim

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
