Re: Questions about regex

2009-05-30 Thread Rob Williscroft
 wrote in news:fe9f707f-aaf3-4ca6-859a-5b0c63904fc0
@s28g2000vbp.googlegroups.com in comp.lang.python:


>  text = re.sub('(\<(/?[^\>]+)\>)', "", text)#remove the HTML
> 

Python has a /r/ (raw) string literal type for regex's:

  text = re.sub( r'(\<(/?[^\>]+)\>)', "", text )

In raw strings Python doesn't process backslash escape sequences,
so r'\n' is the 2-char string '\\n' (a backslash followed by an 'n').

Without that your pattern string would need to be written as:

  '(\\<(/?[^\\>]+)\\>)'

IOW backslashes should be doubled up, or Python may process them as
escape sequences before they are passed to re.sub.

Also this seems to be some non-Python dialect of regular expression
language; Python's re patterns don't need to escape < and >.
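
As a rough illustration of the point about raw strings (a minimal
sketch; the variable names are made up for this example):

```python
# A raw literal keeps the backslash; a normal literal turns \n into a newline.
raw = r'\n'
cooked = '\n'
print(len(raw), len(cooked))   # the raw string is longer by one character
print(raw == '\\n')            # raw form equals the doubled-backslash form

# The raw pattern and the doubled-backslash pattern are the same string.
pattern_raw = r'(\<(/?[^\>]+)\>)'
pattern_doubled = '(\\<(/?[^\\>]+)\\>)'
print(pattern_raw == pattern_doubled)
```

Either spelling hands the same characters to re.sub; the raw form is
just easier to read.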

http://docs.python.org/library/re.html

The grouping operators, '(' and ')', appear to be unnecessary,
so altogether this 1 line should probably be:

  text = re.sub( r'<[^>]+>', '', text )
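
A quick way to check that dropping the escapes and the groups changes
nothing, assuming the simplified pattern is <[^>]+> and using a
made-up sample string:

```python
import re

sample = '<p>Hello, <b>world</b>!</p>'   # hypothetical sample input

escaped = re.sub(r'(\<(/?[^\>]+)\>)', '', sample)   # pattern from the original post
plain = re.sub(r'<[^>]+>', '', sample)              # no escapes, no groups

print(escaped)
print(plain)
```

Both calls should print the same tag-free text.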

Rob.
-- 
http://www.victim-prime.dsl.pipex.com/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Questions about regex

2009-05-30 Thread Steven D'Aprano
On Fri, 29 May 2009 11:26:07 -0700, Jared.S.Bauer wrote:

> Hello,
> 
> I'm new to python and I'm having problems with a regular expression. I
> use textmate as my editor and when I run the regex in textmate it works
> fine, but when I run it as part of the script it freezes. Could anyone
> help me figure out why this is happening and how to fix it.


Sure. To figure out why it is happening, the first thing you must do is 
figure out *what* is happening. So first you have to isolate the fault: 
what part of your script is freezing?

I'm going to assume that it is the regex:

> #The two following lines are the ones giving me the problems
>   text = re.sub("w:(.|\s)*?\n", "", text) 
>   text = re.sub("UnhideWhenUsed=(.|\s)*?\n", "", text)

What happens when you call those two lines in isolation, away from the 
rest of your script? (Obviously you need to initialise a value for text.)
Do they still freeze?

For example, I can do this:

>>> import re
>>> text = "Nobodyw: \n expects the Spanish Inquisition!"
>>> text = re.sub("w:(.|\s)*?\n", "", text)
>>> text = re.sub("UnhideWhenUsed=(.|\s)*?\n", "", text)
>>> text
'Nobody expects the Spanish Inquisition!'

and it doesn't freeze. It works fine.

I suspect that your problem is that the regex hasn't actually *frozen*, 
it's just taking a very, very long time to complete. My guess is that it 
probably has something to do with:

(.|\s)*?

This says, "Match any number, but as few as possible, of any character 
or whitespace". Because \s matches newlines as well, the regular 
expression engine may need to do a lot of backtracking, which means it 
will be slow for large amounts of data. You want to reduce the amount of 
backtracking that's needed!
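
One way to see the backtracking cost is to time both forms on growing
input. This is only a sketch: the input shape ('w:' followed by spaces
and no newline at all) is my guess at a bad case, not something taken
from the original data, and time_search is a helper invented for the
example.

```python
import re
import time

slow = re.compile(r"w:(.|\s)*?\n")   # pattern from the original script
fast = re.compile(r"w:.*?\n")        # same idea without the alternation

def time_search(pattern, text):
    """Return (match_or_None, seconds) for one search over text."""
    start = time.perf_counter()
    match = pattern.search(text)
    return match, time.perf_counter() - start

# Assumed bad case: every space can be matched by either '.' or '\s',
# and the missing newline forces the engine to try the combinations.
for n in (8, 12, 16):
    text = "w:" + " " * n
    _, slow_secs = time_search(slow, text)
    _, fast_secs = time_search(fast, text)
    print(n, round(slow_secs, 6), round(fast_secs, 6))
```

If the first column of timings grows much faster than the second as n
increases, the alternation is the likely culprit. Keep n small when
experimenting.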

I *guess* that what you probably want is:

w:.*?\n

which will match the letter 'w' followed by ':' followed by the smallest 
number of arbitrary characters, including spaces *but not newlines*, 
followed by a newline.

The second regex will probably need a similar change made.
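
Applied to both lines, the change might look like this. The sample
text is invented, since the real CSV contents aren't shown in the
thread:

```python
import re

# Hypothetical sample; only the 'w:' and 'UnhideWhenUsed=' lines should go.
text = ("keep this\n"
        "w:LsdException Locked=false\n"
        "UnhideWhenUsed=false Name=x\n"
        "and this\n")

text = re.sub(r"w:.*?\n", "", text)
text = re.sub(r"UnhideWhenUsed=.*?\n", "", text)
print(repr(text))
```

Each substitution removes from the marker to the end of that line and
leaves the other lines alone.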

But don't take my word for it: I'm not a regex expert. Isolate the 
fault, identify when it is happening (for all input data, or only for 
large amounts of data?), and then you have a shot at fixing it.



-- 
Steven


Re: Questions about regex

2009-05-30 Thread bearophileHUGS
Jared.S., even if a regex doesn't look like a program, it's like a
small program written in a strange language. And you have to test and
comment your programs.
So I suggest you program in a tidier way and add unit tests
(doctests may suffice here) to your regexes. You can also use the
verbose mode and comment them, and you can even indent their sub-parts
like pieces of a program.
You must test all your bricks (in Python, not in TextMate) before
using them to build something bigger.
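
For instance, a commented regex with doctests could look something
like this. It is only a sketch: TAG and strip_tags are names invented
here, and the pattern is the tag-stripping one discussed elsewhere in
the thread.

```python
import re

# Hypothetical tidy version of the tag-stripping pattern, in verbose mode.
TAG = re.compile(r"""
    <          # opening angle bracket
    [^>]+      # one or more characters that are not a closing bracket
    >          # closing angle bracket
    """, re.VERBOSE)

def strip_tags(text):
    """Remove anything that looks like an HTML tag.

    >>> strip_tags('<b>bold</b> text')
    'bold text'
    >>> strip_tags('no tags here')
    'no tags here'
    """
    return TAG.sub('', text)

if __name__ == '__main__':
    import doctest
    doctest.testmod()   # silent when every example passes
```

Running the file directly executes the doctests, so each brick gets
checked in Python before it is used in the bigger script.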

Bye,
bearophile


Re: Questions about regex

2009-05-29 Thread Bobby
On May 29, 1:26 pm, jared.s.ba...@gmail.com wrote:
> Hello,
>
> I'm new to python and I'm having problems with a regular expression. I
> use textmate as my editor and when I run the regex in textmate it
> works fine, but when I run it as part of the script it freezes. Could
> anyone help me figure out why this is happening and how to fix it.
> Here is the script:
>
> ==
> # regular expression search and replace
> import sys, os, re, string, csv
>
> #Open the file and taking its data
> myfile=open('Steve_query3.csv') #Steve_query_test.csv
> #create an error flag  to loop the script twice
> #store all file's data in the string object 'text'
> myfile.seek(0)
> text = myfile.read()
>
> for i in range(2):
>         #def textParse(text, reRun):
>         print 'how many times is this getting executed', i
>
>         #Now to create the newfile 'test' and write our 'text'
>         newfile = open('Steve_query3_out.csv', 'w')
>         #open the new file and set it with 'w' for "write"
>         #loop trough 'text' clean them up and write them into the 'newfile'
>                         #sub(   pattern, repl, string[, count])
>                         #"sub("(?i)b+", "x", " ")" returns 'x x'.
>         text = re.sub('(\<(/?[^\>]+)\>)', "", text)#remove the HTML
>         text = re.sub('//', "", text) #remove comments  
>         text = re.sub('\/\*(.|\s)*?;}', "", text) #remove css formatting
>         #remove a bunch of word formatting yuck
>         text = re.sub("&nbsp;", " ", text)
>         text = re.sub("&lt;", "<", text)
>         text = re.sub("&gt;", ">", text)
>         text = re.sub("&quot;|&rquot;|&ldquo;", "\'", text)
> #===
> #The two following lines are the ones giving me the problems
>         text = re.sub("w:(.|\s)*?\n", "", text)
>         text = re.sub("UnhideWhenUsed=(.|\s)*?\n", "", text)
> #===
>         text = re.sub(re.compile('^\r?\n?$', re.MULTILINE), '', text) #remove the extra whitespace
>         #now write out the new file and close it
>         newfile.write(text)
>         newfile.close()
>
>         #open the newfile and run the script again
>         #Open the file and taking its data
>
>         myfile=open('Steve_query3_out.csv') #Steve_query_test.csv
>         #store all file's data in the string object 'text'
>         myfile.seek(0)
>         text = myfile.read()
>
> Thanks for the help,
>
> -Jared

Can you give a string that you would expect the regex to match and
what the expected result would be? Currently, it looks like the
interesting part of the regex, (.|\s)*?, would match a run of any
characters, of any length. There seems to be some redundancy that makes
it more confusing than it needs to be. I'm pretty sure that . will also
match anything that \s will match, apart from a newline, or maybe you
just need to escape . because you meant for it to be a literal.
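
A quick check of how . and \s overlap, and of what escaping the dot
does (a small sketch using made-up sample strings):

```python
import re

dot_default = re.findall(r".", "a \n")            # '.' skips the newline by default
whitespace = re.findall(r"\s", "a \n")            # '\s' matches the space and the newline
dot_all = re.findall(r".", "a \n", re.DOTALL)     # with DOTALL, '.' matches the newline too
literal_dot = re.findall(r"\.", "a.b")            # an escaped dot matches only a literal '.'

print(dot_default, whitespace, dot_all, literal_dot)
```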