Re: Questions about regex
wrote in news:fe9f707f-aaf3-4ca6-859a-5b0c63904fc0@s28g2000vbp.googlegroups.com in comp.lang.python:

> text = re.sub('(\<(/?[^\>]+)\>)', "", text)#remove the HTML

Python has a raw (r'...') string literal type that suits regexes:

    text = re.sub( r'(\<(/?[^\>]+)\>)', "", text )

In raw strings Python doesn't process backslash escape sequences, so r'\n' is the 2-character string '\\n' (a backslash followed by an 'n'). Without that, your pattern string should be written as:

    '(\\<(/?[^\\>]+)\\>)'

IOW backslashes should be doubled up, or Python will process any escape sequences it recognises (such as '\n' or '\b') before the string is passed to re.sub. ('\<' and '\>' are not recognised escapes, so they happen to survive unchanged, but it is better not to rely on that.)

Also, this seems to be some non-Python dialect of the regular expression language; Python's re's don't need to escape < and >.

http://docs.python.org/library/re.html

The grouping operators, '(' and ')', appear to be unnecessary, so altogether this one line should probably be:

    text = re.sub( r'<[^>]+>', '', text )

Rob.
-- http://www.victim-prime.dsl.pipex.com/
-- http://mail.python.org/mailman/listinfo/python-list
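A minimal sketch of the raw-string and tag-stripping points above. The sample string is made up for illustration and is not taken from the original CSV file.

```python
import re

# In a raw string the backslash is kept as-is: two characters, '\' and 'n'.
raw_newline = r'\n'
assert len(raw_newline) == 2
assert raw_newline == '\\n'

# Hypothetical sample input, not from the original poster's data.
sample = '<p>Hello, <b>world</b>!</p>'

# '<' and '>' need no escaping in Python's re module, and no groups are needed.
TAG = re.compile(r'<[^>]+>')
stripped = TAG.sub('', sample)

# The original pattern, written as a raw string, gives the same result here.
legacy = re.sub(r'(\<(/?[^\>]+)\>)', '', sample)

print(stripped)  # Hello, world!
print(legacy == stripped)
```

Compiling the pattern once with re.compile is optional; it only helps readability when the same pattern is reused.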
Re: Questions about regex
On Fri, 29 May 2009 11:26:07 -0700, Jared.S.Bauer wrote:

> Hello,
>
> I'm new to python and I'm having problems with a regular expression. I
> use textmate as my editor and when I run the regex in textmate it works
> fine, but when I run it as part of the script it freezes. Could anyone
> help me figure out why this is happening and how to fix it.

Sure. To figure out why it is happening, the first thing you must do is figure out *what* is happening. So first you have to isolate the fault: what part of your script is freezing?

I'm going to assume that it is the regex:

> #The two following lines are the ones giving me the problems
> text = re.sub("w:(.|\s)*?\n", "", text)
> text = re.sub("UnhideWhenUsed=(.|\s)*?\n", "", text)

What happens when you call those two lines in isolation, away from the rest of your script? (Obviously you need to initialise a value for text.) Do they still freeze?

For example, I can do this:

>>> import re
>>> text = "Nobodyw: \n expects the Spanish Inquisition!"
>>> text = re.sub("w:(.|\s)*?\n", "", text)
>>> text = re.sub("UnhideWhenUsed=(.|\s)*?\n", "", text)
>>> text
'Nobody expects the Spanish Inquisition!'

and it doesn't freeze. It works fine.

I suspect that your problem is that the regex hasn't actually *frozen*, it's just taking a very, very long time to complete. My guess is that it probably has something to do with:

    (.|\s)*?

This says, "Match any character or whitespace, any number of times, but as few times as possible". This will match newlines as well, and '.' and '\s' overlap (both match a space or a tab), so there is more than one way to match the same text. The regular expression engine will need to do backtracking through all of those ways, which means it will be slow for large amounts of data. You want to reduce the amount of backtracking that's needed!

I *guess* that what you probably want is:

    w:.*?\n

which will match the letter 'w' followed by ':' followed by the fewest possible arbitrary characters, including spaces *but not newlines*, followed by a newline.

The second regex will probably need a similar change made.

But don't take my word for it: I'm not a regex expert. But isolate the fault, identify when it is happening (for all input data, or only for large amounts of data?), and then you have a shot at fixing it.

-- Steven
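A rough sketch of how the two patterns could be compared in isolation. The sample text mirrors the interactive example above; the helper name time_search and the input sizes are made up for illustration. The sizes are kept small on purpose: on an input where 'w:' is followed by spaces and no newline, each extra space should roughly double the work for the original pattern, so large values may take a very long time.

```python
import re
import time

SLOW = re.compile(r'w:(.|\s)*?\n')   # pattern from the original script
FAST = re.compile(r'w:.*?\n')        # suggested replacement

# On input that does contain a newline, both patterns remove the same text.
text = "Nobodyw: \n expects the Spanish Inquisition!"
slow_result = SLOW.sub('', text)
fast_result = FAST.sub('', text)
print(slow_result == fast_result)

def time_search(pattern, subject):
    """Return the seconds taken by a single search (hypothetical helper)."""
    start = time.perf_counter()
    pattern.search(subject)
    return time.perf_counter() - start

# Assumed worst case for the slow pattern: 'w:' followed by spaces, no newline.
for spaces in (8, 12, 16):
    subject = 'w:' + ' ' * spaces
    print(spaces, time_search(SLOW, subject), time_search(FAST, subject))
```

Timings will vary by machine and Python version, so no figures are given here; the point is only the trend as the number of spaces grows.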
Re: Questions about regex
Jared.S., even if a regex doesn't look like a program, it's like a small program written in a strange language. And you have to test and comment your programs. So I suggest you program in a tidier way and add unit tests (doctests may suffice here) to your regexes. You can also use the verbose mode and comment them, and you can even indent their sub-parts like pieces of a program. You must test all your bricks (in Python, not in TextMate) before using them to build something bigger.

Bye,
bearophile
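One way the advice above might look in practice: a commented verbose-mode pattern wrapped in a small function with doctests. The names WORD_LINE and strip_word_lines are invented for this sketch, and the pattern is the 'w:.*?\n' variant suggested earlier in the thread, not a verified fix for the original data.

```python
import doctest
import re

# Verbose mode lets the pattern be laid out and commented like code.
WORD_LINE = re.compile(r"""
    w:        # literal marker, presumably left over from Word markup
    .*?       # as few characters as possible, never a newline
    \n        # up to and including the end of the line
    """, re.VERBOSE)

def strip_word_lines(text):
    """Remove each 'w:' marker and everything after it up to the newline.

    >>> strip_word_lines("Nobodyw: \\n expects the Spanish Inquisition!")
    'Nobody expects the Spanish Inquisition!'
    >>> strip_word_lines("no marker here")
    'no marker here'
    """
    return WORD_LINE.sub('', text)

# Runs the examples in the docstring; prints nothing when they all pass.
doctest.run_docstring_examples(strip_word_lines,
                               {'strip_word_lines': strip_word_lines})
```

Each regex in the script could get its own small function like this, so that a slow or wrong pattern can be found by testing the pieces one at a time.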
Re: Questions about regex
On May 29, 1:26 pm, jared.s.ba...@gmail.com wrote:

> Hello,
>
> I'm new to python and I'm having problems with a regular expression. I
> use textmate as my editor and when I run the regex in textmate it
> works fine, but when I run it as part of the script it freezes. Could
> anyone help me figure out why this is happening and how to fix it.
> Here is the script:
>
> ==
> # regular expression search and replace
> import sys, os, re, string, csv
>
> #Open the file and taking its data
> myfile=open('Steve_query3.csv') #Steve_query_test.csv
> #create an error flag to loop the script twice
> #store all file's data in the string object 'text'
> myfile.seek(0)
> text = myfile.read()
>
> for i in range(2):
>     #def textParse(text, reRun):
>     print 'how many times is this getting executed', i
>
>     #Now to create the newfile 'test' and write our 'text'
>     newfile = open('Steve_query3_out.csv', 'w')
>     #open the new file and set it with 'w' for "write"
>     #loop trough 'text' clean them up and write them into the 'newfile'
>     #sub( pattern, repl, string[, count])
>     #"sub("(?i)b+", "x", " ")" returns 'x x'.
>     text = re.sub('(\<(/?[^\>]+)\>)', "", text)#remove the HTML
>     text = re.sub('//', "", text) #remove comments
>     text = re.sub('\/\*(.|\s)*?;}', "", text) #remove css formatting
>     #remove a bunch of word formatting yuck
>     text = re.sub("&nbsp;", " ", text)
>     text = re.sub("&lt;", "<", text)
>     text = re.sub("&gt;", ">", text)
>     text = re.sub("&quot;|&rquot;|&ldquo;", "\'", text)
>     #===
>     #The two following lines are the ones giving me the problems
>     text = re.sub("w:(.|\s)*?\n", "", text)
>     text = re.sub("UnhideWhenUsed=(.|\s)*?\n", "", text)
>     #===
>     text = re.sub(re.compile('^\r?\n?$', re.MULTILINE), '', text) #remove the extra whitespace
>     #now write out the new file and close it
>     newfile.write(text)
>     newfile.close()
>
>     #open the newfile and run the script again
>     #Open the file and taking its data
>
>     myfile=open('Steve_query3_out.csv') #Steve_query_test.csv
>     #store all file's data in the string object 'text'
>     myfile.seek(0)
>     text = myfile.read()
>
> Thanks for the help,
>
> -Jared

Can you give a string that you would expect the regex to match and what the expected result would be?

Currently, it looks like the interesting part of the regex, (.|\s)*?, would match any run of characters, of any length, newlines included. There seems to be some redundancy that makes it more confusing than it needs to be. As far as I know, . already matches everything that \s matches except a newline (unless re.DOTALL is used), so the alternation is mostly redundant; or maybe you just need to escape the . because you meant it to be a literal.
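The overlap between '.' and '\s' raised in the reply above can be checked directly. This is a small sketch using only single-character inputs; the variable names are chosen here for illustration.

```python
import re

newline = '\n'

# '.' matches any single character except a newline (by default).
dot_matches_space = re.match(r'.', ' ') is not None
dot_matches_newline = re.match(r'.', newline) is not None

# '\s' matches any whitespace character, newline included.
ws_matches_newline = re.match(r'\s', newline) is not None

# With re.DOTALL, '.' matches a newline too, so '\s' adds nothing.
dotall_matches_newline = re.match(r'.', newline, re.DOTALL) is not None

# An escaped dot matches only a literal '.' character.
literal_dot_matches_x = re.match(r'\.', 'x') is not None

print(dot_matches_space, dot_matches_newline, ws_matches_newline,
      dotall_matches_newline, literal_dot_matches_x)
# True False True True False
```

So in (.|\s) the only thing '\s' contributes beyond '.' is the newline; if matching across lines is really wanted, a single '.' with re.DOTALL expresses that without the overlapping alternatives.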