Re: [Tutor] RE expressions

2008-08-15 Thread Steve Willoughby

Steve Willoughby wrote:

Johan Nilsson wrote:

In [74]: p.findall('asdsa"123abc\123"jggfds')
Out[74]: ['"123abcS"']


By the way, you're confusing the use of \ in strings in general with the 
use of \ in regular expressions and the appearance of \ as a character 
in data strings encountered by your Python program.


When you write the code:

p.findall('asdsa"123abc\123"jggfds')

the character string 'asdsa"123abc\123"jggfds' contains the special code 
\123 which means "the ASCII character with the octal value 123".  That 
happens to be the letter S.  So that's the same as if you had typed:


p.findall('asdsa"123abcS"jggfds')

which may explain your results.
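
To check this for yourself, here is a small sketch; the variable names `plain` and `same` are only for illustration and do not come from your session:

```python
# In an ordinary (non-raw) string literal, \123 is an octal escape.
plain = 'asdsa"123abc\123"jggfds'
same = 'asdsa"123abcS"jggfds'

print(chr(0o123))      # the character with octal value 123
print(plain == same)   # both literals build the same string
print('\\' in plain)   # no backslash survives in the data
```

So by the time findall() sees the string, there is no backslash left in it to match.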

Using a raw string would have solved that problem.
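
A rough sketch of what the raw string changes, assuming the same test data as in your session (the names `raw` and `pattern` are mine):

```python
import re

# The r prefix stops Python from interpreting \123 as an escape,
# so the backslash stays in the data as an ordinary character.
raw = r'asdsa"123abc\123"jggfds'
pattern = re.compile(r'"\w+"')

print('\\' in raw)           # the backslash is now really there
print(pattern.findall(raw))  # \w does not match a backslash
```

With the backslash actually present, "\w+" no longer matches the quoted part, which is the problem you were describing in the first place.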
___
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor


Re: [Tutor] RE expressions

2008-08-15 Thread Steve Willoughby

Johan Nilsson wrote:

'text  "http:\123\interesting_adress\etc\etc\" more text'


Does this really use backslashes in the text?  The standard for URLs (if 
that's what it is) is to use forward slashes.


For your RE, though, you can always use [...] to specify a character 
class including whatever you like.  Remember that \ is a special symbol, 
too.  If you want to match a literal \ character, the RE for that is \\. 
Also remember to use a raw string in Python so the string-building 
syntax doesn't get confused by the backslashes too.  How about something 
along the lines of:


re.compile(r'"[a-zA-Z0-9_\\]*"')

but why constrain what may be between the quotes?

re.compile(r'"[^"]*"')

or even

re.compile('".*?"')
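
To compare the three, here is a quick sketch run against your sample line. It assumes the backslashes really are in the data, so the sample is written as a raw string; the names `strict`, `no_quotes` and `lazy` are just labels for this example:

```python
import re

# The sample line, as a raw string so the backslashes survive.
text = r'text  "http:\123\interesting_adress\etc\etc\" more text'

strict = re.compile(r'"[a-zA-Z0-9_\\]*"')   # letters, digits, _ and \
no_quotes = re.compile(r'"[^"]*"')          # anything except a quote
lazy = re.compile('".*?"')                  # shortest run up to a quote

for name, pattern in [('strict', strict),
                      ('no_quotes', no_quotes),
                      ('lazy', lazy)]:
    print(name, pattern.findall(text))
```

On this particular sample the first pattern finds nothing, because the ':' after http is not in the character class. The other two both return the whole quoted string, which is one more reason not to constrain what goes between the quotes.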



I have figured out that if it wasn't for the \ a simple
p=re.compile('\"\w+\"') would do the trick. From what I understand \w 
only covers the set [a-zA-Z0-9_] and hence not the "\".
I assume the solution is just in front of my eyes, and I have been 
looking on the screen for too long. Any hints would be appreciated.



In [72]: p=re.compile('"\w+\"')

In [73]: p.findall('asdsa"123abc123"jggfds')
Out[73]: ['"123abc123"']

In [74]: p.findall('asdsa"123abc\123"jggfds')
Out[74]: ['"123abcS"']

/Johan





[Tutor] RE expressions

2008-08-15 Thread Johan Nilsson

Hi all python experts


I am trying to work with BeautifulSoup and re and am running into one problem.

What I want to do is open a webpage and get some information. This is  
working fine.
I then want to follow some of the links on this page and process them. I  
manage to get the links that I am interested in filtered out with simple re  
expressions. My problem is that I now have a number of strings that look  
like


'text  "http:\123\interesting_adress\etc\etc\" more text'

I have figured out that if it wasn't for the \ a simple
p=re.compile('\"\w+\"') would do the trick. From what I understand \w only  
covers the set [a-zA-Z0-9_] and hence not the "\".
I assume the solution is just in front of my eyes, and I have been looking  
on the screen for too long. Any hints would be appreciated.



In [72]: p=re.compile('"\w+\"')

In [73]: p.findall('asdsa"123abc123"jggfds')
Out[73]: ['"123abc123"']

In [74]: p.findall('asdsa"123abc\123"jggfds')
Out[74]: ['"123abcS"']

/Johan
