Re: [Tutor] Extracting data from HTML files

Kent Johnson Fri, 30 Dec 2005 04:59:50 -0800

Oswaldo Martinez wrote:
> OK before I got in to the loop in the script I decided to try first with one
> file and I have some doubts with the some parts in the script,plus I got an
> error:
> 
> 
>>>>import re
>>>>file = open("file1.html")
>>>>data = file.read()
>>>>catRe = re.compile(r'<strong>Title:</strong>(.*?)<br><strong>')


This regex does not agree with the data you originally posted. Your 
original data was
<strong>Category:</strong>Category1<br><br>

Do you see the difference? Your regex has a different ending.
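
A minimal sketch of the mismatch, using only the one sample line quoted 
above. The second pattern, with the "Category:" label and the <br><br> 
ending, is an assumption based on that sample; the real files may differ.

```python
import re

# Sample line copied from the originally posted data.
data = "<strong>Category:</strong>Category1<br><br>"

# Pattern from the script: expects "Title:" and ends with <br><strong>.
title_re = re.compile(r'<strong>Title:</strong>(.*?)<br><strong>')

# Variant whose label and ending agree with the sample line
# (an assumption; adjust it to whatever the real files contain).
cat_re = re.compile(r'<strong>Category:</strong>(.*?)<br><br>')

no_match = title_re.search(data)   # None: the pattern is not in the data
match = cat_re.search(data)        # a match object

print(no_match)         # None
print(match.group(1))   # Category1
```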
> 
> 
> # I searched around the docs on regexes I have and found that the "r" #after
> the re.compile(' will detect repeating words.Why is this useful in #my case?
> I want to read the whole string even if it has repeating words.  #Also, I
> dont understand the actual regex (.*?) . If I want to match #everything
> inside </strong> and <br><strong> , shouldn`t I just put a "*"
> # ? I tried that and it  gave me an error of course.

As Danny said, the r is not part of the regex; it marks a 'raw' string, 
in which backslashes are not treated as escape characters. In this case 
it is not needed, but I always use it for regex strings out of habit.
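
A small illustration of what the r prefix changes. The \b pattern and 
the sample text here are made up for the example; they are not from 
your files.

```python
import re

# In an ordinary string, "\b" is a single backspace character.
plain = "\bcat\b"
# In a raw string the backslash is kept, so the regex engine sees \b,
# which means "word boundary".
raw = r"\bcat\b"

plain_len = len(plain)   # 5
raw_len = len(raw)       # 7

text = "a cat sat"
found_plain = re.search(plain, text)   # None: looks for backspace characters
found_raw = re.search(raw, text)       # matches "cat"

# The pattern from the script contains no backslashes, so the raw and
# ordinary spellings are the same string.
same = (r'<strong>Title:</strong>(.*?)<br><strong>'
        == '<strong>Title:</strong>(.*?)<br><strong>')
print(same)   # True
```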

The whole string is the regex, not just the (.*?) part. Most of it just 
matches against fixed text. The part in parentheses says
. match any single character (except a newline, by default)
* match 0 or more of the previous item, i.e. 0 or more of anything
? match non-greedy - match the minimum number of characters to make the 
whole match succeed. Without this, the .* could match the whole file up 
to the *last* <br><strong> which is not what you want!

The parentheses create a group which you can use to pull out the part of 
the string which matched inside them. This is the data you want.
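
To see the difference between greedy and non-greedy matching, here is a 
rough sketch. The two-record string is hypothetical, built in the style 
of the sample line, and the <br><br> ending is assumed from that sample.

```python
import re

# Hypothetical data: two records on one line.
data = ("<strong>Category:</strong>Category1<br><br>"
        "<strong>Category:</strong>Category2<br><br>")

# Greedy: .* runs as far as it can, up to the *last* <br><br>.
greedy = re.search(r'<strong>Category:</strong>(.*)<br><br>', data)
# Non-greedy: .*? stops at the *first* <br><br> that lets the match succeed.
lazy = re.search(r'<strong>Category:</strong>(.*?)<br><br>', data)

print(greedy.group(1))
# Category1<br><br><strong>Category:</strong>Category2
print(lazy.group(1))
# Category1

# findall returns the contents of the group for every match.
all_cats = re.findall(r'<strong>Category:</strong>(.*?)<br><br>', data)
print(all_cats)
# ['Category1', 'Category2']
```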

> 
> 
>>>>m = catRe.search(data)
>>>>category = m.group(1)
> 
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> AttributeError: 'NoneType' object has no attribute 'group'

In this case the match failed. search() returns None when the regex is 
not found, so m is None and m.group(1) raises the AttributeError.
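
One way to guard against a failed match is to test the result before 
calling group(). This is only a sketch; the function name 
extract_category is made up for the example, and the pattern assumes 
the <br><br> ending from the sample data.

```python
import re

# Pattern assumed from the sample line; adjust to the real files.
cat_re = re.compile(r'<strong>Category:</strong>(.*?)<br><br>')

def extract_category(data):
    """Return the captured text, or None when the pattern is not found."""
    m = cat_re.search(data)
    if m is None:
        # No match: there is no group to read, so report that instead.
        return None
    return m.group(1)

print(extract_category("<strong>Category:</strong>Category1<br><br>"))
# Category1
print(extract_category("<strong>Title:</strong>Something<br>"))
# None
```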
> 
> 
> I also found that on some of the strings I want to extract, when python
> reads them using file.read(), there are newline characters and other stuff
> that doesn`t show up in the actual html source.Do I have to take these in to
> account in the regex or will it automatically include them?

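This will only be a problem if the newlines are in the text you are 
actually trying to match. By default . does not match a newline, so a 
newline inside the part you want to capture will make the match fail.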
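
If a newline does fall inside the text you want, the re.DOTALL flag 
makes . match newlines as well. A sketch with a made-up sample in which 
the value is split across two lines:

```python
import re

# Hypothetical data: the category text contains a newline.
data = "<strong>Category:</strong>Category\nOne<br><br>"

pattern = r'<strong>Category:</strong>(.*?)<br><br>'

# Default: . stops at the newline, so the match fails.
default = re.search(pattern, data)
# With re.DOTALL, . also matches the newline.
dotall = re.search(pattern, data, re.DOTALL)

print(default)                 # None
print(repr(dotall.group(1)))   # 'Category\nOne'
```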

Kent

