Oswaldo Martinez wrote:
> OK before I got in to the loop in the script I decided to try first with one
> file and I have some doubts with the some parts in the script,plus I got an
> error:
>
>
> >>> import re
> >>> file = open("file1.html")
> >>> data = file.read()
> >>> catRe = re.compile(r'<strong>Title:</strong>(.*?)<br><strong>')
This regex does not agree with the data you originally posted. Your
original data was
<strong>Category:</strong>Category1<br><br>
Do you see the difference? Your regex ends with <br><strong>, but the
data ends with <br><br>.
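
To make the difference concrete, here is a small sketch run against the
one line of data quoted above. The variable names are mine, and the second
pattern is only my guess at what you want, based on that single line;
adjust the fixed text to whatever your real file contains.

```python
import re

# The sample line from your original post
data = '<strong>Category:</strong>Category1<br><br>'

# Your pattern: a different label, and it expects <br><strong> at the end
titleRe = re.compile(r'<strong>Title:</strong>(.*?)<br><strong>')
print(titleRe.search(data))           # no match, so this is None

# A pattern whose fixed text agrees with the sample line
catRe = re.compile(r'<strong>Category:</strong>(.*?)<br><br>')
print(catRe.search(data).group(1))    # Category1
```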
>
>
> # I searched around the docs on regexes I have and found that the "r" #after
> the re.compile(' will detect repeating words.Why is this useful in #my case?
> I want to read the whole string even if it has repeating words. #Also, I
> dont understand the actual regex (.*?) . If I want to match #everything
> inside </strong> and <br><strong> , shouldn`t I just put a "*"
> # ? I tried that and it gave me an error of course.
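As Danny said, the r is not part of the regex; it marks a 'raw' string,
in which Python leaves backslashes alone instead of treating them as
escapes. In this case it is not needed, since the pattern contains no
backslashes, but I always use it for regex strings out of habit.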
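
If it helps, here is a small illustration of what the r prefix changes.
The \b pattern and the sample text are made up for the example and have
nothing to do with your HTML file.

```python
import re

plain = '\bcat\b'    # Python turns each \b into a backspace character
raw = r'\bcat\b'     # the backslashes reach the regex engine untouched

print(len(plain))    # 5
print(len(raw))      # 7

text = 'the cat sat'
print(re.search(plain, text))        # None: text has no backspaces in it
print(re.search(raw, text).group())  # cat: \b means "word boundary" here
```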
The whole string is the regex, not just the (.*?) part. Most of it just
matches against fixed text. The part in parentheses says
  .  match any character (except a newline, by default)
  *  match 0 or more of the previous item, i.e. 0 or more of anything
  ?  make the * non-greedy: match the minimum number of characters that
     lets the whole match succeed. Without this, the .* could match
     everything up to the *last* <br><strong> in the file, which is not
     what you want!
The parentheses create a group which you can use to pull out the part of
the string that matched inside them. This is the data you want.
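
Here is a rough sketch of the greedy versus non-greedy difference. The
data is invented: I have assumed a second field after the category,
written in the same style as your sample line, just so that there are two
places where the match could end.

```python
import re

# Made-up data with two fields, in the style of your sample line
data = ('<strong>Category:</strong>Category1<br><br>'
        '<strong>Title:</strong>Title1<br><br>')

greedy = re.search(r'<strong>Category:</strong>(.*)<br><br>', data)
lazy = re.search(r'<strong>Category:</strong>(.*?)<br><br>', data)

# Greedy runs on to the last <br><br> and drags the Title field along
print(greedy.group(1))   # Category1<br><br><strong>Title:</strong>Title1

# Non-greedy stops at the first <br><br>
print(lazy.group(1))     # Category1
```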
>
>
> >>> m = catRe.search(data)
> >>> category = m.group(1)
>
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> AttributeError: 'NoneType' object has no attribute 'group'
In this case the match failed: search() returns None when it finds no
match, so m is None and m.group(1) raises the AttributeError.
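
One way to guard against this is to test the result before calling
group(). This is only a sketch: find_category is a name I made up, and the
pattern assumes the <br><br> ending from your original sample line.

```python
import re

catRe = re.compile(r'<strong>Category:</strong>(.*?)<br><br>')

def find_category(data):
    # Hypothetical helper: return the category text, or None if absent
    m = catRe.search(data)
    if m is None:
        return None
    return m.group(1)

print(find_category('<strong>Category:</strong>Category1<br><br>'))  # Category1
print(find_category('no category here'))                             # None
```

Once you put this in a loop over many files, a None result tells you
which files did not match, without stopping the script.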
>
>
> I also found that on some of the strings I want to extract, when python
> reads them using file.read(), there are newline characters and other stuff
> that doesn`t show up in the actual html source.Do I have to take these in to
> account in the regex or will it automatically include them?
This will only be a problem if the newlines are in the text you are
actually trying to match.
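
If a newline does fall inside the text you want, keep in mind that by
default . matches anything except a newline. Passing the re.DOTALL flag
lets . match newlines too. A small sketch, using made-up data with the
category split over two lines:

```python
import re

# Made-up data where the text itself is split over two lines
data = '<strong>Category:</strong>Category\nOne<br><br>'
pattern = r'<strong>Category:</strong>(.*?)<br><br>'

# Without the flag, . will not cross the newline, so nothing matches
print(re.search(pattern, data))    # None

# With re.DOTALL the newline is captured along with the rest
m = re.search(pattern, data, re.DOTALL)
print(repr(m.group(1)))            # 'Category\nOne'
```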
Kent
_______________________________________________
Tutor maillist - [email protected]
http://mail.python.org/mailman/listinfo/tutor