If the format is consistent enough, you might get away with something like:
>>> import re
>>> s = 'MediaBox [ 0 0 612 792 ]'
>>> p = re.compile(r'MediaBox \[ ?\d+ \d+ (\d+) (\d+) ?\]')
>>> print(p.search(s).groups())
('612', '792')
The important bits: ? means "0 or 1 occurrences" of whatever precedes it (here, a space), and parentheses group parts of the match. The grouped parts come back as a tuple from the match object's .groups() method, so you can match and extract what you want in one go.
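If it helps, here is a rough sketch of the same idea wrapped in a function, so that a missing MediaBox returns None instead of raising an AttributeError. The names MEDIABOX and page_size are just ones I made up for illustration, and it assumes the numbers are whole and separated by single spaces, as in your two examples.

```python
import re

# Raw string so the backslashes reach the regex engine untouched.
# " ?" allows zero or one space just inside each bracket.
MEDIABOX = re.compile(r'MediaBox \[ ?\d+ \d+ (\d+) (\d+) ?\]')

def page_size(text):
    """Return (width, height) as ints, or None if no MediaBox is found."""
    match = MEDIABOX.search(text)
    if match is None:
        return None
    width, height = match.groups()  # the two parenthesised groups
    return int(width), int(height)

# Both layouts from the question go through the same pattern.
for sample in ('MediaBox [0 0 612 792]', 'MediaBox [ 0 0 612 792 ]'):
    print(page_size(sample))
```

If your files turn out to have more than one space, or fractional values, the pattern would need loosening (for example \s* in place of " ?").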
http://www.amk.ca/python/howto/regex/ is a fairly gentle introduction to regular expressions in Python if you want to learn more.
Having said all that, you would usually use a library of some sort to access header information, although I'm not sure what Python has for PDF support. If that's -all- the information you need, and the -only- variation you'll see, a regex probably won't be too bad :)
On 10/9/05, Bill Burns <[EMAIL PROTECTED]> wrote:
I'm looking to get the size (width, length) of a PDF file. Every pdf
file has a 'tag' (in the file) that looks similar to this
Example #1
MediaBox [0 0 612 792]
or this
Example #2
MediaBox [ 0 0 612 792 ]
I figured a regex might be a good way to get this data but the
whitespace (or no whitespace) after the left bracket has me stumped.
If I do this
pattern = re.compile('MediaBox \[\d+ \d+ \d+ \d+')
I can find the MediaBox in Example #1 but I have to do this
pattern = re.compile('MediaBox \[ \d+ \d+ \d+ \d+')
to find it for Example #2.
How can I make *one* regex that will match both cases?
Thanks for the help,
Bill
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor