If the format is consistent enough, you might get away with something like:

>>> import re
>>> s = 'MediaBox [ 0 0 612 792 ]'
>>> p = re.compile(r'MediaBox \[ ?\d+ \d+ (\d+) (\d+) ?\]')
>>> print(p.search(s).groups())
('612', '792')

The important bits: ? means "0 or 1 occurrences" of whatever comes just before it (here, a space), and parentheses group parts of the match. The text captured by each group is returned in a tuple by the match object's .groups() method, so you can match and extract what you want in one go.
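As a quick check of both points, here is a minimal sketch that runs the same pattern against the two sample lines from the original question. The names MEDIABOX and page_size are made up for illustration, and it assumes the four numbers are always plain integers separated by single spaces.

```python
import re

# Same pattern as above, written as a raw string so the backslashes
# reach the regex engine untouched.
MEDIABOX = re.compile(r'MediaBox \[ ?\d+ \d+ (\d+) (\d+) ?\]')

def page_size(text):
    """Return (width, height) as ints, or None if no MediaBox is found."""
    match = MEDIABOX.search(text)
    if match is None:
        return None
    # groups() holds only the two parenthesized captures, as strings.
    width, height = match.groups()
    return int(width), int(height)

print(page_size('MediaBox [0 0 612 792]'))    # Example #1, no inner spaces
print(page_size('MediaBox [ 0 0 612 792 ]'))  # Example #2, padded with spaces
print(page_size('no box here'))               # no match at all
```

Checking for None before calling .groups() matters in practice, because search() returns None when nothing matches, and calling .groups() on that raises an AttributeError.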

http://www.amk.ca/python/howto/regex/ is a fairly gentle introduction to regular expressions in Python if you want to learn more.

Having said all that, you would usually use a library of some sort to access header information, although I'm not sure what Python has for PDF support. If that's -all- the information you need, and the -only- variation you'll see, a regex probably won't be too bad :)
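If more variation does turn up, the pattern can be loosened rather than thrown away. The following is only a sketch: whether your files actually contain tabs, repeated spaces or decimal sizes is an assumption on my part, and NUM, LOOSE_MEDIABOX and loose_page_size are hypothetical names. Like the pattern above, it simply takes the third and fourth numbers as width and height.

```python
import re

# Hypothetical looser pattern: \s* allows any amount of whitespace
# (including none), \s+ requires at least one whitespace character
# between numbers, and NUM also accepts an optional decimal part.
NUM = r'\d+(?:\.\d+)?'
LOOSE_MEDIABOX = re.compile(
    r'MediaBox\s*\[\s*' + NUM + r'\s+' + NUM +
    r'\s+(' + NUM + r')\s+(' + NUM + r')\s*\]'
)

def loose_page_size(text):
    """Return (width, height) as floats, or None if nothing matches."""
    match = LOOSE_MEDIABOX.search(text)
    if match is None:
        return None
    return tuple(float(g) for g in match.groups())

print(loose_page_size('MediaBox [0 0 612 792]'))
print(loose_page_size('MediaBox[ 0  0\t595.28 841.89 ]'))
```

The (?:...) form groups the decimal part without capturing it, so .groups() still returns just the two values of interest.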


On 10/9/05, Bill Burns <[EMAIL PROTECTED]> wrote:
I'm looking to get the size (width, length) of a PDF file. Every pdf
file has a 'tag' (in the file) that looks similar to this

Example #1
MediaBox [0 0 612 792]

or this

Example #2
MediaBox [ 0 0 612 792 ]

I figured a regex might be a good way to get this data but the
whitespace (or no whitespace) after the left bracket has me stumped.

If I do this

pattern = re.compile('MediaBox \[\d+ \d+ \d+ \d+')

I can find the MediaBox in Example #1 but I have to do this

pattern = re.compile('MediaBox \[ \d+ \d+ \d+ \d+')

to find it for Example #2.

How can I make *one* regex that will match both cases?

Thanks for the help,

Bill

_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
