[Andrew]
> If the format is consistent enough, you might get away with something
> like:
>
> >>> p = re.compile('MediaBox \[ ?\d+ \d+ (\d+) (\d+) ?\]')
> >>> print p.search(s).groups()
> ('612', '792')
>
> The important bits being: ? means "0 or 1 occurrences", and you can use
> parentheses to group matches, and they get put into the tuple returned
> by the .groups() function. So you can match and extract what you want in
> one go.
>
> http://www.amk.ca/python/howto/regex/ is a fairly gentle introduction to
> regular expressions in Python if you want to learn more.
>
> Having said all that, usually you would use a library of some sort to
> access header information, although I'm not sure what Python has for PDF
> support, and if that's -all- the information you need, and the -only-
> variation you'll see, regex probably won't be too bad :)
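
For anyone following along, here is a minimal sketch of the group-matching approach quoted above, written in Python 3 syntax. The sample strings and the page_size helper are made up for illustration and are not from the original exchange. It also assumes, as the quoted pattern does, that the MediaBox values are whole numbers and that the last two are the width and height.

```python
import re

# Hypothetical sample lines; the only variation assumed is the optional
# space just inside the square brackets.
SAMPLES = [
    "/MediaBox [ 0 0 612 792 ]",
    "/MediaBox [0 0 612 792]",
]

# Raw string so the backslashes reach the regex engine unchanged.
# " ?" matches zero or one space; each "(\d+)" is a captured group.
MEDIABOX = re.compile(r"MediaBox \[ ?\d+ \d+ (\d+) (\d+) ?\]")


def page_size(text):
    """Return (width, height) as ints, or None if no MediaBox matches."""
    match = MEDIABOX.search(text)
    if match is None:
        return None
    # groups() returns the captured strings as a tuple, in order.
    width, height = match.groups()
    return int(width), int(height)


for sample in SAMPLES:
    print(page_size(sample))
```

Checking for None before calling .groups() avoids an AttributeError when a line has no MediaBox entry in the expected form.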

Thanks, Andrew! Yes, the format is consistent (I believe the whitespace I
mentioned is the only difference you may find). I'll take a look at your
use of group matches tonight; it looks like a really easy way to return
the two numbers I need.

Yeah, I was hoping to find a Python PDF library that could do this, but
things seem a little sparse in this area. The only info I need is the PDF
size, and it's consistently located (and tagged) in the MediaBox, so I
figured it was a good way to get the data.

Bill
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor