Wow! The people producing these PDFs need a SERIOUS lesson in proper PDF production. These files are wasting A LOT of space because of what they are doing.
There is no magic here - just stupidity. Each of the spreads is DUPLICATED in the PDF and then cropped (CropBox != MediaBox) to the right or left accordingly. That's why each copy renders as a single page: the CropBox is what defines the viewable area.

Apparently the commands you are using with ImageMagick aren't respecting that CropBox. Start by making sure you are current with IM and also Ghostscript.

Leonard

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Michael Howard
Sent: Wednesday, February 02, 2011 1:08 PM
To: [email protected]
Subject: [poppler] page sequence + page spreads in PDF

My questions are intended for the poppler / PDF gurus. They aren't really poppler questions, but relate to the sequence of pages and page spreads in PDF files. I have googled and have read through the PDF reference 1.4, but haven't found anything to answer my questions.

BACKGROUND

I have a relatively large set of PDF files for magazines. These are the PDF files that were sent to the print shop for printing. We want to extract .jpg images of the pages and the text on the pages. I am using ImageMagick convert (Ghostscript) to generate the images and poppler pdftotext to extract the text.

Most of the pages of the magazines are (more-or-less) 8.5x11 portrait pages. However, some pages are 11x17 landscape "spreads" of two facing pages. In some cases, the outside covers and inside covers are in 2-page "spreads". That is, the front cover + spine + rear cover are all on a single PDF page. This is understandable since this is the way that the paper magazines were printed.
SAMPLE FILES

In the file http://cdn.uforlife.com/public/TLN200806.pdf the first two "pages" are spreads of the outside covers and inside covers:

[mth@localhost ~]$ identify TLN200806.pdf | head
TLN200806.pdf[0] PDF 1214x828 1214x828+0+0 16-bit Bilevel DirectClass 4.03MB 0.220u 0:00.210
TLN200806.pdf[1] PDF 1214x828 1214x828+0+0 16-bit Bilevel DirectClass 4.03MB 0.210u 0:00.210
TLN200806.pdf[2] PDF 603x828 603x828+0+0 16-bit Bilevel DirectClass 4.03MB 0.200u 0:00.199
TLN200806.pdf[3] PDF 603x828 603x828+0+0 16-bit Bilevel DirectClass 4.03MB 0.200u 0:00.199

In the file http://cdn.uforlife.com/public/TLN200812.pdf only the first "page" is a spread of the outside covers:

[mth@localhost ~]$ identify TLN200812.pdf | head
TLN200812.pdf[0] PDF 1214x828 1214x828+0+0 16-bit Bilevel DirectClass 4.657MB 0.250u 0:00.250
TLN200812.pdf[1] PDF 603x828 603x828+0+0 16-bit Bilevel DirectClass 4.657MB 0.240u 0:00.240
TLN200812.pdf[2] PDF 603x828 603x828+0+0 16-bit Bilevel DirectClass 4.657MB 0.240u 0:00.240
TLN200812.pdf[3] PDF 603x828 603x828+0+0 16-bit Bilevel DirectClass 4.657MB 0.240u 0:00.240

COVER SPREAD / PAGE SEQUENCE QUESTION

Given files with the back cover & front cover on facing spreads, I have observed that both Acrobat Reader and Evince properly split the spreads at the beginning and end. So, when one looks in these viewers one properly sees the outside front cover at the beginning and the outside back cover at the end. Note that in the case of http://cdn.uforlife.com/public/TLN200806.pdf the inside covers are also properly split. I need to do a similar thing in order to generate correctly sequenced .jpg images of the pages ...

Q: What attribute / tag / characteristic is in the .pdf file that tells a renderer to split the first page into two pages and insert them at different places in the sequence?

The images in the spreads contain the "spine" of the magazine too. I can see this if I use ImageMagick convert to generate the .jpg images.
Yet, this is not shown in Evince or Acrobat Reader ...

Q: What attribute enables the "spine" of the book to be cut out and not displayed as part of either the front cover or the back cover?

EMBEDDED SPREAD QUESTION

In both of these sample files there are embedded two-page spreads. In the printed book these spreads span two facing pages. In the file http://cdn.uforlife.com/public/TLN200806.pdf they are displayed in Evince & Acrobat Reader on pages 20 & 37. Note that in this case neither Evince nor Acrobat Reader recognizes that these are two-page spreads. Rather, both viewers treat these spreads as a single page. Even in Dual / Two Page view, both programs show another page alongside. Given that the covers & inside covers are handled 'correctly', I am somewhat surprised that the embedded pages are not handled a little better ...

Q: Why are these facing-page "spreads" not identified as two individual pages by Evince / Acrobat Reader?

Any other advice from the poppler / PDF gurus would be greatly appreciated.

Thanks,
Michael

_______________________________________________
poppler mailing list
[email protected]
http://lists.freedesktop.org/mailman/listinfo/poppler
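
To make the CropBox point from the reply concrete, here is a minimal Python sketch of what a viewer effectively does with a duplicated spread: it renders the full MediaBox and shows only the CropBox region. The box coordinates below are hypothetical, chosen only to match the 1214x828 and 603x828 sizes reported by identify (an 8-point spine is assumed); the real values have to be read from the PDF itself. The function name crop_pixels is also just an illustration. If the tool chain renders the whole MediaBox, the same rectangle can be cut out of the image afterwards. Ghostscript's -dUseCropBox switch and ImageMagick's "-define pdf:use-cropbox=true" are the usual ways to ask for the CropBox directly, though whether they behave as expected in the versions used here is an assumption.

```python
from typing import Tuple

# A PDF box is (llx, lly, urx, ury) in points, origin at the bottom-left.
Box = Tuple[float, float, float, float]

# Hypothetical boxes for one cover spread (assumed, not read from the files).
MEDIA: Box = (0, 0, 1214, 828)          # full spread: back + spine + front
LEFT_CROP: Box = (0, 0, 603, 828)       # copy of the page cropped to the left half
RIGHT_CROP: Box = (611, 0, 1214, 828)   # copy of the page cropped to the right half


def crop_pixels(media_box: Box, crop_box: Box, dpi: float = 72.0) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) pixel bounds of crop_box within a
    raster of the whole media_box rendered at the given dpi.

    Image coordinates have their origin at the top-left, so the y axis is
    flipped relative to PDF coordinates.
    """
    m_llx, m_lly, m_urx, m_ury = media_box

    # The effective CropBox is its intersection with the MediaBox.
    llx = max(crop_box[0], m_llx)
    lly = max(crop_box[1], m_lly)
    urx = min(crop_box[2], m_urx)
    ury = min(crop_box[3], m_ury)
    if llx >= urx or lly >= ury:
        raise ValueError("CropBox does not overlap MediaBox")

    scale = dpi / 72.0  # one PDF point is 1/72 inch
    left = round((llx - m_llx) * scale)
    right = round((urx - m_llx) * scale)
    top = round((m_ury - ury) * scale)
    bottom = round((m_ury - lly) * scale)
    return left, top, right, bottom


if __name__ == "__main__":
    print("left page :", crop_pixels(MEDIA, LEFT_CROP))
    print("right page:", crop_pixels(MEDIA, RIGHT_CROP))
```

With these assumed boxes, the spine falls between x = 603 and x = 611 and belongs to neither cropped copy, which would explain why the viewers never show it. Because the spread appears twice in the page list, once per CropBox, the page order seen in the viewers is simply the order of those cropped copies.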
