Hi Toël,

Thanks for your reply, but I guess my question is more about the PDF file. Is my code extracting the image from page 2 pixel-perfect, or is it resampling the page?
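
For what it's worth, the figures in your message can be cross-checked with plain arithmetic. The sketch below is only that: the class and method names are hypothetical (nothing here is PDFBox API), and it assumes the default PDF user space of 72 units per inch and that the "head" crops in both PNGs cover the same area.

```java
// Hypothetical helper, not part of PDFBox: arithmetic on the numbers quoted in this thread.
class DpiCheck {
    // Assumption: default PDF user space, 72 units per inch.
    static final double UNITS_PER_INCH = 72.0;

    // Effective dpi of an image with rawPixels samples drawn across displayedUnits of user space.
    static double effectiveDpi(int rawPixels, double displayedUnits) {
        return rawPixels / (displayedUnits / UNITS_PER_INCH);
    }

    // dpi implied for a copy, given the same feature measured in the reference image and in the copy.
    static double impliedDpi(double referenceDpi, int referencePixels, int copyPixels) {
        return referenceDpi * copyPixels / referencePixels;
    }

    public static void main(String[] args) {
        // Raw image size 2560 x 3300 pixels, displayed size 614.4 x 792.0 user space units.
        System.out.printf("PDF page image: %.1f x %.1f dpi%n",
                effectiveDpi(2560, 614.4), effectiveDpi(3300, 792.0));
        // Head crop: 1722 x 1593 in EzFQJ9v.png, 1331 x 1231 in the Google PNG.
        System.out.printf("Google copy, implied: %.1f / %.1f dpi%n",
                impliedDpi(300.0, 1722, 1331), impliedDpi(300.0, 1593, 1231));
    }
}
```

If those assumptions hold, the page image works out to 300 dpi in both directions, and the head measurements put the Google copy at roughly 232 dpi, which would be consistent with a downscaled copy rather than with my extraction being upsampled.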

On Fri, Mar 11, 2016 at 1:06 AM, Hartmann Toël <[email protected]> wrote:

> Hi,
>
> The dpi information embedded in the image is 300 for EzFQJ9v.png but on
> US08000000-20110816-D00001.png it is 72.
> I extracted the image of the head only from both the PNGs and get two
> different pixel sizes:
>
> the head in EzFQJ9v.png is 1722x1593, the head in
> US08000000-20110816-D00001.png is 1331x1231.
>
> I would say that Google has resized the image and changed the dpi info to 72.
>
> The image info for the PDF page is:
> position in PDF = -1.2, 0.0 in user space units
> raw image size = 2560, 3300 in pixels
> displayed size = 614.4, 792.0 in user space units
> displayed size = 8.533334, 11.0 in inches
> displayed size = 216.74667, 279.4 in millimeters
> dpi = 300 dpi (X), 300 dpi (Y)
>
> /Toël
>
> On 11 mar 2016, at 09:14, Vince Harron <[email protected]> wrote:
>
> > Here is the original patent from the US Patent and Trademark Office:
> >
> > http://pimg-fpiw.uspto.gov/fdd/00/000/080/0.pdf
> >
> > I'm extracting images as follows:
> >
> > List<PDPage> list = document.getDocumentCatalog().getAllPages();
> >
> > String fileName = srcPdfFile.getName().replace(".pdf", "_cover");
> > int imageNumber = 0;
> > for (PDPage page : list) {
> >     PDResources pdResources = page.getResources();
> >
> >     Map pageImages = pdResources.getImages();
> >     if (pageImages != null) {
> >
> >         Iterator imageIter = pageImages.keySet().iterator();
> >         while (imageIter.hasNext()) {
> >             String key = (String) imageIter.next();
> >             PDXObjectImage pdxObjectImage = (PDXObjectImage) pageImages.get(key);
> >             pdxObjectImage.write2file(srcPdfFile.getAbsolutePath().replace(".pdf",
> >                     String.format("-D%05d.png", imageNumber)));
> >             imageNumber++;
> >         }
> >     }
> > }
> >
> > The image I extract from page 2 looks like this:
> > http://i.imgur.com/EzFQJ9v.png
> > 2560x3300 (300dpi)
> >
> > Here is the same image from Google Patents:
> >
> > https://patentimages.storage.googleapis.com/US8000000B2/US08000000-20110816-D00001.png
> > It's only 1446 × 2037 (~224dpi).
> >
> > The Google image is cropped a bit compared to the PDF page. When I trim
> > my PDF page image down to match the same area as the Google image, my
> > extracted image (1934 × 2550) is still much higher resolution than the
> > Google image.
> >
> > Assumption 1) Google is using the same data source as me (PDF)
> > Assumption 2) Google wouldn't downscale technical diagrams in patents
> > because they might lose important detail
> >
> > If my assumptions are correct, I must be extracting the image incorrectly,
> > upsampling the ~224dpi image to 300dpi. Is that what's happening?
> >
> > Thanks,
> >
> > Vince

