Unfortunately, I am not the creator of the PDF documents I need to extract from. The images will come in whatever format they come in.
I see exactly what you describe - blues become pink and the palette is 'sort of reversed'. But inverting the palette (in an image editor) doesn't quite fix it. Unfortunately, this is also not my area of expertise so I'm struggling too. I just don't have the luxury to choose which type of image format gets used.

-mel

-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Tuesday, September 22, 2009 11:54 AM
To: [email protected]
Subject: RE: Extracting Images

I noticed that an image with an indexed palette (tested with BMP, PNG) did not look right after encrypting the PDF. The colors were switched around. I remember that blue became pink, but it wasn't a straight inverse. Writing out the same PDF without encryption worked fine. If RGB is used, it'll work fine whether encrypted or not (tested with PNG).

This doesn't seem to be the same thing you are describing, but it could be related. I have neither the time nor the expertise to look into that one, so my solution was to use RGB images.

--Adam

"Martinez, Mel" <[email protected]>
09/22/2009 06:52
Please respond to [email protected]
To "[email protected]" <[email protected]>
cc
Subject RE: Extracting Images

Thanks, Alex.

Unfortunately, that (ExtractImages) is the first place I looked when I started this. It basically uses the first technique below (Image.write2File(String)). The problem I described also happens with the ExtractImages class. It also happens with PDF2Image, which converts each whole page to an image. Within each page image, the embedded photos all have their colors screwed up.

I've tried this with several PDF input files and it happens with every color photo image. Line art (even if rasterized and embedded as jpeg) and B&W images are fine.

I think there is something wrong with how PDFBox is extracting the images. Is no one else seeing this? I'm on a Windows XP PRO (64bit) machine.
-mel

-----Original Message-----
From: Alex Shvartz [mailto:[email protected]]
Sent: Monday, September 21, 2009 7:20 PM
To: [email protected]
Subject: RE: Extracting Images

Hi,

Please have a look at the org.apache.pdfbox.ExtractImages class. In the extractImages() method there is a good explanation of how to extract an image from a PDF file and save it.

Best Regards.
Alex.

--- On Mon, 9/21/09, Martinez, Mel <[email protected]> wrote:

From: Martinez, Mel <[email protected]>
Subject: RE: Extracting Images
To: "[email protected]" <[email protected]>
Date: Monday, September 21, 2009, 3:31 PM

Ugh! I'm crying uncle! I obviously need help (in more ways than one!).

If ANYBODY has some experience with extracting jpeg images from PDF files using PDFBox, I'd appreciate a few pointers.

I've started with the basics (lotsa null checks & junk removed):

    PDPage page = ....
    PDResources resources = page.getResources();
    Map<String, PDXObjectImage> images = resources.getImages();

So far, so good. Some null & empty tests then ...

    PDXObjectImage image = images.get(key);

At this point, I've tried several things. I've tried just letting the image class write itself out:

    image.write2File(fname); // where fname does not include the suffix

I've also tried rebuilding the image object from pieces like so:

    BufferedImage bi = image.getRGBImage();
    int bpc = image.getBitsPerComponent();
    PDColorSpace cspace = image.getColorSpace();
    ...
    WritableRaster srcRaster = bi.getRaster();
    ...
    ColorModel cm = cspace.createColorModel(bpc);
    int h = image.getHeight();
    int w = image.getWidth();
    WritableRaster raster = cm.createCompatibleWritableRaster(w, h);
    raster.setRect(srcRaster);
    bi = new BufferedImage(cm, raster, false, null);
    ImageIO.write(bi, format, new File(fname + "." + format));

This second method has the advantage of allowing you to write out to a different format, though some conversions crash it or look like garbage.
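A note on the second approach above: WritableRaster.setRect copies sample values as they are and performs no color conversion, so it can only give sensible colors when the source and destination rasters share the same layout. A minimal sketch of a different route, using only java.awt and javax.imageio (no PDFBox calls), is to draw the decoded image into a fresh TYPE_INT_RGB image and let Java 2D convert through the source color model before writing. The names RgbNormalizer, toRgb and writeAs are made up for illustration, and nothing here is confirmed as a fix for the color problem described in this thread.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.awt.image.IndexColorModel;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of PDFBox.
class RgbNormalizer {

    /** Returns a TYPE_INT_RGB copy of src; Java 2D converts colors via src's color model. */
    public static BufferedImage toRgb(BufferedImage src) {
        BufferedImage rgb = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        try {
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return rgb;
    }

    /** Writes an RGB copy of src; returns false if no ImageIO writer handles the format. */
    public static boolean writeAs(BufferedImage src, String format, File target)
            throws IOException {
        return ImageIO.write(toRgb(src), format, target);
    }

    public static void main(String[] args) {
        // Two-entry palette: index 0 = white, index 1 = blue.
        byte[] r = {(byte) 255, 0};
        byte[] g = {(byte) 255, 0};
        byte[] b = {(byte) 255, (byte) 255};
        IndexColorModel palette = new IndexColorModel(8, 2, r, g, b);

        BufferedImage indexed =
                new BufferedImage(2, 1, BufferedImage.TYPE_BYTE_INDEXED, palette);
        indexed.getRaster().setSample(1, 0, 0, 1); // pixel (1,0) uses palette index 1

        BufferedImage rgb = toRgb(indexed);
        System.out.printf("pixel (1,0) as RGB: %06X%n", rgb.getRGB(1, 0) & 0xFFFFFF);
    }
}
```

Assuming getRGBImage() returns a correctly decoded image, something like RgbNormalizer.writeAs(image.getRGBImage(), "png", new File(fname + ".png")) would sidestep the manual raster copy. If the decoded image is already wrong, this conversion will faithfully preserve the wrong colors.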
In general, both methods 'work' in that they extract the image and write it out to a file that can then be opened and displayed with any image viewer (or a web browser).

The problem is, the colors in the resulting image are simply off. Way off.

JPEG & BMP color photo images look about the same, though the color palettes are sometimes off in different ways.

JPEG & BMP Black & white images and line art (even color) generally look fine.

TIFF images and PNG images look completely messed up. Often turning into black rectangles or random color bands. They also tend to blow up the second code.

Does anybody have a clue about this stuff?

Thanks in advance,

Mel

Dr. Mel Martinez
[email protected]

-----Original Message-----
From: Daniel Wilson [mailto:[email protected]]
Sent: Tuesday, September 15, 2009 7:33 PM
To: [email protected]
Subject: Re: Extracting Images

I've done battle with the PDXObjectImage, but it has usually defeated me!

Sections 4.7 and 4.8 of the PDF spec address it.

Daniel

On Tue, Sep 15, 2009 at 6:01 PM, Martinez, Mel <[email protected]> wrote:

> I've been playing with extracting images.
>
> I've found a few 'wierdnesses' (I know, that's not a real word) in the org.apache.pdfbox.ExtractText class and If I can clear some time, I'll try to submit something on that.
>
> Ignoring the 'wierdnesses' (which have more to do with options parsing and filenaming), it does successfully extract images to separate files.
>
> However, the color table is apparently not being handled properly.
>
> All the images end up displaying with the default Windows palette, which tells me that they probably are missing their own.
>
> I assume that what probably needs to be done is that the color space needs to be rebuilt and reset on each image object prior to writing the image out to file, but I'm not entirely certain how to proceed with that.
>
> Does anybody have any familiarity with the PDXObjectImage and its related APIs?
>
> If someone can point me in the right direction, I don't mind doing the work of fixing this.
>
> Mel
>
> Dr. Mel Martinez
> [email protected]
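
On the color-table point raised in the original message: with indexed images, each pixel stores only an index, so the displayed color depends entirely on which palette is attached. The following stdlib-only Java sketch illustrates that effect by showing the same index data under two palettes. PaletteDemo, palette and withPalette are hypothetical names, the palette values are invented for the demonstration, and this is an illustration of the general mechanism rather than a diagnosis of what PDFBox does.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBuffer;
import java.awt.image.IndexColorModel;
import java.awt.image.WritableRaster;

// Hypothetical demo class, not part of PDFBox.
class PaletteDemo {

    /** Builds an opaque 8-bit palette from 0xRRGGBB entries. */
    public static IndexColorModel palette(int... rgbEntries) {
        return new IndexColorModel(
                8, rgbEntries.length, rgbEntries, 0, false, -1, DataBuffer.TYPE_BYTE);
    }

    /** Reinterprets a copy of src's index data with another palette; indices are unchanged. */
    public static BufferedImage withPalette(BufferedImage src, IndexColorModel other) {
        WritableRaster copy = src.copyData(null);
        return new BufferedImage(other, copy, false, null);
    }

    public static void main(String[] args) {
        IndexColorModel own = palette(0xFFFFFF, 0x0000FF);      // white, blue
        IndexColorModel fallback = palette(0x000000, 0xFF00FF); // black, magenta

        BufferedImage img = new BufferedImage(1, 1, BufferedImage.TYPE_BYTE_INDEXED, own);
        img.getRaster().setSample(0, 0, 0, 1); // the single pixel uses index 1

        BufferedImage swapped = withPalette(img, fallback);
        System.out.printf("own palette:   %06X%n", img.getRGB(0, 0) & 0xFFFFFF);
        System.out.printf("other palette: %06X%n", swapped.getRGB(0, 0) & 0xFFFFFF);
    }
}
```

If an extracted image really has lost its own palette, the pixel indices would still be intact, so reattaching the right color table (as the original message suggests) should in principle restore the colors.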
