Thanks, Alex. Unfortunately, that (ExtractImages) is the first place I looked when I started this.
It basically uses the first technique below (image.write2file(String)).

The problem I described also happens with the ExtractImages class. It also happens with PDF2Image, which converts each whole page to an image. Within each page image, the embedded photos have their colors all screwed up.

I've tried this with several PDF input files and it happens with every color photo image. Line art (even if rasterized and embedded as jpeg) and B&W images are fine.

I think there is something wrong with how PDFBox is extracting the images. Is no one else seeing this?

I'm on a Windows XP PRO (64bit) machine.

-mel

-----Original Message-----
From: Alex Shvartz [mailto:[email protected]]
Sent: Monday, September 21, 2009 7:20 PM
To: [email protected]
Subject: RE: Extracting Images

Hi,

Please have a look at the org.apache.pdfbox.ExtractImages class. The extractImages() method has a good explanation of how to extract an image from a PDF file and save it.

Best Regards.
Alex.

--- On Mon, 9/21/09, Martinez, Mel <[email protected]> wrote:

From: Martinez, Mel <[email protected]>
Subject: RE: Extracting Images
To: "[email protected]" <[email protected]>
Date: Monday, September 21, 2009, 3:31 PM

Ugh! I'm crying uncle! I obviously need help (in more ways than one!).

If ANYBODY has some experience with extracting jpeg images from PDF files using PDFBox, I'd appreciate a few pointers.

I've started with the basics (lotsa null checks & junk removed):

    PDPage page = ....
    PDResources resources = page.getResources();
    Map<String, PDXObjectImage> images = resources.getImages();

So far, so good. Some null & empty tests then ...

    PDXObjectImage image = images.get(key);

At this point, I've tried several things.
I've tried just letting the image class write itself out:

    image.write2file(fname); // where fname does not include the suffix

I've also tried rebuilding the image object from pieces like so:

    BufferedImage bi = image.getRGBImage();
    int bpc = image.getBitsPerComponent();
    PDColorSpace cspace = image.getColorSpace();
    ...
    WritableRaster srcRaster = bi.getRaster();
    ...
    ColorModel cm = cspace.createColorModel(bpc);
    int h = image.getHeight();
    int w = image.getWidth();
    WritableRaster raster = cm.createCompatibleWritableRaster(w, h);
    raster.setRect(srcRaster);
    bi = new BufferedImage(cm, raster, false, null);
    ImageIO.write(bi, format, new File(fname + "." + format));

This second method has the advantage of allowing you to write out to a different format, though some conversions crash it or look like garbage.

In general, both methods 'work' in that they extract the image and write it out to a file that can then be opened and displayed with any image viewer (or a web browser).

The problem is, the colors in the resulting image are simply off. Way off.

JPEG & BMP color photo images look about the same, though the color palettes are sometimes off in different ways. JPEG & BMP black & white images and line art (even color) generally look fine.

TIFF and PNG images look completely messed up, often turning into black rectangles or random color bands. They also tend to crash the second snippet.

Does anybody have a clue about this stuff?

Thanks in advance,

Mel

Dr. Mel Martinez
[email protected]

-----Original Message-----
From: Daniel Wilson [mailto:[email protected]]
Sent: Tuesday, September 15, 2009 7:33 PM
To: [email protected]
Subject: Re: Extracting Images

I've done battle with the PDXObjectImage, but it has usually defeated me!

Sections 4.7 and 4.8 of the PDF spec address it.

Daniel

On Tue, Sep 15, 2009 at 6:01 PM, Martinez, Mel <[email protected]> wrote:

> I've been playing with extracting images.
>
> I've found a few 'wierdnesses' (I know, that's not a real word) in the
> org.apache.pdfbox.ExtractImages class and if I can clear some time, I'll try
> to submit something on that.
>
> Ignoring the 'wierdnesses' (which have more to do with options parsing and
> file naming), it does successfully extract images to separate files.
>
> However, the color table is apparently not being handled properly.
>
> All the images end up displaying with the default Windows palette, which
> tells me that they are probably missing their own.
>
> I assume that what probably needs to be done is that the color space needs
> to be rebuilt and reset on each image object prior to writing the image out
> to file, but I'm not entirely certain how to proceed with that.
>
> Does anybody have any familiarity with the PDXObjectImage and its related
> APIs?
>
> If someone can point me in the right direction, I don't mind doing the work
> of fixing this.
>
> Mel
>
> Dr. Mel Martinez
> [email protected]
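
A minimal sketch of one way to narrow the problem down, using only the JDK and no PDFBox calls. It redraws an already decoded BufferedImage into a plain TYPE_INT_RGB image before handing it to ImageIO, so the image writer never sees a custom ColorModel or palette. The names RgbNormalizer, toRgb and writeRgb are hypothetical and do not come from the thread or from PDFBox. It assumes the BufferedImage (for example, the one returned by image.getRGBImage() above) was decoded correctly; if the colors are already wrong at that point, this step will not repair them, but it can help tell a decoding problem apart from a writing problem.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of PDFBox.
final class RgbNormalizer {

    private RgbNormalizer() {
    }

    // Redraws src into a new 8-bit-per-channel RGB image.
    // Java 2D converts pixels through the source image's own ColorModel
    // while drawing, so the result has no palette or custom color space.
    public static BufferedImage toRgb(BufferedImage src) {
        BufferedImage dst = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    // Normalizes src to RGB and writes it in the given ImageIO format
    // name (for example "png" or "bmp").
    public static void writeRgb(BufferedImage src, String format, File out)
            throws IOException {
        // ImageIO.write returns false when no writer supports the format.
        if (!ImageIO.write(toRgb(src), format, out)) {
            throw new IOException("No ImageIO writer found for format: " + format);
        }
    }
}
```

With the snippets above, a call might look like RgbNormalizer.writeRgb(image.getRGBImage(), "png", new File(fname + ".png")). If the PNG written this way still shows the wrong colors, the fault is more likely upstream of the writing step.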
