Ugh!  I'm crying uncle!  I obviously need help (in more ways than one!).

If ANYBODY has some experience with extracting JPEG images from PDF files using 
PDFBox, I'd appreciate a few pointers.

I've started with the basics (lotsa null checks & junk removed):

    PDPage page = ....
    PDResources resources = page.getResources();
    Map<String, PDXObjectImage> images = resources.getImages();

So far, so good.  Some null & empty tests, then:
    ...
    PDXObjectImage image = images.get(key);

At this point, I've tried several things.  I've tried just letting the image 
class write itself out:

    image.write2file(fname); //where fname does not include the suffix

I've also tried rebuilding the image object from pieces like so:

    BufferedImage bi = image.getRGBImage();
    int bpc = image.getBitsPerComponent();
    PDColorSpace cspace = image.getColorSpace();
    ...
    WritableRaster srcRaster = bi.getRaster();
    ...
    ColorModel cm = cspace.createColorModel(bpc);
    int h = image.getHeight();
    int w = image.getWidth();
    WritableRaster raster = cm.createCompatibleWritableRaster(w,h);
    raster.setRect(srcRaster);
    bi = new BufferedImage(cm,raster,false,null);
    ImageIO.write(bi,format,new File(fname+"."+format));

This second method has the advantage of allowing you to write out to a 
different format, though some conversions crash it or look like garbage.
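
For what it's worth, here is a minimal sketch of a variation on that second 
method, using only the JDK: instead of rebuilding the color model by hand, 
redraw the decoded image into a plain TYPE_INT_RGB BufferedImage before calling 
ImageIO.write.  The names RgbWriter, toRgb and writeAs are made up for 
illustration, and it assumes the BufferedImage coming back from getRGBImage() 
already has sane colors, which may well not be the case here.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper, not part of PDFBox.
public class RgbWriter {

    /** Redraws src into a TYPE_INT_RGB image so the writer sees a standard color model. */
    public static BufferedImage toRgb(BufferedImage src) {
        BufferedImage dst = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            g.drawImage(src, 0, 0, null); // Java2D converts between color models here
        } finally {
            g.dispose();
        }
        return dst;
    }

    /** Writes the image as fname.format; returns false if no ImageIO writer accepts it. */
    public static boolean writeAs(BufferedImage src, String format, String fname)
            throws IOException {
        return ImageIO.write(toRgb(src), format, new File(fname + "." + format));
    }
}
```

Usage would be something like RgbWriter.writeAs(bi, "png", fname).  If the 
colors are already wrong in bi, this will faithfully write out the wrong colors, 
so it only helps with the crashes and garbage caused by odd color models.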

In general, both methods 'work' in that they extract the image and write it out 
to a file that can then be opened and displayed with any image viewer (or a web 
browser).  The problem is, the colors in the resulting image are simply off.  
Way off.

JPEG & BMP color photos come out about the same, though their color palettes 
are sometimes off in different ways.
JPEG & BMP black & white images and line art (even in color) generally look fine.
TIFF and PNG images look completely messed up, often turning into black 
rectangles or random color bands.  They also tend to crash the second method.
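
One possible culprit for the off-color photos, and this is a guess rather than 
anything confirmed on this list: JPEGs embedded in PDFs are sometimes 
4-component CMYK/YCCK data written by Adobe tools, and RGB-oriented decoders 
tend to get those wrong.  Below is a minimal JDK-only sketch for checking that 
on the raw JPEG bytes, however you get hold of them.  JpegInfo and inspect are 
hypothetical names, not part of PDFBox; the code only walks the marker segments 
looking for a start-of-frame component count and an Adobe APP14 segment.

```java
// Hypothetical diagnostic helper, not part of PDFBox.
public class JpegInfo {
    public final int components;      // component count from the SOF segment, or -1 if none found
    public final boolean adobeMarker; // true if an APP14 segment tagged "Adobe" was seen

    private JpegInfo(int components, boolean adobeMarker) {
        this.components = components;
        this.adobeMarker = adobeMarker;
    }

    public static JpegInfo inspect(byte[] data) {
        int components = -1;
        boolean adobe = false;
        // A JPEG stream must start with the SOI marker FF D8.
        if (data.length < 4 || (data[0] & 0xFF) != 0xFF || (data[1] & 0xFF) != 0xD8) {
            return new JpegInfo(-1, false);
        }
        int pos = 2;
        while (pos + 4 <= data.length) {
            if ((data[pos] & 0xFF) != 0xFF) break;          // lost marker sync, give up
            int marker = data[pos + 1] & 0xFF;
            if (marker == 0xFF) { pos++; continue; }         // fill byte before a marker
            if (marker == 0x01 || (marker >= 0xD0 && marker <= 0xD7)) {
                pos += 2;                                    // standalone marker, no length
                continue;
            }
            if (marker == 0xDA || marker == 0xD9) break;     // start of scan or end of image
            int len = ((data[pos + 2] & 0xFF) << 8) | (data[pos + 3] & 0xFF);
            if (len < 2 || pos + 2 + len > data.length) break; // truncated segment
            boolean isSof = marker >= 0xC0 && marker <= 0xCF
                    && marker != 0xC4 && marker != 0xC8 && marker != 0xCC;
            if (isSof && len >= 8) {
                // Layout: length(2) precision(1) height(2) width(2) components(1)
                components = data[pos + 9] & 0xFF;
            }
            if (marker == 0xEE && len >= 7
                    && data[pos + 4] == 'A' && data[pos + 5] == 'd' && data[pos + 6] == 'o'
                    && data[pos + 7] == 'b' && data[pos + 8] == 'e') {
                adobe = true;
            }
            pos += 2 + len;                                  // jump to the next segment
        }
        return new JpegInfo(components, adobe);
    }
}
```

If components comes back as 4, the image is probably CMYK or YCCK, and a 
straight dump to a .jpg file or a default RGB decode is unlikely to display 
correctly without a CMYK-aware conversion step.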

Does anybody have a clue about this stuff?

Thanks in advance,

Mel

Dr. Mel Martinez
[email protected]


-----Original Message-----
From: Daniel Wilson [mailto:[email protected]] 
Sent: Tuesday, September 15, 2009 7:33 PM
To: [email protected]
Subject: Re: Extracting Images

I've done battle with the PDXObjectImage, but it has usually defeated me!
Sections 4.7 and 4.8 of the PDF spec address it.

Daniel

On Tue, Sep 15, 2009 at 6:01 PM, Martinez, Mel <[email protected]> wrote:

> I've been playing with extracting images.
>
> I've found a few 'weirdnesses' (I know, that's not a real word) in the
> org.apache.pdfbox.ExtractText class and if I can clear some time, I'll try
> to submit something on that.
>
> Ignoring the 'weirdnesses' (which have more to do with options parsing and
> filenaming), it does successfully extract images to separate files.
>
> However, the color table is apparently not being handled properly.
>
> All the images end up displaying with the default Windows palette, which
> tells me that they probably are missing their own.
>
> I assume that what probably needs to be done is that the color space needs
> to be rebuilt and reset on each image object prior to writing the image out
> to file, but I'm not entirely certain how to proceed with that.
>
> Does anybody have any familiarity with the PDXObjectImage and its related
> APIs?
>
> If someone can point me in the right direction, I don't mind doing the work
> of fixing this.
>
> Mel
>
> Dr. Mel Martinez
> [email protected]
>
>
>
>
