RE: Extracting Images

Martinez, Mel Tue, 22 Sep 2009 10:20:35 -0700

Unfortunately, I am not the creator of the PDF documents I need to extract 
from.  The images will come in whatever format they come in.

I see exactly what you describe - blues become pink and the palette is 'sort of 
reversed'.  But inverting the palette (in an image editor) doesn't quite fix it.

Unfortunately, this is also not my area of expertise so I'm struggling too.  I 
just don't have the luxury to choose which type of image format gets used.

-mel

-----Original Message-----
From: [email protected] [mailto:[email protected]] 
Sent: Tuesday, September 22, 2009 11:54 AM
To: [email protected]
Subject: RE: Extracting Images

I noticed that an image with an indexed palette (tested with BMP, PNG) 
did not look right after encrypting the PDF.  The colors were switched 
around.  I remember that blue became pink, but it wasn't a straight 
inverse.  Writing out the same PDF without encryption worked fine.  If RGB 
is used, it'll work fine whether encrypted or not (tested with PNG).

This doesn't seem to be the same thing you are describing, but it could be 
related.  I don't have the time nor expertise to look into that one so my 
solution was to use RGB images.
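
A minimal sketch of that workaround, assuming the image is already available 
as a java.awt.image.BufferedImage: redraw it into a TYPE_INT_RGB image so the 
palette is resolved before the image is embedded or written.  RgbNormalizer 
and toRgb are hypothetical names, not PDFBox APIs, and this has not been 
checked against the encryption problem itself.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper (not part of PDFBox): flattens any BufferedImage,
// including one with an indexed palette, into plain 8-bit-per-channel RGB.
class RgbNormalizer {

    static BufferedImage toRgb(BufferedImage src) {
        BufferedImage rgb = new BufferedImage(
                src.getWidth(), src.getHeight(), BufferedImage.TYPE_INT_RGB);
        Graphics2D g = rgb.createGraphics();
        try {
            // Drawing resolves each palette index to its actual color.
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return rgb;
    }

    public static void main(String[] args) {
        BufferedImage indexed =
                new BufferedImage(4, 4, BufferedImage.TYPE_BYTE_INDEXED);
        indexed.setRGB(0, 0, 0xFF0000FF); // opaque blue
        BufferedImage rgb = toRgb(indexed);
        System.out.println("type=" + rgb.getType()
                + " pixel=" + Integer.toHexString(rgb.getRGB(0, 0)));
    }
}
```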

--Adam

-----Original Message-----
From: "Martinez, Mel" <[email protected]> 
Sent: 09/22/2009 06:52
Reply-To: [email protected]
To: "[email protected]" <[email protected]>
Subject: RE: Extracting Images

Thanks, Alex.

Unfortunately, that (ExtractImages) is the first place I looked when I 
started this.

It basically uses the first technique below (Image.write2File(String)).

The problem I described also happens with the ExtractImages class.  It 
also happens with PDF2Image - which converts each whole page to an image. 
Within each page image, the embedded photos all have their colors 
screwed up.

I've tried this with several PDF input files and it happens with every 
color photo image.

Line art (even if rasterized and embedded as jpeg) and B&W images are 
fine.

I think there is something wrong with how PDFBox is extracting the images.

Is no one else seeing this?

I'm on a Windows XP PRO (64bit) machine.

-mel

-----Original Message-----
From: Alex Shvartz [mailto:[email protected]] 
Sent: Monday, September 21, 2009 7:20 PM
To: [email protected]
Subject: RE: Extracting Images

Hi,

Please have a look at the org.apache.pdfbox.ExtractImages class.
The extractImages() method shows clearly how to extract an image 
from a PDF file and save it.

Best Regards.

Alex.

--- On Mon, 9/21/09, Martinez, Mel <[email protected]> wrote:

From: Martinez, Mel <[email protected]>
Subject: RE: Extracting Images
To: "[email protected]" <[email protected]>
Date: Monday, September 21, 2009, 3:31 PM

Ugh!  I'm crying uncle!  I obviously need help (in more ways than one!).

If ANYBODY has some experience with extracting jpeg images from PDF files 
using PDFBox, I'd appreciate a few pointers.

I've started with the basics (lotsa null checks & junk removed):

    PDPage page = ....
    PDResources resources = page.getResources();
    Map<String, PDXObjectImage> images = resources.getImages();

So far, so good.  Some null & empty tests then
    ...
    PDXObjectImage image = images.get(key);

At this point, I've tried several things.  I've tried just letting the 
image class write itself out:

    image.write2File(fname); //where fname does not include the suffix

I've also tried rebuilding the image object from pieces like so:

    BufferedImage bi = image.getRGBImage();
    int bpc = image.getBitsPerComponent();
    PDColorSpace cspace = image.getColorSpace();
    ...
    WritableRaster srcRaster = bi.getRaster();
    ...
    ColorModel cm = cspace.createColorModel(bpc);
    int h = image.getHeight();
    int w = image.getWidth();
    WritableRaster raster = cm.createCompatibleWritableRaster(w,h);
    raster.setRect(srcRaster);
    bi = new BufferedImage(cm,raster,false,null);
    ImageIO.write(bi,format,new File(fname+"."+format));

This second method has the advantage of allowing you to write out to a 
different format, though some conversions crash it or look like garbage.
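
A rough sketch of one guard for the format-conversion step, using only the 
standard javax.imageio API: ImageIO.write returns false when no suitable 
writer is available for the image, so checking that result shows whether 
anything was actually written.  CheckedImageWriter and writeAs are 
hypothetical names, and this only detects an unsupported conversion; it does 
not correct wrong colors.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import javax.imageio.ImageIO;

// Hypothetical helper: attempts the write and reports whether it happened.
class CheckedImageWriter {

    static boolean writeAs(BufferedImage img, String format, OutputStream out)
            throws IOException {
        // No writer registered under this format name at all.
        if (!ImageIO.getImageWritersByFormatName(format).hasNext()) {
            return false;
        }
        // Returns false if no registered writer can encode this image.
        return ImageIO.write(img, format, out);
    }

    public static void main(String[] args) throws IOException {
        BufferedImage img = new BufferedImage(8, 8, BufferedImage.TYPE_INT_RGB);
        for (String format : new String[] {"png", "bmp", "no-such-format"}) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            boolean written = writeAs(img, format, buffer);
            System.out.println(format + ": written=" + written
                    + ", bytes=" + buffer.size());
        }
    }
}
```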

In general, both methods 'work' in that they extract the image and write 
it out to a file that can then be opened and displayed with any image 
viewer (or a web browser).  The problem is, the colors in the resulting 
image are simply off.  Way off.

JPEG & BMP color photo images look about the same, though the color 
palettes are sometimes off in different ways.
JPEG & BMP Black & white images and line art (even color) generally look 
fine.
TIFF images and PNG images look completely messed up, often turning into 
black rectangles or random color bands.  They also tend to crash the 
second code sample.
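
A small diagnostic sketch (plain java.awt, not PDFBox) that may help narrow 
this down: it reports the color model Java sees for a decoded BufferedImage, 
so a photo that comes out wrong can be compared with line art that comes out 
fine.  ImageDiagnostics and describe are hypothetical names, and the output 
says nothing by itself about where the colors go wrong.

```java
import java.awt.color.ColorSpace;
import java.awt.image.BufferedImage;
import java.awt.image.ColorModel;
import java.awt.image.IndexColorModel;

// Hypothetical helper: summarizes how Java is interpreting an image's pixels.
class ImageDiagnostics {

    static String colorSpaceName(int type) {
        switch (type) {
            case ColorSpace.TYPE_RGB:   return "RGB";
            case ColorSpace.TYPE_GRAY:  return "GRAY";
            case ColorSpace.TYPE_CMYK:  return "CMYK";
            case ColorSpace.TYPE_YCbCr: return "YCbCr";
            default:                    return "other(" + type + ")";
        }
    }

    static String describe(BufferedImage img) {
        ColorModel cm = img.getColorModel();
        // An IndexColorModel means pixels are palette indices, not colors.
        String kind = (cm instanceof IndexColorModel) ? "indexed" : "direct";
        return img.getWidth() + "x" + img.getHeight()
                + " " + kind
                + " " + colorSpaceName(cm.getColorSpace().getType())
                + " components=" + cm.getNumComponents()
                + " alpha=" + cm.hasAlpha()
                + " bitsPerPixel=" + cm.getPixelSize();
    }

    public static void main(String[] args) {
        System.out.println(describe(
                new BufferedImage(3, 2, BufferedImage.TYPE_INT_RGB)));
        System.out.println(describe(
                new BufferedImage(3, 2, BufferedImage.TYPE_BYTE_INDEXED)));
        System.out.println(describe(
                new BufferedImage(3, 2, BufferedImage.TYPE_BYTE_GRAY)));
    }
}
```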

Does anybody have a clue about this stuff?

Thanks in advance,

Mel

Dr. Mel Martinez
[email protected]

-----Original Message-----
From: Daniel Wilson [mailto:[email protected]] 
Sent: Tuesday, September 15, 2009 7:33 PM
To: [email protected]
Subject: Re: Extracting Images

I've done battle with the PDXObjectImage, but it has usually defeated me!
Sections 4.7 and 4.8 of the PDF spec address it.

Daniel

On Tue, Sep 15, 2009 at 6:01 PM, Martinez, Mel 
<[email protected]> wrote:

> I've been playing with extracting images.
>
> I've found a few 'weirdnesses' (I know, that's not a real word) in the
> org.apache.pdfbox.ExtractText class and if I can clear some time, I'll try
> to submit something on that.
>
> Ignoring the 'weirdnesses' (which have more to do with options parsing and
> filenaming), it does successfully extract images to separate files.
>
> However, the color table is apparently not being handled properly.
>
> All the images end up displaying with the default Windows palette, which
> tells me that they probably are missing their own.
>
> I assume that what probably needs to be done is that the color space needs
> to be rebuilt and reset on each image object prior to writing the image out
> to file, but I'm not entirely certain how to proceed with that.
>
> Does anybody have any familiarity with the PDXObjectImage and its related
> APIs?
>
> If someone can point me in the right direction, I don't mind doing the work
> of fixing this.
>
> Mel
>
> Dr. Mel Martinez
> [email protected]
>
>
>
>
