Doug Moreland <doug.moreland <at> gmail.com> writes:

> Looking for advice on the best approach to do something others may have tried.
> I have PDFs with text and graphics. They are very large, due to the graphics. I
> want to read each PDF and produce a new PDF with the text just as it was in the
> original, but rasterize the rest of the graphics into a fairly low res bitmap to
> be added behind the text, reducing the overall filesize of the bitmap. I do not
> need to manipulate the text, just replicate it. Everything else can be
> bitmapped.
> Any starting places would be greatly appreciated. thank you.

I wouldn't wish that on a PDF guru.  Ouch.

First of all, I'm going to ass-u-me that you mean "raster image" when you say
"graphic".

It'll have to go Something Like This:

1) enumerate the XObject resources from each page and XObject Forms looking for
XObject Images.
2) Extract the image data, resample them at a lower resolution, and write them
back out over the original Image.
3) Update the content streams where those images are used to reflect their new
dimensions.

#1 isn't all that hard.

#2's difficulty depends in part on how many different image types you need to
deal with and what image libraries you have access to.  Heck, plain ol' Java
might do the trick in some cases.  You can set a PRStream's data directly,
though you might have to deal with some compression filters that iText doesn't
know about (yet).  It really depends on the format.
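
For the "plain ol' Java" route in #2, here's a rough sketch that uses nothing
but java.awt to scale down an image that has already been decoded into a
BufferedImage.  The class and method names (ImageDownsampler, downsample) are
made up for illustration.  Getting the pixels out of the PDF stream and back in
again (filters, color spaces, masks) is the hard part and is not shown.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of iText.
final class ImageDownsampler {

    /** Returns a scaled copy of src; factor must be in (0, 1]. */
    static BufferedImage downsample(BufferedImage src, double factor) {
        if (factor <= 0 || factor > 1) {
            throw new IllegalArgumentException("factor must be in (0, 1]");
        }
        // Never go below one pixel in either direction.
        int w = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int h = Math.max(1, (int) Math.round(src.getHeight() * factor));

        // Assumption: plain RGB (or ARGB when the source has alpha) is good
        // enough; other color models are flattened to one of these two.
        int type = src.getColorModel().hasAlpha()
                ? BufferedImage.TYPE_INT_ARGB
                : BufferedImage.TYPE_INT_RGB;
        BufferedImage dst = new BufferedImage(w, h, type);

        Graphics2D g = dst.createGraphics();
        try {
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            g.drawImage(src, 0, 0, w, h, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

Something like ImageDownsampler.downsample(img, 0.25) would turn a 200x100
image into a 50x25 one.  Bilinear interpolation is just a reasonable default
here; pick whatever looks acceptable for your images.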


#3 will be Non-Trivial in direct proportion to how many different applications
are producing your PDFs, and how many different ways they do it.  You need to
track down the instructions used to draw the image (no mean feat), then Change
Them (potential nightmare).

For example, a simple image draw command might look like This:
q
50 0 0 50 25 25 cm /Img1 Do
Q

This draws the image resource named Img1, scaled to 50x50 units, with its
lower-left corner at (25, 25).
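
To give a feel for "tracking down the instructions", here's a minimal sketch
that pulls the matrix and the resource name out of draws shaped like the one
above.  It assumes the content stream has already been decoded to text and that
the 'cm' is followed directly by the '/Name Do'.  Real producers will not always
be that polite.  ImageDrawFinder and ImageDraw are hypothetical names, and a
regex is a shortcut here, not a real content stream parser.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText.
final class ImageDrawFinder {

    /** One "a b c d e f cm /Name Do" occurrence. */
    static final class ImageDraw {
        final double[] matrix; // {a, b, c, d, e, f}
        final String name;     // resource name without the leading slash

        ImageDraw(double[] matrix, String name) {
            this.matrix = matrix;
            this.name = name;
        }
    }

    // A PDF number: optional sign, digits, optional fraction.
    private static final String NUM = "([-+]?(?:\\d+\\.?\\d*|\\.\\d+))";

    // Six numbers, "cm", a name, "Do", separated by whitespace.
    private static final Pattern DRAW = Pattern.compile(
            "(?<![\\w.+-])"
            + String.join("\\s+", Collections.nCopies(6, NUM))
            + "\\s+cm\\s+/(\\S+)\\s+Do(?!\\S)");

    static List<ImageDraw> find(String content) {
        List<ImageDraw> draws = new ArrayList<>();
        Matcher m = DRAW.matcher(content);
        while (m.find()) {
            double[] matrix = new double[6];
            for (int i = 0; i < 6; i++) {
                matrix[i] = Double.parseDouble(m.group(i + 1));
            }
            draws.add(new ImageDraw(matrix, m.group(7)));
        }
        return draws;
    }
}
```

Feeding it the three-line example above yields a single entry with name "Img1"
and matrix {50, 0, 0, 50, 25, 25}.  Note that 'Do' also paints Form XObjects,
so the names still have to be checked against the page's image resources.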

'cm' stands for Concatenate Matrix.  If the image was rotated, you'll get to
Have Fun With Trigonometry.  If it was SKEWED, you're in hell.
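
The trigonometry is at least manageable if all you want is the size an image
ends up on the page, say to pick a target resolution for resampling.  The sketch
below reads the drawn width and height off a matrix {a, b, c, d, e, f} and turns
a pixel count into dots per inch.  Assumptions, all mine: the default of 72
user-space units per inch, the matrix passed in is the only transform in effect
for that image, and the pixel count comes from the image dictionary's /Width.
ImageResolution and its methods are hypothetical names.

```java
// Hypothetical helper, not part of iText.
final class ImageResolution {

    // Assumption: default user space, 72 units per inch.
    static final double UNITS_PER_INCH = 72.0;

    /** Length of the image's horizontal edge on the page, in units. */
    static double drawnWidth(double[] m) {
        return Math.hypot(m[0], m[1]);
    }

    /** Length of the image's vertical edge on the page, in units. */
    static double drawnHeight(double[] m) {
        return Math.hypot(m[2], m[3]);
    }

    /** Resolution along the horizontal edge, in pixels per inch. */
    static double horizontalDpi(int pixelWidth, double[] m) {
        return pixelWidth * UNITS_PER_INCH / drawnWidth(m);
    }

    /** Pixels needed along an edge of the given length at targetDpi. */
    static int targetPixels(double drawnLength, double targetDpi) {
        long px = Math.round(drawnLength * targetDpi / UNITS_PER_INCH);
        return (int) Math.max(1, px);
    }
}
```

For the example matrix {50, 0, 0, 50, 25, 25} and a 200-pixel-wide image,
horizontalDpi comes out at 288.  Using edge lengths rather than a and d
directly means a rotated image gives the same answer as an upright one.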

Fortunately, you shouldn't have to track the current transformation matrix, just
the proportionate difference between the old resolution and the new resolution.

OldXResolution / NewXResolution = OldXScale / Z, solve for Z.  

("solve for X" could have been a bit confusing, eh?)

I don't think you'll need to change the x,y offsets at all, so long as your
output is scaled properly.




The alternative approach is easier on the rasterization side, and much harder
when it comes to content parsing/manipulation.

Render the pages to some image format at the resolution you want using any
available PDF renderer (Ghostscript, for example).  Draw that into the page as
the "under content" using a PdfStamper.

Now parse the content streams, keeping track of all the graphic state as you go,
and yank out all those graphics you don't want, leaving the text Where It Was
Before (no small trick).

This approach will work even if the "Graphics" you're trying to remove are line
art, pattern fills, or what have you.  It'll just be Very Difficult to do all
the extra parsing.  

OTOH, if the graphics you're getting rid of are all raster images, you can
really just get rid of the "/Img1 Do" calls.  They're (almost?) always wrapped
in a 'q/Q' pair (save state, restore state), so they don't have any impact on
the surrounding drawing commands... meaning you don't have to track any state.
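
A corner-cutting sketch of that shortcut, under some loud assumptions: the
content stream is already decoded to text, every image draw is wrapped exactly
as "q <six numbers> cm /Name Do Q", and nothing that looks like that appears
inside a text string.  ImageDrawStripper is a hypothetical name.  The set of
image names has to come from the page resources, because 'Do' paints Form
XObjects too and you don't want to throw those away by accident.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of iText.
final class ImageDrawStripper {

    // "q", six numbers, "cm", "/Name", "Do", "Q", separated by whitespace.
    private static final Pattern WRAPPED_DRAW = Pattern.compile(
            "(?<!\\S)q\\s+"
            + "(?:[-+]?(?:\\d+\\.?\\d*|\\.\\d+)\\s+){6}"
            + "cm\\s+/(\\S+)\\s+Do\\s+Q(?!\\S)");

    /** Removes wrapped draws whose resource name is in imageNames. */
    static String strip(String content, Set<String> imageNames) {
        Matcher m = WRAPPED_DRAW.matcher(content);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Drop the block for known images, keep anything else as-is.
            String kept = imageNames.contains(m.group(1)) ? "" : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(kept));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Anything that doesn't fit the pattern is left alone, so a producer that splits
the 'cm' and the 'Do' across other operators will sail straight through
untouched.  That is the price of not tracking state.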

This approach is cutting a lot of corners from a General Solution because when
you have a limited number of programs producing your PDFs (hopefully "1"), you
can start to make Assumptions about how their content streams will be laid out.
This can be Quite Dangerous, particularly if they change their content
formatting in some minor revision and blow your corner-cutting parser all to
hell.  You Have Been Warned.

I take it from your question that you don't know all that much about PDF?  You
miiight want to contract this one out... Would it be too mercenary to suggest
itextsoftware.com?  Yeah?  Ah well.


--Mark Storer

