At 03:18 PM 2/25/2006, Petter Nyström wrote:
You could use iText to extract the
images, though you'd also need a VERY detailed
understanding of image handling and color
management in order to make sure that the
extracted data was in the correct form.
That sounds problematic. I assume I have been
wrong in my assumption that the stream data of
the PDF holds raw image data in some format, be it JPEG, TIFF or other?
Yes, you are wrong in that assumption - at least partially.
Image data in PDF is either in JPEG/JFIF
format (which can just be written out to a file)
- OR it is simply an array of "pixels" in the
specified colorspace. So in the latter case
(which is probably the more common), you would
need to transform the data into something usable
in JPEG, TIFF, etc. This may include not only
file format, but also colorspace handling since
PDF supports 11 colorspaces while JPEG (for example) only does 2.
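
To make the two cases above concrete, here is a minimal sketch in Python (standard library only, not iText). It assumes you have already pulled the image stream's bytes and its /Filter, /Width, /Height and /ColorSpace entries out of the PDF by some other means. The function names are made up for illustration. It only handles 8-bit DeviceGray and DeviceRGB samples and ignores predictors, /Decode arrays and every other colorspace, which is exactly where the real work lies.

```python
import struct
import zlib


def png_chunk(tag: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, tag, data, CRC over tag + data."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def samples_to_png(samples: bytes, width: int, height: int, colorspace: str) -> bytes:
    """Wrap decoded 8-bit DeviceGray or DeviceRGB samples in a PNG container."""
    # (channels per pixel, PNG color type): 0 = grayscale, 2 = RGB.
    layouts = {"DeviceGray": (1, 0), "DeviceRGB": (3, 2)}
    if colorspace not in layouts:
        # Assumption of this sketch: anything else needs real color conversion.
        raise ValueError("colorspace needs conversion first: " + colorspace)
    channels, color_type = layouts[colorspace]
    row_len = width * channels
    if len(samples) != row_len * height:
        raise ValueError("sample data does not match the image dimensions")
    # PNG expects a filter-type byte (0 = none) in front of every scanline.
    raw = b"".join(
        b"\x00" + samples[y * row_len:(y + 1) * row_len] for y in range(height)
    )
    ihdr = struct.pack(">IIBBBBB", width, height, 8, color_type, 0, 0, 0)
    return (
        b"\x89PNG\r\n\x1a\n"
        + png_chunk(b"IHDR", ihdr)
        + png_chunk(b"IDAT", zlib.compress(raw))
        + png_chunk(b"IEND", b"")
    )


def export_image(stream_data: bytes, filter_name: str,
                 width: int, height: int, colorspace: str):
    """Return (file extension, file bytes) for one image stream."""
    if filter_name == "DCTDecode":
        # JPEG case: the stream bytes can be written out as they are.
        return "jpg", stream_data
    if filter_name == "FlateDecode":
        # Pixel-array case: undo the compression to get the raw samples.
        stream_data = zlib.decompress(stream_data)
    return "png", samples_to_png(stream_data, width, height, colorspace)
```

For example, export_image(data, "DCTDecode", w, h, cs) hands back the bytes untouched, while a FlateDecode stream of DeviceRGB samples comes back wrapped as a PNG. PNG is used here only because it can be written with the standard library; the same reshaping step would be needed for TIFF or anything else.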
It is not as simple as taking this data and
writing it to a file, and voila there's the image?
Correct, it is not that simple.
Depending on what types of
modifications you are going to allow the 3rd
party tools to do, it MIGHT be possible to use
iText, but you'd need to work at a very low
level of PDF functionality to find, modify and replace the relevant objects.
But does iText have support for working at this
low level, or will I need to write my own
routines for hacking into the PDF syntax?
No, you won't need to write your own routines: all the PDF syntax stuff is done for
you. HOWEVER, you will need to understand WHAT
PDF "objects" you need to add/modify, etc.
When I set out on my search for PDF libraries,
my highest goal was really to find a PDF
parser. I would love to find code that takes a
PDF document and turns it into a data
structure representing the elements in the PDF
- i.e. a parse tree. Then I could traverse
this tree and do whatever modifications I'd
like to the nodes therein. When finished, I'd
need some code to turn the parse tree back into a flat string - a PDF document.
There are a couple of commercial libraries that offer this feature.
Alright, I regret my statement that non-open
source solutions were out of the picture. Please
share the names of these libraries! =)
Adobe's PDFLibrary and PDFNet
(http://www.pdftron.com/) both offer this.
Specifically, I think I can get my hands on the
official Adobe SDK for PDFs. Does anyone know
what sort of support that library could give me?
Well, there are two different aspects to
the Acrobat SDK. First are the tools for
building plugins to Adobe Acrobat, and the second
is the Adobe PDFLibrary for stand-alone
applications. Both offer what you need, though
only the second could be used for server-side
solutions - but the first is FREE (minus the cost
of Acrobat, of course) and the second is quite expensive.
Also, would anyone have an opinion of how feasible this sort of approach is?
The approach you describe is that taken
by a number of commercial solutions - including
those from my company. So it's quite feasible and is the right approach.
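
As a rough illustration of the parse, modify and write-back idea discussed above, here is a small hand-written sketch in Python. It is not how any of the libraries mentioned work internally. The names Name, Ref, tokenize, parse and serialize are invented for this example, and it only covers a subset of the object syntax (dictionaries, arrays, names, numbers, simple literal strings, booleans, null and indirect references). Streams, hex strings, comments, nested parentheses and the file structure (xref table, trailer) are all left out.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    value: str  # a PDF name such as /Width, stored without the slash


@dataclass(frozen=True)
class Ref:
    num: int  # object number of an indirect reference "num gen R"
    gen: int  # generation number


TOKEN = re.compile(
    rb"\s*(<<|>>|\[|\]|/[^\s/<>\[\]()]*|\((?:\\.|[^\\()])*\)"
    rb"|[+-]?(?:\d+\.?\d*|\.\d+)|[A-Za-z]+)"
)
KEYWORDS = {b"true": True, b"false": False, b"null": None}


def tokenize(data: bytes) -> list:
    """Split the supported subset of PDF object syntax into tokens."""
    tokens, pos = [], 0
    while True:
        match = TOKEN.match(data, pos)
        if not match:
            break
        tokens.append(match.group(1))
        pos = match.end()
    if data[pos:].strip():
        raise ValueError("unsupported syntax at offset %d" % pos)
    return tokens


def parse(tokens: list, i: int = 0):
    """Parse one object starting at tokens[i]; return (value, next index)."""
    tok = tokens[i]
    if tok == b"<<":
        result, i = {}, i + 1
        while tokens[i] != b">>":
            key, i = parse(tokens, i)
            value, i = parse(tokens, i)
            result[key] = value
        return result, i + 1
    if tok == b"[":
        items, i = [], i + 1
        while tokens[i] != b"]":
            value, i = parse(tokens, i)
            items.append(value)
        return items, i + 1
    if tok.startswith(b"/"):
        return Name(tok[1:].decode("ascii")), i + 1
    if tok.startswith(b"("):
        return tok[1:-1], i + 1  # escape sequences are kept as written
    if tok in KEYWORDS:
        return KEYWORDS[tok], i + 1
    if (tok.isdigit() and i + 2 < len(tokens)
            and tokens[i + 1].isdigit() and tokens[i + 2] == b"R"):
        return Ref(int(tok), int(tokens[i + 1])), i + 3
    text = tok.decode("ascii")
    if re.fullmatch(r"[+-]?\d+", text):
        return int(text), i + 1
    if re.fullmatch(r"[+-]?(?:\d+\.?\d*|\.\d+)", text):
        return float(text), i + 1
    raise ValueError("unexpected token: " + text)


def serialize(obj) -> bytes:
    """Turn the tree back into PDF object syntax."""
    if isinstance(obj, dict):
        inner = b" ".join(serialize(k) + b" " + serialize(v) for k, v in obj.items())
        return b"<< " + inner + b" >>"
    if isinstance(obj, list):
        return b"[" + b" ".join(serialize(item) for item in obj) + b"]"
    if isinstance(obj, Name):
        return b"/" + obj.value.encode("ascii")
    if isinstance(obj, Ref):
        return b"%d %d R" % (obj.num, obj.gen)
    if isinstance(obj, bytes):
        return b"(" + obj + b")"
    if obj is None:
        return b"null"
    if isinstance(obj, bool):
        return b"true" if obj else b"false"
    return str(obj).encode("ascii")  # ints and ordinary floats only


source = b"<< /Type /XObject /Subtype /Image /Width 640 /Length 12 0 R >>"
tree, _ = parse(tokenize(source))
tree[Name("Width")] = 320  # modify a node in the tree
print(serialize(tree).decode("ascii"))
```

The last four lines show the round trip: parse a dictionary, change one entry, and write it back out. A real tool also has to resolve indirect references, decode streams and rebuild the cross-reference table when saving, which is most of what the commercial libraries are doing for you.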
Are there for example official formal grammars
available for the PDF syntax? Something you
could feed to lex/yacc or similar. (I think
not, because I spent quite some time looking.)
It's been tried, but the PDF syntax doesn't fit BNF well.
Leonard
---------------------------------------------------------------------------
Leonard Rosenthol <mailto:[EMAIL PROTECTED]>
Chief Technical Officer <http://www.pdfsages.com>
PDF Sages, Inc. 215-938-7080 (voice)
215-938-0880 (fax)