At 03:18 PM 2/25/2006, Petter Nyström wrote:
You could use iText to extract the images, though you'd also need a VERY detailed understanding of image handling and color management in order to make sure that the extracted data was in the correct form.

That sounds problematic. I assume I have been wrong in my assumption that the stream data of the PDF holds raw image data in some format, be it JPEG or TIFF or other?

        Yes, you are wrong in that assumption - at least partially.

Image data in PDF is either in JPEG/JFIF format (which can just be written out to a file) - OR it is simply an array of "pixels" in the specified colorspace. So in the latter case (which is probably the more common), you would need to transform the data into something usable in JPEG, TIFF, etc. This may include not only file format, but also colorspace handling, since PDF supports 11 colorspaces while JPEG (for example) only does 2.
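
To make the two cases concrete, here is a minimal sketch in plain Java. It is not iText code: the class and method names are hypothetical, and it assumes you already have the image stream's bytes in hand (decoded, in the second case). The first method checks for the JPEG start-of-image marker (0xFF 0xD8), which is the case where the bytes can go straight to a .jpg file. The second does a naive conversion of 8-bit DeviceCMYK samples to RGB, assuming the default sample mapping (0 = no ink) and applying no ICC profile, so it is not color managed and the results are only approximate.

```java
// Sketch only: hypothetical helper, not part of iText or any PDF library.
final class PdfImageSamples {

    /** True if the bytes begin with the JPEG start-of-image marker 0xFF 0xD8. */
    static boolean looksLikeJpeg(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xD8;
    }

    /**
     * Naive conversion of decoded 8-bit CMYK samples (C, M, Y, K per pixel)
     * into packed 0xRRGGBB values. Assumption: 0 means no ink, 255 full ink.
     * No ICC profile is applied, so colors are approximate at best.
     */
    static int[] cmykToRgb(byte[] cmyk) {
        if (cmyk.length % 4 != 0) {
            throw new IllegalArgumentException("expected 4 bytes per pixel");
        }
        int[] rgb = new int[cmyk.length / 4];
        for (int i = 0; i < rgb.length; i++) {
            int c = cmyk[4 * i] & 0xFF;
            int m = cmyk[4 * i + 1] & 0xFF;
            int y = cmyk[4 * i + 2] & 0xFF;
            int k = cmyk[4 * i + 3] & 0xFF;
            int r = (255 - c) * (255 - k) / 255;
            int g = (255 - m) * (255 - k) / 255;
            int b = (255 - y) * (255 - k) / 255;
            rgb[i] = (r << 16) | (g << 8) | b;
        }
        return rgb;
    }

    public static void main(String[] args) {
        byte[] pureCyan = {(byte) 255, 0, 0, 0};
        // Full cyan ink, nothing else: red drops to 0, green and blue stay at 255.
        System.out.println(String.format("%06X", cmykToRgb(pureCyan)[0])); // prints 00FFFF
    }
}
```

Even this toy version shows where the work is: the JPEG case is a file write, while the pixel-array case needs a colorspace-specific transform before any image format can hold the data.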


It is not as simple as taking this data and writing it to a file, and voila there's the image?

        Correct, it is not that simple.


Depending on what types of modifications you are going to allow the 3rd party tools to do, it MIGHT be possible to use iText, but you'd need to work at a very low level of PDF functionality to find, modify and replace the relevant objects.

But does iText have support for working at this low level, or will I need to write my own routines for hacking into the PDF syntax?

No, all the PDF syntax stuff is done for you. HOWEVER, you will need to understand WHAT PDF "objects" you need to add/modify, etc.


When I set out on my search for PDF libraries, my highest goal was really to find a PDF parser. I would love to find code that takes a PDF document and turns it into a data structure representing the elements in the PDF - i.e. a parse tree. Then I could traverse this tree and do whatever modifications I'd like to the nodes therein. When finished, I'd need some code to turn the parse tree back into a flat string - a PDF document.

       There are a couple of commercial libraries that offer this feature.

Alright, I regret my statement that non-open source solutions were out of the picture. Please share the names of these libraries! =)

Adobe's PDFLibrary and PDFNet (http://www.pdftron.com/) both offer this.


Specifically, I think I can get my hands on the official Adobe SDK for PDFs. Does anyone know what sort of support that library could give me?

Well, there are two different aspects to the Acrobat SDK. First are the tools for building plugins to Adobe Acrobat, and the second is the Adobe PDFLibrary for stand-alone applications. Both offer what you need, though only the second could be used for server-side solutions - but the first is FREE (minus the cost of Acrobat, of course) and the second is quite expensive.


Also, would anyone have an opinion of how feasible this sort of approach is?

The approach you describe is that taken by a number of commercial solutions - including those from my company. So it's quite feasible and is the right approach.
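
As a rough illustration of that parse, modify, write-back cycle, here is a toy sketch in plain Java. It is an assumption-laden model, not the API of iText or of either commercial library: a dictionary is a Map, an array is a List, a number is an Integer, and a String stands for a PDF name. Parsing is left out entirely, and so are strings, streams, indirect references and name escaping. It only shows a tree being edited and then flattened back into PDF syntax.

```java
// Hypothetical toy model; real libraries use their own object classes.
final class PdfTree {

    /** Flattens a node tree back into PDF syntax. */
    static String serialize(Object node) {
        StringBuilder out = new StringBuilder();
        write(node, out);
        return out.toString();
    }

    private static void write(Object node, StringBuilder out) {
        if (node instanceof java.util.Map) {
            // Dictionary: << /Key value ... >>
            out.append("<<");
            for (java.util.Map.Entry<?, ?> e : ((java.util.Map<?, ?>) node).entrySet()) {
                out.append(" /").append(e.getKey()).append(' ');
                write(e.getValue(), out);
            }
            out.append(" >>");
        } else if (node instanceof java.util.List) {
            // Array: [ item item ... ]
            out.append('[');
            for (Object item : (java.util.List<?>) node) {
                out.append(' ');
                write(item, out);
            }
            out.append(" ]");
        } else if (node instanceof Integer) {
            out.append(node);
        } else if (node instanceof String) {
            // Strings stand for PDF names in this sketch (no #xx escaping).
            out.append('/').append(node);
        } else {
            throw new IllegalArgumentException("unsupported node: " + node);
        }
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps key order, so the output is predictable.
        java.util.Map<String, Object> image = new java.util.LinkedHashMap<>();
        image.put("Type", "XObject");
        image.put("Subtype", "Image");
        image.put("Width", 2);

        image.put("Width", 4); // modify a node in place

        System.out.println(serialize(image));
        // prints << /Type /XObject /Subtype /Image /Width 4 >>
    }
}
```

A real document adds the cross-reference table and indirect objects on top of this, which is where most of the effort in a production library goes.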


Are there, for example, official formal grammars available for the PDF syntax? Something you could feed to lex/yacc or similar. (I think not, because I spent quite some time looking.)

        It's been tried, but the PDF syntax doesn't fit well to BNF.
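
One illustration of the mismatch (my example, and only one of several possible reasons): the extent of a stream's data is given by the /Length entry in its dictionary, not by a token the grammar can recognize, so the data itself may legally contain the bytes "endstream". The plain-Java sketch below uses hypothetical names and assumes /Length has already been read as a direct integer. It compares reading by the declared length with naively scanning for the keyword.

```java
// Sketch only: hypothetical helper, not part of iText.
final class StreamLengthDemo {

    /** Index of the first data byte after the "stream" keyword and its end-of-line. */
    static int dataStart(byte[] obj) {
        // ISO-8859-1 maps bytes to chars one-to-one, so string indexes equal byte offsets.
        String text = new String(obj, java.nio.charset.StandardCharsets.ISO_8859_1);
        int kw = text.indexOf("stream");
        if (kw < 0) {
            throw new IllegalArgumentException("no stream keyword");
        }
        int start = kw + "stream".length();
        if (start < obj.length && obj[start] == '\r') start++;
        if (start < obj.length && obj[start] == '\n') start++;
        return start;
    }

    /** Reads the payload using the length declared in the dictionary. */
    static byte[] readByLength(byte[] obj, int declaredLength) {
        int start = dataStart(obj);
        if (start + declaredLength > obj.length) {
            throw new IllegalArgumentException("declared length exceeds data");
        }
        return java.util.Arrays.copyOfRange(obj, start, start + declaredLength);
    }

    /** Length found by scanning for "endstream", which is what a pure tokenizer would do. */
    static int naiveLength(byte[] obj) {
        String text = new String(obj, java.nio.charset.StandardCharsets.ISO_8859_1);
        int start = dataStart(obj);
        return text.indexOf("endstream", start) - start;
    }

    public static void main(String[] args) {
        byte[] obj = "<< /Length 13 >>\nstream\nx endstream y\nendstream"
                .getBytes(java.nio.charset.StandardCharsets.ISO_8859_1);
        System.out.println(readByLength(obj, 13).length); // prints 13
        System.out.println(naiveLength(obj));             // prints 2
    }
}
```

The parser has to interpret the dictionary before it knows how to tokenize what follows, which is awkward to express in a lex/yacc style grammar.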


Leonard

---------------------------------------------------------------------------
Leonard Rosenthol                            <mailto:[EMAIL PROTECTED]>
Chief Technical Officer                      <http://www.pdfsages.com>
PDF Sages, Inc.                              215-938-7080 (voice)
                                             215-938-0880 (fax)



_______________________________________________
iText-questions mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/itext-questions
