inline images – EI operator

Lukas Schober Wed, 22 Apr 2015 08:48:13 -0700

Dear pdfbox-devs,

a co-worker and I are currently developing a service, based on pdfbox, for searching and replacing content in pdf documents. We started our project with version 1.8.2 of pdfbox and recently tried to migrate to 1.8.8.

After changing to version 1.8.8 we ran into trouble with pdf content that contains inline images. Studying the differences between the two versions of pdfbox led us to the handling of the EI operator as the cause of our trouble.

In version 1.8.2 the method parseNextToken() of org.apache.pdfbox.pdfparser.PDFStreamParser does an unread of the EI token on inline images. In newer versions this unread of the EI token no longer exists, with the following comment: “// the EI operator isn't unread, as it won't be processed anyway”.

As a consequence, the token sequence that the PDFStreamParser delivers for a document containing an inline image can't be used by the ContentStreamWriter to (re)render a valid pdf document. The reason is the missing token for the EI operator. The EI token may not trigger any further processing, but it is still needed to represent the delimiter in the token sequence.

Conversely, if an inline image should be part of a pdf page and is inserted manually as a token sequence, the EI token must also be present in that sequence so that the ContentStreamWriter is able to create a correct pdf document.

From our point of view there are two simple approaches to get a more consistent internal representation of inline images in pdfbox. Either represent the EI operator explicitly as a token (revert to the handling in version 1.8.2), or extend the writeObject method in the ContentStreamWriter to append the EI operator implicitly.
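
To illustrate the second approach, here is a minimal stand-alone sketch. It does not use the pdfbox classes: InlineImage, Operator and write() are hypothetical stand-ins for the parsed tokens and for the writer, and the way the image token is modelled is an assumption made only for this example. The writer appends EI after the image data unless the next token already is an EI operator, so token sequences with and without an explicit EI token produce the same output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; none of these types are pdfbox classes.
class InlineImageTokenSketch {

    /** Stand-in for an inline image token (BI dictionary plus the data following ID). */
    static final class InlineImage {
        final String dictionary;
        final String data;

        InlineImage(String dictionary, String data) {
            this.dictionary = dictionary;
            this.data = data;
        }
    }

    /** Stand-in for an ordinary operator token such as "q", "Q" or "EI". */
    static final class Operator {
        final String name;

        Operator(String name) {
            this.name = name;
        }
    }

    static boolean isEndImage(Object token) {
        return token instanceof Operator && "EI".equals(((Operator) token).name);
    }

    /** Serializes the tokens and appends EI implicitly when the sequence lacks it. */
    static String write(List<Object> tokens) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            Object token = tokens.get(i);
            if (token instanceof InlineImage) {
                InlineImage image = (InlineImage) token;
                out.append("BI ").append(image.dictionary)
                   .append(" ID ").append(image.data).append('\n');
                boolean nextIsEndImage =
                        i + 1 < tokens.size() && isEndImage(tokens.get(i + 1));
                if (!nextIsEndImage) {
                    out.append("EI\n"); // delimiter missing from the token sequence
                }
            } else if (token instanceof Operator) {
                out.append(((Operator) token).name).append('\n');
            } else {
                out.append(token).append(' '); // operands such as numbers or names
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Object> withoutEndImage = new ArrayList<Object>();
        withoutEndImage.add(new Operator("q"));
        withoutEndImage.add(new InlineImage("/W 1 /H 1 /BPC 8 /CS /G", "00"));
        withoutEndImage.add(new Operator("Q"));

        List<Object> withEndImage = new ArrayList<Object>(withoutEndImage);
        withEndImage.add(2, new Operator("EI"));

        System.out.print(write(withoutEndImage));
        System.out.println(write(withoutEndImage).equals(write(withEndImage)));
    }
}
```

The first approach would instead keep the writer unchanged and have the parser return the EI operator as its own token, which corresponds to the withEndImage list in the sketch.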

Furthermore, in our specialization of the PDFTextStripper, the limited ability to access the base-class properties was a restricting factor. Is there a reason why the properties


org.apache.pdfbox.util.PDFTextStripper::startBookmarkPageNumber
org.apache.pdfbox.util.PDFTextStripper::endBookmarkPageNumber
org.apache.pdfbox.util.PDFTextStripper::pageArticles
org.apache.pdfbox.util.PDFTextStripper::characterListMapping
org.apache.pdfbox.util.PDFStreamEngine::streamResourcesStack
org.apache.pdfbox.util.PDFStreamEngine::page

really need to be private, or would it be restrictive enough to make them protected, so that they can be accessed in derived classes?


Best regards,
Lukas Schober

