Hi Lukas,
Done. A snapshot will be available within a few hours here:
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/1.8.10-SNAPSHOT/
Please test and confirm that it works for you.
About your second question: I have no opinion on this. It would be
best to open an issue in JIRA and explain
- what you need it for
- whether you need read or write access
"Exposing our privates" is always a controversial topic here :-)
Tilman
On 22.04.2015 at 18:56, Tilman Hausherr wrote:
Hi Lukas,
Thanks for your detailed analysis. It's my fault. (See
https://issues.apache.org/jira/browse/PDFBOX-1794 ). I think that the
2nd solution you suggested is the better one. I've opened
https://issues.apache.org/jira/browse/PDFBOX-2772 and will work on
this soon.
Tilman
On 22.04.2015 at 17:26, Lukas Schober wrote:
Dear pdfbox-devs,
a co-worker and I are currently developing a service, based on
pdfbox, for searching and replacing content in pdf documents. We
started our project with version 1.8.2 of pdfbox and recently tried
to migrate to 1.8.8.
After changing to version 1.8.8 we ran into trouble with pdf content
containing inline images. Studying the code differences between
those versions led us to the handling of the EI operator as the
cause of our trouble.
In version 1.8.2 the method parseNextToken() of
org.apache.pdfbox.pdfparser.PDFStreamParser unreads the EI token of
an inline image. In newer versions this unread no longer exists;
instead there is the comment “// the EI operator isn't unread, as it
won't be processed anyway”.
As a consequence, the token sequence that the PDFStreamParser
delivers for a document containing an inline image can't be used by
the ContentStreamWriter to (re)render a valid pdf document.
The reason is the missing token for the EI operator. The EI token
may not trigger any further processing, but it is still necessary to
represent the delimiter in the token sequence.
Conversely, if an inline image is to become part of a pdf page and
is inserted manually as a token set, the EI token must also be
present in the token set so that the ContentStreamWriter is able to
create a correct pdf document.
From our point of view there are two simple approaches to get a more
consistent internal representation of inline images in pdf documents
with pdfbox. Either represent the EI operator explicitly as a token
(revert to the handling in version 1.8.2), or extend the writeObject
method in the ContentStreamWriter to append the EI operator
implicitly.
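To make the two approaches concrete, here is a minimal, self-contained
sketch. It does not use pdfbox at all: the class InlineImageTokenSketch,
the InlineImage token type, the write method and its appendEi flag are
all hypothetical names chosen for illustration, and the image data "00"
is only a placeholder. It merely models a writer that serializes a token
list and shows that both approaches produce the same output, while the
combination "no EI token and no implicit EI" drops the delimiter.

```java
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for a token writer; this is NOT PDFBox code.
public class InlineImageTokenSketch {

    // Hypothetical token carrying an inline image's parameters and data.
    static final class InlineImage {
        final String params;
        final String data;

        InlineImage(String params, String data) {
            this.params = params;
            this.data = data;
        }
    }

    // Serializes tokens one per line. If appendEi is true, the writer
    // closes every inline image with EI itself (approach 2); otherwise
    // it relies on an explicit "EI" token in the list (approach 1).
    static String write(List<Object> tokens, boolean appendEi) {
        StringBuilder out = new StringBuilder();
        for (Object token : tokens) {
            if (token instanceof InlineImage) {
                InlineImage image = (InlineImage) token;
                out.append("BI\n").append(image.params).append('\n');
                out.append("ID\n").append(image.data).append('\n');
                if (appendEi) {
                    out.append("EI\n");
                }
            } else {
                out.append(token).append('\n');
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Placeholder parameters and data, for illustration only.
        InlineImage image = new InlineImage("/W 1 /H 1 /BPC 8 /CS /G", "00");

        // Approach 1: the parser emits EI as its own token.
        List<Object> withEi = Arrays.<Object>asList("q", image, "EI", "Q");
        // Approach 2: no EI token; the writer appends it.
        List<Object> withoutEi = Arrays.<Object>asList("q", image, "Q");

        // Both approaches yield the same stream: prints true
        System.out.println(write(withEi, false).equals(write(withoutEi, true)));
        // The problematic case loses the delimiter: prints false
        System.out.println(write(withoutEi, false).contains("EI"));
    }
}
```

Whichever approach is chosen, parser and writer have to agree on it;
mixing an explicit EI token with a writer that also appends EI would
emit the delimiter twice.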
Furthermore, in our specialization of the PDFTextStripper, the
inability to access base-class properties was a limiting factor.
Is there a reason that the properties
org.apache.pdfbox.util.PDFTextStripper::startBookmarkPageNumber
org.apache.pdfbox.util.PDFTextStripper::endBookmarkPageNumber
org.apache.pdfbox.util.PDFTextStripper::pageArticles
org.apache.pdfbox.util.PDFTextStripper::characterListMapping
org.apache.pdfbox.util.PDFStreamEngine::streamResourcesStack
org.apache.pdfbox.util.PDFStreamEngine::page
really need to be private, or would it be restrictive enough to make
them protected so that they can be accessed in derived classes?
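As an illustration of the difference, here is a small sketch with
hypothetical stand-in classes (BaseStripper and CustomStripper are not
pdfbox classes, and the field values are arbitrary). A protected field
can be read directly in a derived class, whereas a private field is
only reachable through reflection, which is fragile because it depends
on the field name staying unchanged.

```java
import java.lang.reflect.Field;

// Hypothetical stand-in for a library base class; NOT the real PDFTextStripper.
class BaseStripper {
    private int startBookmarkPageNumber = 3;   // not visible in subclasses
    protected int endBookmarkPageNumber = 7;   // visible in subclasses
}

class CustomStripper extends BaseStripper {

    // A protected field can be read directly.
    int readProtected() {
        return endBookmarkPageNumber;
    }

    // A private field requires reflection, which breaks silently at
    // compile time if the base class renames or removes the field.
    int readPrivate() {
        try {
            Field field = BaseStripper.class.getDeclaredField("startBookmarkPageNumber");
            field.setAccessible(true);
            return field.getInt(this);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("base-class field not accessible", e);
        }
    }
}

public class FieldAccessSketch {
    public static void main(String[] args) {
        CustomStripper stripper = new CustomStripper();
        System.out.println(stripper.readProtected()); // prints 7
        System.out.println(stripper.readPrivate());   // prints 3
    }
}
```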
Best regards,
Lukas Schober
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]