[ https://issues.apache.org/jira/browse/PDFBOX-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr reopened PDFBOX-5152:
-------------------------------------

> Content Stream Appears Truncated in Specific File
> -------------------------------------------------
>
>                 Key: PDFBOX-5152
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5152
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 2.0.23
>            Reporter: Steven Fontaine
>            Priority: Minor
>
> I'm working on a [utility|https://github.com/acid1103/PDFInverter] to invert
> the colors of a PDF file. An
> [issue|https://github.com/acid1103/PDFInverter/issues/5] was raised, which
> provided a
> [PDF file|https://github.com/acid1103/PDFInverter/files/6260470/January.pdf],
> which when parsed by pdfbox, appears to give a truncated content stream. That
> is, running the following code results in a substantially shorter content
> stream than I would expect:
> {code:java}
> try (PDDocument doc = PDDocument.load(/* January.pdf */)) {
>     for (PDPage page: doc.getPages()) {
>         String stream = new String(IOUtils.toByteArray(page.getContents()),
>                 StandardCharsets.UTF_8);
>         System.out.println(stream);
>     }
> }
> {code}
> The code outputs the following:
> {noformat}
> q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q
> {noformat}
> I'll admit that I don't have the strongest of understandings of PDF content
> streams, but I can fairly confidently say that more than this is required to
> draw page 1 of the PDF.
> Additionally, you can deduce from the linked issue that, internally, pdfbox
> is making reference to additional data that isn't contained in the content
> stream returned from {{page.getContents()}}.
> In my program, I need to find specific substrings in the content stream to
> locate specific operations and their arguments.
> To do so, I
> [wrap {{PDFStreamParser.parseNextToken()}} with queries to {{PDFStreamParser.seqSource.getPosition()}}|https://github.com/acid1103/PDFInverter/blob/1af2e27f98e8251a31f5eefbbd0690caa7cdc23d/src/main/java/org/apache/pdfbox/pdfparser/PDFStreamColorSlicer.java#L52].
> I do so in order to get the bounds of a token in the content stream, without
> the need to parse it myself, (allowing {{parseNextToken}} to do the work for
> me.) When I look at the bounds which these queries give me, they extend
> further than the length of the content stream returned by
> {{page.getContents()}}.
> Specifically, one set of these bounds is (19, 313), inclusive. In other
> words, the token parsed by {{parseNextToken}} corresponds to characters
> 19-313 (inclusive, 0-based index) of the content stream. But the content
> stream returned by {{page.getContents()}} doesn't contain 313 characters.
> Hopefully someone can shed some light on this issue for me. Thanks!

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
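
A note on the output quoted in this report: in PDF content stream syntax, the sequence /Fm0 Do paints an external object (XObject) named Fm0, which is looked up in the page's resources. If Fm0 is a form XObject, which the name suggests but the report does not confirm, its drawing operators would live in a separate stream, and that might account for the short page stream. The sketch below is a hypothetical helper that does not use PDFBox at all (the class and method names are invented for illustration). It lists the names invoked with Do in a stream like the one printed in the report. It relies on a naive whitespace split, so it assumes the stream contains no string literals, comments, or inline image data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox: lists the names passed to the Do operator.
final class DoOperatorScan {

    // Naive whitespace tokenizer. Assumption: the stream has no string literals,
    // comments, or inline images, which holds for the short stream in this report.
    static List<String> findInvokedXObjects(String contentStream) {
        List<String> names = new ArrayList<>();
        String[] tokens = contentStream.trim().split("\\s+");
        for (int i = 1; i < tokens.length; i++) {
            // The operand of Do is the name token that immediately precedes it.
            if (tokens[i].equals("Do") && tokens[i - 1].startsWith("/")) {
                names.add(tokens[i - 1].substring(1)); // drop the leading slash
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // The stream text exactly as printed in the report.
        String stream = "q 0 0 0 rg 0 0 0 RG /GS0 gs /Fm0 Do Q";
        System.out.println(findInvokedXObjects(stream)); // prints [Fm0]
        System.out.println(stream.length());             // prints 37
    }
}
```

For the stream text as printed in the report, the only name invoked is Fm0, and the text is 37 characters long, well short of the offset 313 mentioned in the description. The length applies to the printed text only; the raw stream bytes may differ in whitespace.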