[ https://issues.apache.org/jira/browse/PDFBOX-1555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085197#comment-14085197 ]
John Hewson edited comment on PDFBOX-1555 at 8/4/14 8:01 PM:
-------------------------------------------------------------

This file is, roughly speaking, valid from an Acrobat perspective. The [Adobe Supplement to the ISO 32000|http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/adobe_supplement_iso32000.pdf] in _3.4.4 File Trailer_ says:
{quote}
Acrobat viewers require only that the %%EOF marker appear somewhere within the last 1024 bytes of the file.
{quote}

> Javascript at the end of the PDF document fails parsing
> -------------------------------------------------------
>
>                 Key: PDFBOX-1555
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1555
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.8.0
>            Reporter: Jinder Aujla
>         Attachments: 0001-MA-1981-Analyzer-Production-heitman.com-PDF-attachme.patch, 0002-MA-1981-Analyzer-Production-heitman.com-PDF-attachme.patch
>
>
> Hi
> I was investigating a failure to parse and debugging the pdfbox code when I noticed that the PDF document (which I can't forward) has this at the end of the file:
> %%EOF^M
> ^M
> ^M
> <script type="text/javascript">^M
> var gaJsHost = (("https:" == document.location.protocol) ? "https://ssl." : "http://www.");^M
> document.write(unescape("%3Cscript src='" + gaJsHost + "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));^M
> </script>^M
> <script type="text/javascript">^M
> try {^M
> var pageTracker = _gat._getTracker("UA-7429935-1");^M
> pageTracker._trackPageview();^M
> } catch(err) {}</script>^M
> ^M
> ^M
> So the document ends.. but there is more content.. basically some javascript.
> What the parser does is it gets to line 492 in org.apache.pdfbox.pdfparser.PDFParser, where isEndOfFile gets set to true, but because it's not the end of the actual stream.. it continues. This was a fix in PDFBOX-979.
> Next time around in the loop it reads
> <script type="text/javascript">
> which I think it ignores.. then tries to read
> var
> twice as a number. Then blows up.. so I've been playing around thinking of a sensible thing to do. But I'm worried that I might introduce some other issue. I assume this is a legal structure for a PDFDocument. It opens fine in a viewer.
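
For illustration only, here is a minimal sketch of the Acrobat rule quoted above: accept a file if the %%EOF marker lies within its last 1024 bytes, regardless of what trails it. This is not PDFBox code. The class name EofWindowCheck and the method hasEofInTail are hypothetical, and reading "within the last 1024 bytes" as "the whole marker fits inside that window" is an assumption.

```java
import java.nio.charset.StandardCharsets;

public class EofWindowCheck {
    // Window size taken from the Adobe supplement quote (last 1024 bytes).
    static final int TAIL_WINDOW = 1024;
    static final byte[] EOF_MARKER = "%%EOF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if %%EOF occurs entirely within the last TAIL_WINDOW bytes of data. */
    public static boolean hasEofInTail(byte[] data) {
        int start = Math.max(0, data.length - TAIL_WINDOW);
        // Scan backwards so the marker closest to the end is found first.
        for (int i = data.length - EOF_MARKER.length; i >= start; i--) {
            if (matchesAt(data, i)) {
                return true;
            }
        }
        return false;
    }

    // Compares the marker bytes against data starting at pos.
    private static boolean matchesAt(byte[] data, int pos) {
        for (int j = 0; j < EOF_MARKER.length; j++) {
            if (data[pos + j] != EOF_MARKER[j]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A tiny stand-in for the reported file: %%EOF followed by CRLFs and a script tag.
        byte[] pdf = ("%PDF-1.4\n%%EOF\r\n\r\n<script type=\"text/javascript\">\r\n"
                + "var x = 1;\r\n</script>\r\n").getBytes(StandardCharsets.US_ASCII);
        System.out.println(hasEofInTail(pdf)); // prints "true"
    }
}
```

A check like this only says whether the trailing junk is tolerable by Acrobat's rule. The parser would still need to stop reading objects once it has seen %%EOF, which is the part that fails in this issue.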