[jira] [Updated] (PDFBOX-1359) stack overflow~~ ExtractText (PDF2TXT)
[ https://issues.apache.org/jira/browse/PDFBOX-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

GloryKim updated PDFBOX-1359:
-----------------------------
    Summary: stack overflow~~ ExtractText (PDF2TXT)  (was: stack overflow~~)

> stack overflow~~ ExtractText (PDF2TXT)
> ---------------------------------------
>
>                 Key: PDFBOX-1359
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1359
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 1.7.0
>         Environment: Eclipse
>            Reporter: GloryKim
>            Priority: Critical
>         Attachments: 10946_2004_Article_340818.pdf
>
> java.io.IOException: Error: Could not find font(COSName{F1}) in
> map={F27=org.apache.pdfbox.pdmodel.font.PDType1Font@40bb2bc3,
> F8=org.apache.pdfbox.pdmodel.font.PDType1Font@40363068,
> F56=org.apache.pdfbox.pdmodel.font.PDType1Font@25a41cc7,
> F7=org.apache.pdfbox.pdmodel.font.PDType1Font@395d601f,
> F13=org.apache.pdfbox.pdmodel.font.PDType1Font@2151b0a5}
>     at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:57)
>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:238)
>     at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:77)
>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:238)
>     at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:77)
>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:238)
>     [the Invoke.process -> processOperator -> processSubStream cycle repeats until the stack overflows]

-- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: ConformingParser (PDFBOX-1000)
Hi,

Am 19.07.2012 13:02, schrieb Maruan Sahyoun:
> resuming work on PDFBOX-1000 I came across the question of how to maintain some state within the base components PDFLexer and SimpleParser (which is yet to come). E.g. in order to differentiate a number from an indirect object I potentially have to read three tokens {num} {gen} obj to check whether {num} is an individual number or the start of an indirect object. There are two ways to recover if I've read too many tokens and the number was in fact an individual object:
> a) depend on the file position, e.g. filePointer() and seek()
> b) maintain some internal state
> I currently tend to go for b) as this would remove the dependency on filePointer(), seek() or similar methods, but it means that if parsing has to start from a new point within the file, object etc. there needs to be some reset() call to reset the state. Also the caller, e.g. ConformingParser, has to make sure that there is some way to reposition the cursor. On the other hand, not being dependent on a specific position would enable the PDFLexer and SimpleParser to be extended to work on byte[] and similar.
> WDYT

Why not use o.a.p.io.RandomAccessRead? This interface can be implemented for all kinds of input material.

Best regards,
Timo

-- 
Timo Boehme
OntoChem GmbH
H.-Damerow-Str. 4
06120 Halle/Saale
T: +49 345 4780474
F: +49 345 4780471
timo.boe...@ontochem.com
_
OntoChem GmbH
Geschäftsführer: Dr. Lutz Weber
Sitz: Halle / Saale
Registergericht: Stendal
Registernummer: HRB 215461
_
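Timo's suggestion is to let the lexer depend on a small random-access abstraction rather than on a concrete file, so it can be implemented over files, byte arrays, and so on. The interface below only mirrors that idea in simplified form; its name and methods are illustrative assumptions, not the actual o.a.p.io.RandomAccessRead API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Minimal stand-in for a RandomAccessRead-style abstraction (hypothetical).
interface SimpleRandomAccessRead {
    int read() throws IOException;          // next byte, or -1 at EOF
    long getPosition() throws IOException;  // current cursor position
    void seek(long position) throws IOException;
}

/** In-memory implementation, showing the same interface works over byte[]. */
class ByteArrayRandomAccess implements SimpleRandomAccessRead {
    private final byte[] data;
    private int pos;

    ByteArrayRandomAccess(byte[] data) { this.data = data; }

    public int read() { return pos < data.length ? (data[pos++] & 0xFF) : -1; }
    public long getPosition() { return pos; }
    public void seek(long position) { pos = (int) position; }
}

public class RandomAccessDemo {
    public static void main(String[] args) throws IOException {
        SimpleRandomAccessRead in = new ByteArrayRandomAccess(
                "12 0 obj".getBytes(StandardCharsets.US_ASCII));
        long mark = in.getPosition();  // remember the position before lookahead
        int first = in.read();         // consume one byte ('1')
        in.seek(mark);                 // rewind: no lexer-internal state needed
        System.out.println((char) first + " re-read as " + (char) in.read());
    }
}
```

With such an abstraction, the lexer can mark a position, read ahead as far as it likes, and seek back on a mismatch, which is exactly the capability option a) in the quoted mail relies on.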
ConformingParser (PDFBOX-1000)
Hi there,

resuming work on PDFBOX-1000 I came across the question of how to maintain some state within the base components PDFLexer and SimpleParser (which is yet to come). E.g. in order to differentiate a number from an indirect object I potentially have to read three tokens {num} {gen} obj to check whether {num} is an individual number or the start of an indirect object. There are two ways to recover if I've read too many tokens and the number was in fact an individual object:

a) depend on the file position, e.g. filePointer() and seek()
b) maintain some internal state

I currently tend to go for b) as this would remove the dependency on filePointer(), seek() or similar methods, but it means that if parsing has to start from a new point within the file, object etc. there needs to be some reset() call to reset the state. Also the caller, e.g. ConformingParser, has to make sure that there is some way to reposition the cursor. On the other hand, not being dependent on a specific position would enable the PDFLexer and SimpleParser to be extended to work on byte[] and similar.

WDYT

Kind regards
Maruan Sahyoun
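Option b) above can be sketched as a token pushback buffer: the lexer keeps over-read tokens in an internal queue instead of seeking in the file, and clears that queue on reset(). The class and method names here (TokenBuffer, pushBack) are illustrative, not the PDFLexer API under discussion.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;

// Lookahead via internal state instead of file position (option b, sketch).
class TokenBuffer {
    private final Iterator<String> source;                  // underlying token stream
    private final Deque<String> pushedBack = new ArrayDeque<>();

    TokenBuffer(Iterator<String> source) { this.source = source; }

    String next() {
        return pushedBack.isEmpty()
                ? (source.hasNext() ? source.next() : null)
                : pushedBack.pop();
    }

    void pushBack(String token) { pushedBack.push(token); }

    /** Must be called if the caller repositions the underlying stream. */
    void reset() { pushedBack.clear(); }
}

public class LookaheadDemo {
    public static void main(String[] args) {
        // "12 0 obj" needs three tokens of lookahead to be recognized as an
        // indirect-object header; here the third token is not "obj", so the
        // over-read tokens are pushed back and "12" is a plain number.
        TokenBuffer buf = new TokenBuffer(
                Arrays.asList("12", "13", "endstream").iterator());
        String num = buf.next();       // "12"
        String gen = buf.next();       // "13" - could still be a generation number
        String kw  = buf.next();       // "endstream" - not "obj"
        if (!"obj".equals(kw)) {       // over-read: push tokens back, LIFO order
            buf.pushBack(kw);
            buf.pushBack(gen);
        }
        System.out.println(num + " is a plain number; next=" + buf.next());
        // prints: 12 is a plain number; next=13
    }
}
```

The cost noted in the mail shows up here: anything that repositions the underlying stream must also call reset(), or stale pushed-back tokens would be returned first.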
Object scanning (was: Re: Apache PDFBox July 2012 board report due)
Hi,

Am 19.07.2012 10:03, schrieb Maruan Sahyoun:
> maybe we can join forces here as I'm currently working on an Xref class which parses xref tables and xref streams. One method should also do the mentioned scanning.

Sure. I haven't started yet, thus we can discuss the details. What I had in mind was a fast scan for lines starting with an object start, endobj or endstream. With this we can detect missing endobj/endstream etc. Furthermore we can correct xref entries which are sometimes some bytes off. Embedded, not extra encoded PDFs can cause some trouble here, but as long as the embedding object and the embedded PDF are correct this can be handled; furthermore this method is only needed for broken PDFs, and most of them won't have such embedded PDFs.

Kind regards,
Timo

Am 19.07.2012 um 09:42 schrieb "Andreas Lehmkühler":
> Timo Boehme hat am 16. Juli 2012 um 18:02 geschrieben:
>> Am 16.07.2012 17:48, schrieb Andreas Lehmkuehler:
>>> Am 10.07.2012 09:16, schrieb Timo Boehme:
>>> ...
>> For the next time I plan to improve the robustness of the parser for broken documents by doing a first scan over the document (in case of parsing failure), collecting object start/end points and using them to repair the xref table.
>
> Seems to be necessary, at least for some PDFs. :-(
>
>> Another task I would like to do is reducing the amount of memory needed by using the existing file as input stream resource instead of copying an object stream first to a temporary buffer (in cases where an input file exists). Maybe for this we should change from assuming we have an input stream to assuming we have an input file, and if we have an input stream a temporary file is created on the fly - WDYT?
>
> I guess internally we have to use something abstract, and as everything is a stream that might be a good choice. AFAIU the current implementation, one reason for the usage of a temporary buffer is the fact that the data is modified (decompressing, decrypting) and we must not alter the input data. It is perhaps a better idea to somehow split the inputstream and the unfilteredinputstream, e.g. read from the inputstream every time an object is dereferenced and store the (decompressed) data in the corresponding object.
>
>> Kind regards,
>> Timo
>
> BR
> Andreas Lehmkühler

-- 
Timo Boehme
OntoChem GmbH
H.-Damerow-Str. 4
06120 Halle/Saale
T: +49 345 4780474
F: +49 345 4780471
timo.boe...@ontochem.com
_
OntoChem GmbH
Geschäftsführer: Dr. Lutz Weber
Sitz: Halle / Saale
Registergericht: Stendal
Registernummer: HRB 215461
_
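The scanning pass discussed in this thread can be sketched as a single walk over the raw bytes that records the offset of every "N G obj" header, so a broken xref table can be corrected from the recovered offsets. This is an illustration of the idea only, not the PDFBox implementation; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ObjectScanner {
    // Matches indirect-object headers such as "12 0 obj".
    private static final Pattern OBJ_HEADER =
            Pattern.compile("(\\d+)\\s+(\\d+)\\s+obj\\b");

    /** Maps "objNum genNum" to the byte offset where the header starts. */
    static Map<String, Integer> scan(byte[] pdf) {
        // ISO-8859-1 maps each byte to one char, so Matcher.start() equals
        // the byte offset; object headers are plain ASCII even in binary PDFs.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, Integer> offsets = new LinkedHashMap<>();
        Matcher m = OBJ_HEADER.matcher(text);
        while (m.find()) {
            offsets.put(m.group(1) + " " + m.group(2), m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] sample = ("1 0 obj\n<< /Type /Catalog >>\nendobj\n"
                       + "2 0 obj\n<< /Length 3 >>\nstream\nabc\nendstream\nendobj\n")
                       .getBytes(StandardCharsets.ISO_8859_1);
        // The recovered offsets could replace wrong xref entries.
        System.out.println(ObjectScanner.scan(sample));
    }
}
```

A real recovery pass would also track endobj/endstream to detect missing keywords, and would have to skip over stream bodies (the embedded-PDF case mentioned above) rather than matching blindly inside them.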
Re: Apache PDFBox July 2012 board report due
Hi,

maybe we can join forces here as I'm currently working on an Xref class which parses xref tables and xref streams. One method should also do the mentioned scanning.

Kind regards
Maruan Sahyoun

Am 19.07.2012 um 09:42 schrieb "Andreas Lehmkühler":
>
> Timo Boehme hat am 16. Juli 2012 um 18:02 geschrieben:
>
>> Hi,
>>
>> Am 16.07.2012 17:48, schrieb Andreas Lehmkuehler:
>>> Am 10.07.2012 09:16, schrieb Timo Boehme:
>>>> ...
>>>> looks good to me. Some mention about the preflight module which will be integrated in the next major release?
>>> Thanks for your comment. I added some information about preflight/xmpbox as you maybe already have seen.
>>
>> Yes, thank you very much for all the time spent on administrative tasks/improvements on PDFBOX.
>>
>> For the next time I plan to improve the robustness of the parser for broken documents by doing a first scan over the document (in case of parsing failure), collecting object start/end points and using them to repair the xref table.
>
> Seems to be necessary, at least for some PDFs. :-(
>
>> Another task I would like to do is reducing the amount of memory needed by using the existing file as input stream resource instead of copying an object stream first to a temporary buffer (in cases where an input file exists).
>> Maybe for this we should change from assuming we have an input stream to assuming we have an input file, and if we have an input stream a temporary file is created on the fly - WDYT?
>
> I guess internally we have to use something abstract, and as everything is a stream that might be a good choice. AFAIU the current implementation, one reason for the usage of a temporary buffer is the fact that the data is modified (decompressing, decrypting) and we must not alter the input data. It is perhaps a better idea to somehow split the inputstream and the unfilteredinputstream, e.g. read from the inputstream every time an object is dereferenced and store the (decompressed) data in the corresponding object.
>
>> Kind regards,
>> Timo
>
> BR
> Andreas Lehmkühler
Re: Apache PDFBox July 2012 board report due
Timo Boehme hat am 16. Juli 2012 um 18:02 geschrieben:
> Hi,
>
> Am 16.07.2012 17:48, schrieb Andreas Lehmkuehler:
>> Am 10.07.2012 09:16, schrieb Timo Boehme:
>>> ...
>>> looks good to me. Some mention about the preflight module which will be integrated in the next major release?
>> Thanks for your comment. I added some information about preflight/xmpbox as you maybe already have seen.
>
> Yes, thank you very much for all the time spent on administrative tasks/improvements on PDFBOX.
>
> For the next time I plan to improve the robustness of the parser for broken documents by doing a first scan over the document (in case of parsing failure), collecting object start/end points and using them to repair the xref table.

Seems to be necessary, at least for some PDFs. :-(

> Another task I would like to do is reducing the amount of memory needed by using the existing file as input stream resource instead of copying an object stream first to a temporary buffer (in cases where an input file exists).
> Maybe for this we should change from assuming we have an input stream to assuming we have an input file, and if we have an input stream a temporary file is created on the fly - WDYT?

I guess internally we have to use something abstract, and as everything is a stream that might be a good choice. AFAIU the current implementation, one reason for the usage of a temporary buffer is the fact that the data is modified (decompressing, decrypting) and we must not alter the input data. It is perhaps a better idea to somehow split the inputstream and the unfilteredinputstream, e.g. read from the inputstream every time an object is dereferenced and store the (decompressed) data in the corresponding object.

> Kind regards,
> Timo

BR
Andreas Lehmkühler
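The split between the input stream and the unfiltered input stream that Andreas describes can be sketched as follows: keep only the raw (filtered) bytes per object, never altering them, and produce the decompressed view lazily, caching it on first use. All names here (LazyStreamObject etc.) are illustrative, and plain zlib deflate stands in for PDF's FlateDecode filter.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

class LazyStreamObject {
    private final byte[] rawData;   // bytes exactly as stored in the file
    private byte[] unfiltered;      // decoded copy, created on demand

    LazyStreamObject(byte[] rawData) { this.rawData = rawData; }

    InputStream getRawStream() {
        return new ByteArrayInputStream(rawData);
    }

    synchronized InputStream getUnfilteredStream() throws IOException {
        if (unfiltered == null) {   // decode once, then serve from the cache
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (InputStream in = new InflaterInputStream(getRawStream())) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
            }
            unfiltered = out.toByteArray();
        }
        return new ByteArrayInputStream(unfiltered);
    }
}

public class LazyStreamDemo {
    public static void main(String[] args) throws IOException {
        // Build a small deflate payload to stand in for on-disk stream data.
        ByteArrayOutputStream deflated = new ByteArrayOutputStream();
        try (DeflaterOutputStream d = new DeflaterOutputStream(deflated)) {
            d.write("BT /F1 12 Tf ET".getBytes(StandardCharsets.US_ASCII));
        }
        LazyStreamObject obj = new LazyStreamObject(deflated.toByteArray());
        System.out.println("decoded " + obj.getUnfilteredStream().available() + " bytes");
    }
}
```

Until getUnfilteredStream() is first called, only the raw bytes are held, which is the memory saving the thread is after; decryption would slot into the same decode step.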
[jira] [Commented] (PDFBOX-1000) Conforming parser
[ https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418137#comment-13418137 ]

Maruan Sahyoun commented on PDFBOX-1000:
----------------------------------------
I added a new version of the PDFLexer. Changes:
a) the PDFLexer now uses InputStream as the PDF source. This makes it possible to use the new IO classes in o.a.pdfbox.io.
b) refactored the PDFLexer so the only IO operation used is read()
c) a drawback is that one needs to call reset() if the position in the stream is changed by a seek operation, in order to clear the internal state
d) the StringBuilder is now reused instead of being recreated for every new token

> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: COSUnread.java, ConformingPDDocument.java, ConformingPDFParser.java, ConformingPDFParserTest.java, PDFLexer.java, PDFLexer.java, PDFStreamConstants.java, PDFStreamConstants.java, XrefEntry.java, conforming-parser.patch, gdb-refcard.pdf
>
> A conforming parser will start at the end of the file and read backward until it has read the EOF marker, the xref location, and the trailer [1]. Once this is read, it will read in the xref table so it can locate other objects and revisions. This also allows skipping objects which have been rendered obsolete (per the xref table) [2]. It also allows the minimum amount of information to be read when the file is loaded; subsequent information will then be loaded if and when it is requested. This is all laid out in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new classes in order to accommodate the lazy reading, which is a very different paradigm from the existing parser. Using separate classes will also eliminate the possibility of regression bugs making their way into the PDDocument or BaseParser classes. Changes to existing classes will be kept to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular object"
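Changes (c) and (d) from the comment above, the reset() contract and the StringBuilder reuse, reduce to a sketch like the following. TinyLexer is an illustrative stand-in, not the attached PDFLexer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class TinyLexer {
    private final InputStream in;
    private final StringBuilder token = new StringBuilder(); // reused per token (d)
    private int pushedBackByte = -1;                         // internal state (c)

    TinyLexer(InputStream in) { this.in = in; }

    /** Clears internal state; must be called after the caller seeks. */
    void reset() {
        token.setLength(0);
        pushedBackByte = -1;
    }

    /** Returns the next whitespace-delimited token, or null at EOF. */
    String nextToken() throws IOException {
        token.setLength(0);               // reuse the builder, don't reallocate
        int c = (pushedBackByte != -1) ? pushedBackByte : in.read();
        pushedBackByte = -1;
        while (c != -1 && Character.isWhitespace(c)) {
            c = in.read();                // skip leading whitespace
        }
        while (c != -1 && !Character.isWhitespace(c)) {
            token.append((char) c);
            c = in.read();                // the only IO operation used (b)
        }
        return token.length() > 0 ? token.toString() : null;
    }
}

public class LexerDemo {
    public static void main(String[] args) throws IOException {
        TinyLexer lexer = new TinyLexer(new ByteArrayInputStream(
                "12 0 obj".getBytes(StandardCharsets.US_ASCII)));
        System.out.println(lexer.nextToken());  // 12
        System.out.println(lexer.nextToken());  // 0
        System.out.println(lexer.nextToken());  // obj
    }
}
```

Because the only IO call is read(), the lexer works on any InputStream; the price is that a caller who repositions the stream must call reset(), or the stale pushed-back byte would leak into the next token.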
[jira] [Updated] (PDFBOX-1000) Conforming parser
[ https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Maruan Sahyoun updated PDFBOX-1000:
-----------------------------------
    Attachment: PDFStreamConstants.java
                PDFLexer.java

New version of the PDFLexer.