Hi, PDFBox targets ISO-32000.
BR
Maruan

> On 20.11.2013 at 19:29, Rodrigo Caniçali <[email protected]> wrote:
>
> Thomas,
>
> I found several PDF specifications on the net.
>
> Please, which PDF specification does the PDFBox library follow?
>
> Thanks,
>
> Rodrigo
>
>
>
> On Thursday, 14 November 2013 11:30, Rodrigo Caniçali <[email protected]> wrote:
>
> Hi Thomas,
>
> There is no such object anywhere in the document. Searching for the keywords
> "/XRef" or "80 0", the editor cannot find them. However, I found the
> following at the end of the document:
>
> xref
> 0 47
> 0000000000 65535 f
> 0000000009 00000 n
> 0000052584 00000 n
> 0000052633 00000 n
> 0000009275 00000 n
> 0000000199 00000 n
> 0000003543 00000 n
> ....
> 0000052345 0000 n
>
> trailer
> <<
>
> /Size 47
> /Root 2 0 R
> /Info 1 0 R
> startxref
> 52279
> %%EOF
>
> Changing the reference 52279 to 53730, which is the address of "xref",
> seems to have solved the xref table position error.
>
> But the following warnings are still displayed and some text is still not
> extracted:
>
> Loading PDF D:\Documents and Settings\05215385726\rpf_tributos.pdf
> Time for loading: 0.094 seconds
> Starting text extraction
> Writing to D:\Documents and Settings\05215385726\rpf_tributos.txt
> Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: o
> Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: Os
> Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: a
> Nov 14, 2013 11:24:36 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
> INFO: unsupported/disabled operation: su
>
>
> Also, with the "-nonSeq" option enabled, the error below is displayed:
>
> Loading PDF D:\Documents and Settings\05215385726\Meus documentos\rpf_tributos.pdf
> Exception in thread "main" java.io.IOException: Error: Expected a long type, actual='K`_'
>     at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1668)
>     at org.apache.pdfbox.pdfparser.BaseParser.readObjectNumber(BaseParser.java:1598)
>     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1183)
>     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseObjectDynamically(NonSequentialPDFParser.java:1130)
>     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:420)
>     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:702)
>     at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1139)
>     at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:208)
>     at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
>
>
> I wonder if I could write a routine to fix a document like this before
> parsing it with PDFBox, since it can be parsed by Acrobat Reader.
>
> Thanks,
>
> Rodrigo
>
>
>
> On Wednesday, 13 November 2013 19:49, Thomas Chojecki <[email protected]> wrote:
>
> Hi Rodrigo,
> it looks like the startxref position (52779) is wrong and points into a
> stream instead of at the beginning of an xref table or stream. The value
> inside the exception shows a compressed string, and it might be the
> xref stream.
>
> You can open a hex editor, jump directly to position 52779 and
> look for an object that looks something like this:
>
> ,---
>
> 80 0 obj <<
> /Type /XRef
> /Index [0 424]
> /Size 424
> /W [1 3 1]
> /Root 421 0 R
> /Info 422 0 R
> /ID [<14895AE8C3218939710EBBFF5EAD0E28> <14895AE8C3218939710EBBFF5EAD0E28>]
> /Length 1073
> /Filter /FlateDecode
> stream
> ...
> endstream
> endobj
>
> `---
>
> If you find this object with the /Type /XRef, go to the beginning of it
> (in this case the "80 0 obj") and write down the position of this object.
> Then go to the end of the file, overwrite the startxref position 52779
> with the position you noted, and try to parse the document again.
>
> This should work, and it would indicate that the PDF creator you are using
> writes wrong object positions. PDFBox can only parse documents that
> provide correct xref tables / streams; otherwise the parser does not
> know how to handle the document.
>
> Best regards
> Thomas
>
>
>
> Quoting Rodrigo Caniçali <[email protected]>:
>
>> Hi Thomas,
>>
>> Below is the stacktrace when the option "-nonSeq" is enabled:
>>
>> Loading PDF D:\Documents and Settings\05215385726\Meus documentos\rpf_tributos.pdf
>> Exception in thread "main" java.io.IOException: Error: Expected a long type, actual='!@:g8lJLDX5I'H%oMioAqC?O$d[,X]%dZ#a?Wos'
>>     at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1668)
>>     at org.apache.pdfbox.pdfparser.BaseParser.readObjectNumber(BaseParser.java:1598)
>>     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXrefObjStream(NonSequentialPDFParser.java:460)
>>     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:358)
>>     at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:702)
>>     at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1139)
>>     at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:208)
>>     at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
>>
>>
>> When that option is disabled, the following warnings are printed on the
>> Eclipse console and some text of the PDF document is not extracted:
>>
>> Loading PDF D:\Documents and Settings\05215385726\Meus documentos\rpf_tributos.pdf
>> Nov 04, 2013 10:16:13 AM org.apache.pdfbox.pdfparser.XrefTrailerResolver setStartxref
>> WARNING: Did not found XRef object at specified startxref position 52779
>> Time for loading: 0.125 seconds
>> Starting text extraction
>> Writing to D:\Documents and Settings\05215385726\Meus documentos\rpf_tributos.txt
>> Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
>> INFO: unsupported/disabled operation: o
>> Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
>> INFO: unsupported/disabled operation: Os
>> Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
>> INFO: unsupported/disabled operation: a
>> Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
>> INFO: unsupported/disabled operation: su
>>
>> Thanks,
>>
>> Rodrigo
>>
>>
>>
>> On Saturday, 2 November 2013 10:24, Rodrigo Caniçali <[email protected]> wrote:
>>
>> Hi Thomas,
>>
>> Thanks for your answer.
>>
>> I am afraid the document is confidential, but I can provide the
>> stacktrace and find out whether it is possible to generate a
>> non-confidential example on Monday, when I will be at the office again.
>>
>> Best regards,
>> Rodrigo
>>
>>
>>
>> On Saturday, 2 November 2013 5:50, Thomas Chojecki <[email protected]> wrote:
>>
>>
>> Quoting Rodrigo Caniçali <[email protected]>:
>>
>>> Hi,
>> Hi Rodrigo,
>>
>>> I found on a mailing list of 2012-jun-14 that this problem has
>>> already been discussed, but my case is pretty different.
>> I think I found the discussion.
>>
>>> I also get the warning "Did not found XRef object at specified
>>> startxref position xxx" when executing the main function
>>> of the org.apache.pdfbox.ExtractText class. However, some PDF texts are
>>> ignored and are not printed to the output TXT file. These same texts
>>> are displayed by Acrobat Reader and can be copied by the user as
>>> text from that program.
>>
>> Your document is broken. It works with Acrobat Reader because Acrobat
>> Reader isn't strict about the specification.
>>
>> Many developers who write a PDF writer test it against Acrobat Reader
>> and do not always follow the specification. So the goal becomes producing
>> documents that Acrobat Reader accepts rather than documents that conform
>> to the specification. This leads to the problem that third-party libraries
>> like PDFBox sometimes can't parse such documents.
>>
>> In your case the xref table isn't where the parser expects it. If you
>> can provide us with such a document, we can try to find the cause of
>> the problem and maybe fix it.
>>
>>>
>>> If the option "-nonSeq" is selected, a
>>> "java.io.IOException: Error: Expected a long type, actual=..." appears,
>>> which stops the text extraction.
>> Maybe you can post the first three lines of the stacktrace; this
>> will help with debugging the problem.
>>
>>> Please, is there any way to make it work?
>> It is nearly impossible to reconstruct such cases. If you can provide
>> us with more information or maybe the document, it will help us improve
>> the parser, if possible.
>>
>> We do our best to support as many documents as we can, but in some
>> cases we need to be strict so that documents that already parse fine
>> keep working. This problem is also on the agenda for PDFBox 2.0.0.
>>
>>
>>>
>>> Thanks,
>>>
>>> Rodrigo
>>
>> Best regards
>> Thomas
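
A note on the question raised in the thread about a routine that checks a document before handing it to PDFBox: the manual hex-editor procedure described above (compare the number after "startxref" with the real position of the cross-reference section) could be sketched roughly as follows. This is only an illustration under stated assumptions, not part of PDFBox: the class name StartXrefCheck and all method names are hypothetical, it uses plain Java with no PDFBox calls, it only looks for a classic "xref" table (not an xref stream object with /Type /XRef), and a plain byte search can be fooled if the bytes "xref" happen to occur inside binary stream data.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not a PDFBox class: reports whether the offset written
// after "startxref" really points at a classic "xref" table.
class StartXrefCheck {

    // Last index <= from at which needle occurs in data, or -1 if none.
    static int lastIndexOf(byte[] data, byte[] needle, int from) {
        for (int i = Math.min(from, data.length - needle.length); i >= 0; i--) {
            boolean match = true;
            for (int j = 0; j < needle.length; j++) {
                if (data[i + j] != needle[j]) { match = false; break; }
            }
            if (match) return i;
        }
        return -1;
    }

    // Number written after the last "startxref" keyword, or -1 if not found.
    static long declaredStartXref(byte[] pdf) {
        byte[] kw = "startxref".getBytes(StandardCharsets.US_ASCII);
        int pos = lastIndexOf(pdf, kw, pdf.length);
        if (pos < 0) return -1;
        int i = pos + kw.length;
        // Skip the line break(s) or spaces between the keyword and the number.
        while (i < pdf.length && (pdf[i] == ' ' || pdf[i] == '\r' || pdf[i] == '\n' || pdf[i] == '\t')) i++;
        long value = 0;
        boolean sawDigit = false;
        while (i < pdf.length && pdf[i] >= '0' && pdf[i] <= '9') {
            value = value * 10 + (pdf[i] - '0');
            sawDigit = true;
            i++;
        }
        return sawDigit ? value : -1;
    }

    // Offset of the last "xref" keyword that is not part of "startxref", or -1.
    static int actualXrefTable(byte[] pdf) {
        byte[] kw = "xref".getBytes(StandardCharsets.US_ASCII);
        int from = pdf.length;
        while (true) {
            int pos = lastIndexOf(pdf, kw, from);
            if (pos < 0) return -1;
            boolean insideStartxref = pos >= 5
                    && new String(pdf, pos - 5, 5, StandardCharsets.US_ASCII).equals("start");
            if (!insideStartxref) return pos;
            from = pos - 1; // keep searching before the "startxref" keyword
        }
    }

    // True if the bytes at the given offset spell "xref".
    static boolean pointsAtXref(byte[] pdf, long offset) {
        if (offset < 0 || offset + 4 > pdf.length) return false;
        return new String(pdf, (int) offset, 4, StandardCharsets.US_ASCII).equals("xref");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StartXrefCheck <file.pdf>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        long declared = declaredStartXref(pdf);
        System.out.println("declared startxref:    " + declared);
        System.out.println("points at xref table:  " + pointsAtXref(pdf, declared));
        System.out.println("last xref keyword at:  " + actualXrefTable(pdf));
    }
}
```

The sketch only reports the two positions and changes nothing. If they differ, the number after "startxref" would still have to be corrected in a copy of the file, as described in the thread, and any result should be confirmed in a hex editor before relying on it.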

