[jira] [Updated] (PDFBOX-1359) stack overflow~~ ExtractText (PDF2TXT)

2012-07-19 Thread GloryKim (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

GloryKim updated PDFBOX-1359:
-

Summary: stack overflow~~  ExtractText (PDF2TXT)   (was: stack overflow~~ )

> stack overflow~~  ExtractText (PDF2TXT) 
> 
>
> Key: PDFBOX-1359
> URL: https://issues.apache.org/jira/browse/PDFBOX-1359
> Project: PDFBox
>  Issue Type: Bug
>  Components: Utilities
>Affects Versions: 1.7.0
> Environment: Eclipse
>Reporter: GloryKim
>Priority: Critical
> Attachments: 10946_2004_Article_340818.pdf
>
>
>  java.io.IOException: Error: Could not find font(COSName{F1}) in 
> map={F27=org.apache.pdfbox.pdmodel.font.PDType1Font@40bb2bc3, 
> F8=org.apache.pdfbox.pdmodel.font.PDType1Font@40363068, 
> F56=org.apache.pdfbox.pdmodel.font.PDType1Font@25a41cc7, 
> F7=org.apache.pdfbox.pdmodel.font.PDType1Font@395d601f, 
> F13=org.apache.pdfbox.pdmodel.font.PDType1Font@2151b0a5}
>   at 
> org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:57)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:238)
>   at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:77)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:238)
>   at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:77)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:238)
>   at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:77)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:238)
>   at org.apache.pdfbox.util.operator.Invoke.process(Invoke.java:77)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:562)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:274)
>   at 
> org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:238)

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: ConformingParser (PDFBOX-1000)

2012-07-19 Thread Timo Boehme

Hi,

On 19.07.2012 13:02, Maruan Sahyoun wrote:

Resuming work on PDFBOX-1000, I came across the question of how to maintain 
some state within the base components PDFLexer and SimpleParser (which has yet 
to come).

E.g. in order to differentiate a number from an indirect object, I potentially 
have to read three tokens {num} {gen} obj to check whether {num} is an 
individual number or the start of an indirect object. There are two ways to 
recover if I've read too many tokens and the number was in fact an individual 
object:

a) depend on the file position, e.g. filePointer() and seek()
b) maintain some internal state

I currently tend to go for b), as this would remove the dependency on 
filePointer() and seek() or similar methods. But that means that if parsing 
has to start from a new point within the file, object etc., there needs to be 
some reset() call to reset the state. Also, the caller, e.g. ConformingParser, 
has to make sure that there is some way to reposition the cursor. On the other 
hand, not being dependent on a specific position would enable the PDFLexer and 
SimpleParser to be extended to work on byte[] and similar.

WDYT


Why not use o.a.p.io.RandomAccessRead? This interface can be 
implemented for all kinds of input material.



Best regards,

Timo


--

 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 F: +49 345 4780471
 timo.boe...@ontochem.com

_

 OntoChem GmbH
 Geschäftsführer: Dr. Lutz Weber
 Sitz: Halle / Saale
 Registergericht: Stendal
 Registernummer: HRB 215461
_
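The abstraction Timo suggests can be pictured with a small sketch. The interface below is an assumption for illustration only: it mimics the shape of a seekable reader, but it is not the actual o.a.p.io.RandomAccessRead API.

```java
// Sketch of a RandomAccessRead-style abstraction (interface and names are
// assumptions for illustration, not the actual org.apache.pdfbox.io API).
public class SeekableReadDemo {

    // Minimal seekable-input contract a lexer could be written against.
    interface SeekableRead {
        int read();               // next byte as 0..255, or -1 at end of input
        long getPosition();       // current offset
        void seek(long position); // reposition for the next read()
    }

    // byte[]-backed implementation; a file-backed one would wrap a
    // RandomAccessFile in the same way.
    static class ByteArrayRead implements SeekableRead {
        private final byte[] data;
        private int pos;
        ByteArrayRead(byte[] data) { this.data = data; }
        public int read() { return pos < data.length ? (data[pos++] & 0xFF) : -1; }
        public long getPosition() { return pos; }
        public void seek(long position) { pos = (int) position; }
    }

    public static void main(String[] args) {
        SeekableRead in = new ByteArrayRead("1 0 obj".getBytes());
        int first = in.read();                  // '1'
        in.seek(0);                             // rewind
        System.out.println(first == in.read()); // the same byte is read again
    }
}
```

A lexer written against such a contract works on files, byte arrays, or any other seekable input without caring which one it is.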



ConformingParser (PDFBOX-1000)

2012-07-19 Thread Maruan Sahyoun
Hi there,

Resuming work on PDFBOX-1000, I came across the question of how to maintain 
some state within the base components PDFLexer and SimpleParser (which has yet 
to come).

E.g. in order to differentiate a number from an indirect object, I potentially 
have to read three tokens {num} {gen} obj to check whether {num} is an 
individual number or the start of an indirect object. There are two ways to 
recover if I've read too many tokens and the number was in fact an individual 
object:

a) depend on the file position, e.g. filePointer() and seek()
b) maintain some internal state

I currently tend to go for b), as this would remove the dependency on 
filePointer() and seek() or similar methods. But that means that if parsing 
has to start from a new point within the file, object etc., there needs to be 
some reset() call to reset the state. Also, the caller, e.g. ConformingParser, 
has to make sure that there is some way to reposition the cursor. On the other 
hand, not being dependent on a specific position would enable the PDFLexer and 
SimpleParser to be extended to work on byte[] and similar.

WDYT

Kind regards

Maruan Sahyoun


Object scanning (was: Re: Apache PDFBox July 2012 board report due)

2012-07-19 Thread Timo Boehme

Hi

On 19.07.2012 10:03, Maruan Sahyoun wrote:


Maybe we can join forces here, as I'm currently working on an Xref
class which parses xref tables and xref streams. One method should
also do the mentioned scanning.


Sure. I haven't started yet, so we can discuss the details. What I had 
in mind was a fast scan for lines starting with an object start, endobj, or 
endstream. With this we can detect missing endobj/endstream etc. 
Furthermore, we can correct xref entries, which are sometimes off by a few 
bytes. Embedded PDFs that are not additionally encoded can cause some trouble 
here, but as long as the embedding object and the embedded PDF are correct 
this can be handled. Furthermore, this method is only needed for broken PDFs, 
and most of them won't have such embedded PDFs.
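The scan described above could be sketched roughly as follows. This is illustrative only, not PDFBox code; the class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the proposed recovery scan: walk the raw bytes once and record
// where "<num> <gen> obj" headers start, so a repair step could rebuild
// xref offsets from the hits (endobj/endstream could be scanned the same way).
public class ObjectScanDemo {

    static final Pattern OBJ = Pattern.compile("(\\d+)\\s+(\\d+)\\s+obj");

    // Returns the byte offsets at which an indirect object header starts.
    static List<Integer> scanObjectOffsets(byte[] data) {
        // Latin-1 gives a 1:1 byte-to-char mapping, so char indices
        // equal byte offsets.
        String s = new String(data, StandardCharsets.ISO_8859_1);
        List<Integer> offsets = new ArrayList<>();
        Matcher m = OBJ.matcher(s);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] pdf = "1 0 obj\n<< /Length 3 >>\nendobj\n2 0 obj\n(hi)\nendobj\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(scanObjectOffsets(pdf)); // [0, 31]
    }
}
```

A real repair pass would also have to handle object headers inside streams and strings, which this naive regex scan does not.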



Kind regards,

Timo



On 19.07.2012 at 09:42, "Andreas Lehmkühler" wrote:

Timo Boehme wrote on 16 July 2012 at 18:02:

On 16.07.2012 17:48, Andreas Lehmkuehler wrote:

On 10.07.2012 09:16, Timo Boehme wrote:

...

For the near future I plan to improve the parser's robustness against broken
documents by doing a first scan over the document (in case of a parsing
failure), collecting object start/end points and using them to repair the
xref table.


Seems to be necessary, at least for some PDFs. :-(


Another task I would like to tackle is reducing the amount of memory needed
by using the existing file as the input stream resource instead of first
copying an object stream to a temporary buffer (in cases where an input
file exists).
Maybe for this we should change from assuming we have an input stream to
assuming we have an input file, and if we have an input stream, create a
temporary file on the fly - WDYT?


I guess internally we have to use something abstract, and as everything is a
stream, that might be a good choice. AFAIU the current implementation, one
reason for the usage of a temporary buffer is the fact that the data is
modified (decompressing, decrypting) and we must not alter the input data. It
is perhaps a better idea to somehow split the input stream and the unfiltered
input stream, e.g. read from the input stream every time an object is
dereferenced and store the (decompressed) data in the corresponding object.




Kind regards,
Timo



BR
Andreas Lehmkühler



--

 Timo Boehme
 OntoChem GmbH
 H.-Damerow-Str. 4
 06120 Halle/Saale
 T: +49 345 4780474
 F: +49 345 4780471
 timo.boe...@ontochem.com

_

 OntoChem GmbH
 Geschäftsführer: Dr. Lutz Weber
 Sitz: Halle / Saale
 Registergericht: Stendal
 Registernummer: HRB 215461
_



Re: Apache PDFBox July 2012 board report due

2012-07-19 Thread Maruan Sahyoun
Hi,

Maybe we can join forces here, as I'm currently working on an Xref class which 
parses xref tables and xref streams. One method should also do the mentioned 
scanning.

Kind regards

Maruan Sahyoun

On 19.07.2012 at 09:42, "Andreas Lehmkühler" wrote:

> 
> Timo Boehme wrote on 16 July 2012 at 18:02:
> 
>> Hi,
>> 
>> On 16.07.2012 17:48, Andreas Lehmkuehler wrote:
>>> On 10.07.2012 09:16, Timo Boehme wrote:
>>>> ...
>>>> looks good to me. Some mention about the preflight module which will be
>>>> integrated in the next major release?
>>> Thanks for your comment. I added some information about preflight/xmpbox
>>> as you maybe already have seen.
>> 
>> Yes, thank you very much for all the time spent on administrative
>> tasks/improvements on PDFBox.
>> 
>> For the near future I plan to improve the parser's robustness against broken
>> documents by doing a first scan over the document (in case of a parsing
>> failure), collecting object start/end points and using them to repair the
>> xref table.
> 
> 
> Seems to be necessary, at least for some PDFs. :-(
> 
> 
>> Another task I would like to tackle is reducing the amount of memory needed
>> by using the existing file as the input stream resource instead of first
>> copying an object stream to a temporary buffer (in cases where an input
>> file exists).
>> Maybe for this we should change from assuming we have an input stream to
>> assuming we have an input file, and if we have an input stream, create a
>> temporary file on the fly - WDYT?
> 
> 
> I guess internally we have to use something abstract, and as everything is a
> stream, that might be a good choice. AFAIU the current implementation, one
> reason for the usage of a temporary buffer is the fact that the data is
> modified (decompressing, decrypting) and we must not alter the input data. It
> is perhaps a better idea to somehow split the input stream and the unfiltered
> input stream, e.g. read from the input stream every time an object is
> dereferenced and store the (decompressed) data in the corresponding object.
> 
>> 
>> 
>> Kind regards,
>> Timo
> 
> 
> BR
> Andreas Lehmkühler


Re: Apache PDFBox July 2012 board report due

2012-07-19 Thread Andreas Lehmkühler

Timo Boehme wrote on 16 July 2012 at 18:02:

> Hi,
>
> On 16.07.2012 17:48, Andreas Lehmkuehler wrote:
> > On 10.07.2012 09:16, Timo Boehme wrote:
> >> ...
> >> looks good to me. Some mention about the preflight module which will be
> >> integrated in the next major release?
> > Thanks for your comment. I added some information about preflight/xmpbox
> > as you maybe already have seen.
>
> Yes, thank you very much for all the time spent on administrative
> tasks/improvements on PDFBox.
>
> For the near future I plan to improve the parser's robustness against broken
> documents by doing a first scan over the document (in case of a parsing
> failure), collecting object start/end points and using them to repair the
> xref table.


Seems to be necessary, at least for some PDFs. :-(


> Another task I would like to tackle is reducing the amount of memory needed
> by using the existing file as the input stream resource instead of first
> copying an object stream to a temporary buffer (in cases where an input
> file exists).
> Maybe for this we should change from assuming we have an input stream to
> assuming we have an input file, and if we have an input stream, create a
> temporary file on the fly - WDYT?


I guess internally we have to use something abstract, and as everything is a
stream, that might be a good choice. AFAIU the current implementation, one
reason for the usage of a temporary buffer is the fact that the data is
modified (decompressing, decrypting) and we must not alter the input data. It
is perhaps a better idea to somehow split the input stream and the unfiltered
input stream, e.g. read from the input stream every time an object is
dereferenced and store the (decompressed) data in the corresponding object.

>
>
> Kind regards,
> Timo


BR
Andreas Lehmkühler
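The decompress-on-dereference idea Andreas describes could be sketched like this. It is an illustration under assumed names, not actual PDFBox code: the raw filtered bytes are never altered, and the unfiltered data is produced only when the object is first dereferenced, then cached on the object.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Sketch: keep the raw (filtered) bytes untouched; inflate lazily on first
// dereference and cache the result, instead of copying to a temp buffer upfront.
public class LazyStreamDemo {

    static class LazyObject {
        private final byte[] rawDeflated; // as stored in the file; never modified
        private byte[] unfiltered;        // filled on first dereference

        LazyObject(byte[] rawDeflated) { this.rawDeflated = rawDeflated; }

        byte[] dereference() throws Exception {
            if (unfiltered == null) {
                Inflater inflater = new Inflater();
                inflater.setInput(rawDeflated);
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[512];
                while (!inflater.finished()) {
                    out.write(buf, 0, inflater.inflate(buf));
                }
                unfiltered = out.toByteArray();
            }
            return unfiltered;
        }
    }

    // Helper: deflate bytes the way a FlateDecode stream would be stored.
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[512];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        LazyObject obj = new LazyObject(deflate("BT /F1 12 Tf ET".getBytes()));
        System.out.println(new String(obj.dereference())); // BT /F1 12 Tf ET
    }
}
```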

[jira] [Commented] (PDFBOX-1000) Conforming parser

2012-07-19 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418137#comment-13418137
 ] 

Maruan Sahyoun commented on PDFBOX-1000:


I added a new version of the PDFLexer.
Changes:
a) the PDFLexer now uses an InputStream as the PDF source. This makes it 
possible to use the new IO classes in o.a.pdfbox.io.
b) refactored the PDFLexer so that the only I/O operation used is read()
c) the drawback is that one needs to call reset() if the position in the 
stream is changed by a seek operation, in order to clear the internal state
d) the StringBuilder is now reused instead of being recreated for every new token
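A minimal sketch of the lexer shape described in a)-d). Class and method names here are assumptions for illustration, not the attached PDFLexer: the only I/O call is read(), one byte of lookahead is kept internally, the token buffer is reused, and reset() must clear the internal state after the caller repositions the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class TokenLexerDemo {

    private InputStream in;
    private int peeked = -1;                                 // one byte of lookahead
    private final StringBuilder token = new StringBuilder(); // reused per token

    TokenLexerDemo(InputStream in) { this.in = in; }

    // Must be called after the caller seeks the underlying source; otherwise
    // the stale lookahead byte would leak into the next token.
    void reset(InputStream repositioned) {
        this.in = repositioned;
        this.peeked = -1;
    }

    String nextToken() throws IOException {
        token.setLength(0);                      // reuse the builder
        int c = (peeked != -1) ? peeked : in.read();
        peeked = -1;
        while (c == ' ' || c == '\n' || c == '\r') {
            c = in.read();                       // skip whitespace
        }
        while (c != -1 && c != ' ' && c != '\n' && c != '\r') {
            token.append((char) c);
            c = in.read();
        }
        peeked = c;                              // delimiter (or -1 at end of input)
        return token.length() == 0 ? null : token.toString();
    }

    public static void main(String[] args) throws IOException {
        TokenLexerDemo lexer =
            new TokenLexerDemo(new ByteArrayInputStream("12 0 obj".getBytes()));
        System.out.println(lexer.nextToken()); // 12
        System.out.println(lexer.nextToken()); // 0
        System.out.println(lexer.nextToken()); // obj
    }
}
```

Reading {num} {gen} obj as three tokens, then deciding the first was just a number, is exactly where the internal state (the lookahead) has to be unwound or reset.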


> Conforming parser
> -
>
> Key: PDFBOX-1000
> URL: https://issues.apache.org/jira/browse/PDFBOX-1000
> Project: PDFBox
>  Issue Type: New Feature
>  Components: Parsing
>Reporter: Adam Nichols
>Assignee: Adam Nichols
> Attachments: COSUnread.java, ConformingPDDocument.java, 
> ConformingPDFParser.java, ConformingPDFParserTest.java, PDFLexer.java, 
> PDFLexer.java, PDFStreamConstants.java, PDFStreamConstants.java, 
> XrefEntry.java, conforming-parser.patch, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until 
> it has read the EOF marker, the xref location, and trailer[1].  Once this is 
> read, it will read in the xref table so it can locate other objects and 
> revisions.  This also allows skipping objects which have been rendered 
> obsolete (per the xref table)[2].  It also allows the minimum amount of 
> information to be read when the file is loaded, and then subsequent 
> information will be loaded if and when it is requested.  This is all laid out 
> in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new 
> classes in order to accommodate the lazy reading which is a very different 
> paradigm from the existing parser.  Using separate classes will also 
> eliminate the possibility of regression bugs from making their way into the 
> PDDocument or BaseParser classes.  Changes to existing classes will be kept 
> to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular 
> object"
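The backward read described above can be sketched as follows. This is an illustration only, not the attached ConformingPDFParser: scan the file tail for the "startxref" keyword and parse the xref offset recorded after it.

```java
import java.nio.charset.StandardCharsets;

public class StartXrefDemo {

    // Returns the byte offset of the xref table as recorded before %%EOF,
    // or -1 if no startxref keyword is found.
    static long findStartXref(byte[] file) {
        // Latin-1 preserves a 1:1 byte-to-char mapping for the keyword search.
        String tail = new String(file, StandardCharsets.ISO_8859_1);
        int idx = tail.lastIndexOf("startxref");
        if (idx < 0) {
            return -1;
        }
        // The numeric line after the keyword holds the offset.
        String after = tail.substring(idx + "startxref".length()).trim();
        StringBuilder digits = new StringBuilder();
        for (int i = 0; i < after.length() && Character.isDigit(after.charAt(i)); i++) {
            digits.append(after.charAt(i));
        }
        return digits.length() == 0 ? -1 : Long.parseLong(digits.toString());
    }

    public static void main(String[] args) {
        byte[] tail = "trailer\n<< /Size 5 >>\nstartxref\n1234\n%%EOF\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(findStartXref(tail)); // 1234
    }
}
```

A real implementation would read only the last few hundred bytes of the file rather than the whole thing, and would then seek to the returned offset to parse the xref table itself.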





[jira] [Updated] (PDFBOX-1000) Conforming parser

2012-07-19 Thread Maruan Sahyoun (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maruan Sahyoun updated PDFBOX-1000:
---

Attachment: PDFStreamConstants.java
PDFLexer.java

New version of the PDFLexer.
