ConformingParser (PDFBOX-1000)

2012-07-19 Thread Maruan Sahyoun
Hi there,

Resuming work on PDFBOX-1000, I came across the question of how to maintain 
some state within the base components PDFLexer and SimpleParser (which has yet 
to come). 

E.g. in order to differentiate a number from an indirect object I potentially 
have to read three tokens, {num} {gen} obj, to check whether {num} is an 
individual number or the start of an indirect object. There are two ways to 
recover if I've read too many tokens and the number was in fact an individual 
object:

a) depend on file position e.g. filePointer and seek
b) maintain some internal state

I currently tend to go for b), as this would remove the dependency on 
filePointer() and seek() or similar methods. But that means that if parsing 
has to start from a new point within the file, object etc., there needs to be 
some reset() call to reset the state. Also, the caller, e.g. ConformingParser, 
has to make sure that there is some way to reposition the cursor. On the other 
hand, not being dependent on a specific position would enable the PDFLexer and 
SimpleParser to be extended to work on byte[] and similar. 
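
To make option b) concrete, here is a rough sketch of what I have in mind. 
It is not the actual PDFLexer; TokenBuffer, pushBack(), reset() and 
readObject() are hypothetical names, and the input is simply pre-split on 
whitespace for illustration. Tokens read too far ahead are kept in an internal 
buffer instead of seeking back in the file:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical token source with internal pushback state (stands in for a lexer). */
final class TokenBuffer {
    private final String[] tokens;            // pre-split input, illustration only
    private int next = 0;
    private final Deque<String> pushedBack = new ArrayDeque<>();

    TokenBuffer(String input) {
        this.tokens = input.trim().split("\\s+");
    }

    /** Returns the next token, serving pushed-back tokens first; null at end of input. */
    String nextToken() {
        if (!pushedBack.isEmpty()) {
            return pushedBack.pop();
        }
        return next < tokens.length ? tokens[next++] : null;
    }

    /** Returns a token to the buffer so the next nextToken() call yields it again. */
    void pushBack(String token) {
        if (token != null) {
            pushedBack.push(token);
        }
    }

    /** Drops buffered tokens; the caller must call this after repositioning the input. */
    void reset() {
        pushedBack.clear();
    }
}

final class LookaheadSketch {
    static boolean isUnsignedInt(String s) {
        return s != null && s.matches("\\d+");
    }

    /** Reads either "{num} {gen} obj" or a single token, without any seek(). */
    static String readObject(TokenBuffer in) {
        String num = in.nextToken();
        if (num == null) {
            return null;
        }
        if (!isUnsignedInt(num)) {
            return "other(" + num + ")";
        }
        String gen = in.nextToken();
        String keyword = in.nextToken();
        if (isUnsignedInt(gen) && "obj".equals(keyword)) {
            return "indirect(" + num + " " + gen + ")";
        }
        // Read too far: push back in reverse order so gen is returned before keyword.
        in.pushBack(keyword);
        in.pushBack(gen);
        return "number(" + num + ")";
    }
}
```

The price is visible in reset(): whoever repositions the underlying input has 
to remember to clear the buffered tokens, otherwise stale tokens are returned.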

WDYT

Kind regards

Maruan Sahyoun


Re: ConformingParser (PDFBOX-1000)

2012-07-19 Thread Timo Boehme

Hi,

On 19.07.2012 13:02, Maruan Sahyoun wrote:

> Resuming work on PDFBOX-1000, I came across the question of how to maintain
> some state within the base components PDFLexer and SimpleParser (which has
> yet to come).
>
> E.g. in order to differentiate a number from an indirect object I
> potentially have to read three tokens, {num} {gen} obj, to check whether
> {num} is an individual number or the start of an indirect object. There are
> two ways to recover if I've read too many tokens and the number was in fact
> an individual object:
>
> a) depend on file position e.g. filePointer and seek
> b) maintain some internal state
>
> I currently tend to go for b), as this would remove the dependency on
> filePointer() and seek() or similar methods. But that means that if parsing
> has to start from a new point within the file, object etc., there needs to
> be some reset() call to reset the state. Also, the caller, e.g.
> ConformingParser, has to make sure that there is some way to reposition the
> cursor. On the other hand, not being dependent on a specific position would
> enable the PDFLexer and SimpleParser to be extended to work on byte[] and
> similar.
>
> WDYT

Why not use o.a.p.io.RandomAccessRead? This interface can be 
implemented for all kinds of input material.
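
As a rough illustration of the idea (a sketch only: SeekableSource below is a 
hypothetical, simplified interface and not the actual RandomAccessRead 
signature; ByteArraySource and tryReadObjectHeader() are likewise made up for 
this example, and I/O error handling is left out), a seek-based lexer can run 
on a byte[] just as well as on a file:

```java
/** Hypothetical minimal seekable input; NOT the real RandomAccessRead signature. */
interface SeekableSource {
    int read();                 // next byte as 0..255, or -1 at end of input
    long getPosition();         // current read position
    void seek(long position);   // reposition the cursor
}

/** In-memory implementation: the same lexer code then works on a byte[]. */
final class ByteArraySource implements SeekableSource {
    private final byte[] data;
    private int pos = 0;

    ByteArraySource(byte[] data) {
        this.data = data;
    }

    public int read() {
        return pos < data.length ? data[pos++] & 0xFF : -1;
    }

    public long getPosition() {
        return pos;
    }

    public void seek(long position) {
        if (position < 0 || position > data.length) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        pos = (int) position;
    }
}

final class SeekSketch {
    private static boolean isWhitespace(int b) {
        return b == ' ' || b == '\n' || b == '\r' || b == '\t';
    }

    /** Reads one whitespace-delimited token; returns null at end of input. */
    static String readToken(SeekableSource src) {
        int b = src.read();
        while (isWhitespace(b)) {
            b = src.read();
        }
        if (b == -1) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        while (b != -1 && !isWhitespace(b)) {
            sb.append((char) b);
            b = src.read();
        }
        return sb.toString();
    }

    private static boolean isUnsignedInt(String s) {
        return s != null && s.matches("\\d+");
    }

    /** Consumes "{num} {gen} obj" if present; otherwise seeks back to the start. */
    static boolean tryReadObjectHeader(SeekableSource src) {
        long start = src.getPosition();
        String num = readToken(src);
        String gen = readToken(src);
        String keyword = readToken(src);
        boolean ok = isUnsignedInt(num) && isUnsignedInt(gen) && "obj".equals(keyword);
        if (!ok) {
            src.seek(start);    // recover by position instead of by internal state
        }
        return ok;
    }
}
```

With this approach the lexer keeps no lookahead state of its own, so no 
reset() call is needed when the caller repositions the input.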



Best regards,

Timo

