[ 
https://issues.apache.org/jira/browse/PDFBOX-1000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025092#comment-13025092
 ] 

Thomas Chojecki commented on PDFBOX-1000:
-----------------------------------------

Hi,

1) You're right, in some cases it would be easier to separate values with a space.

The object reader does the following (some abstract code):

readValue
{
  char c1 = readChar;
  if (c1 == SPACE) {
    readValue; // maybe something like skipSpaces at the beginning would be
               // better, because someone can exhaust the stack with a long
               // run of whitespace due to the recursion.
  } else if ( c1 == '/') {
    // unreadByte; // Not needed, because readCOSName can read until whitespace
    //             // or '/' and remove the leading '/' if it exists.
    readCOSName;
  } else if ( c1 == INTEGER) {
    unreadByte;
    readCOSIntOrRef;    
  } else if ( c1 == '(' ) {
    // unreadByte; // see readCOSName
    readCOSString;
  }
  ... // and so on. '<' is tricky: it can start a dictionary or a hex string,
      // so we need to read one more byte to see whether it is a new dict or
      // a string.
}

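readCOSIntOrRef
{
  buffer b1;
  char c1;
  while ((c1 = readChar) != '/' && c1 != '>') {
    if (c1 == SPACE) {
      // b1 holds the object number, the next char is the generation number
      ref = new COSRef(b1, readChar);
      readChar; readChar; // to skip the " R"
      return ref;
    }
    writeBuffer(b1, c1);
  }
  unreadByte; // so the caller still sees the '/' or '>'
  return new COSInteger(b1);
}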

readObj 
{
  readCOSName
  readValue
}
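
The dispatch idea above can be sketched in Java. This is only a rough
illustration under my own assumptions, not PDFBox code: TokenSniffer, Kind,
skipSpaces and sniff are hypothetical names. It skips whitespace in a loop
instead of recursing, and it reads one byte past '<' to tell a dictionary
from a hex string.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of PDFBox): classifies the next token.
final class TokenSniffer {
    enum Kind { NAME, NUMBER, LITERAL_STRING, HEX_STRING, DICTIONARY, OTHER, EOF }

    // PDF whitespace bytes: NUL, TAB, LF, FF, CR, SPACE.
    static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    // Skips whitespace iteratively, so the call depth stays constant
    // no matter how many whitespace bytes the file contains.
    static void skipSpaces(PushbackInputStream in) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (!isWhitespace(c)) {
                in.unread(c);
                return;
            }
        }
    }

    // Classifies the next token and leaves the stream at its first byte.
    // The stream needs a pushback buffer of at least 2 bytes.
    static Kind sniff(PushbackInputStream in) throws IOException {
        skipSpaces(in);
        int c = in.read();
        if (c == -1) {
            return Kind.EOF;
        }
        Kind kind;
        if (c == '/') {
            kind = Kind.NAME;
        } else if (c == '(') {
            kind = Kind.LITERAL_STRING;
        } else if (c == '<') {
            // "<<" opens a dictionary, a single '<' opens a hex string.
            int next = in.read();
            kind = (next == '<') ? Kind.DICTIONARY : Kind.HEX_STRING;
            if (next != -1) {
                in.unread(next);
            }
        } else if ((c >= '0' && c <= '9') || c == '+' || c == '-' || c == '.') {
            kind = Kind.NUMBER; // integer, real, or the start of "n g R"
        } else {
            kind = Kind.OTHER;
        }
        in.unread(c);
        return kind;
    }

    // Convenience overload for experiments with in-memory text.
    static Kind sniff(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        try {
            return sniff(new PushbackInputStream(new ByteArrayInputStream(bytes), 2));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For example, TokenSniffer.sniff("  << /Type /Page >>") classifies the input
as a DICTIONARY, while TokenSniffer.sniff("<48656C6C6F>") gives HEX_STRING.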

Writing a parser for PDF is one of the hardest things. The spec is imprecise
and leaves developers room to interpret it in whatever way they consider
spec-conforming.

2) One way is to read the last xxx bytes (maybe 100) and search for
"startxref". Once we have it, we can jump to the xref table / stream and
parse to the end. Or we read the last xxx bytes and try to find the
"trailer". I would prefer the first approach.
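
A minimal Java sketch of the first approach, under some assumptions of mine:
the file is already in a byte array (a real reader would seek to the tail of
the file instead), and StartXrefFinder / findStartXref are hypothetical names,
not PDFBox API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of PDFBox).
final class StartXrefFinder {
    private static final String KEYWORD = "startxref";

    // Searches the last tailSize bytes for the last "startxref" keyword and
    // returns the decimal offset that follows it, or -1 if none is found.
    static long findStartXref(byte[] pdf, int tailSize) {
        int from = Math.max(0, pdf.length - tailSize);
        // ISO-8859-1 maps every byte to exactly one char, so indices line up.
        String tail = new String(pdf, from, pdf.length - from, StandardCharsets.ISO_8859_1);
        int idx = tail.lastIndexOf(KEYWORD);
        if (idx < 0) {
            return -1;
        }
        int i = idx + KEYWORD.length();
        // Skip the line break (or other whitespace) after the keyword.
        while (i < tail.length() && Character.isWhitespace(tail.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < tail.length() && tail.charAt(i) >= '0' && tail.charAt(i) <= '9') {
            i++;
        }
        if (i == start) {
            return -1; // keyword present but no offset after it
        }
        try {
            return Long.parseLong(tail.substring(start, i));
        } catch (NumberFormatException e) {
            return -1; // digit run too long to be a valid offset
        }
    }
}
```

If the keyword is not inside the tail window, the caller could retry with a
larger window before giving up.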

After parsing the first xref table and the trailer, we should check whether
the document contains another one, parse that too, and skip references that
have already been parsed.
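
Roughly, in Java: the sketch below assumes each xref section has already been
parsed into a small in-memory model, and that the trailer's /Prev entry gives
the offset of the previous section. XrefChain, Section and resolve are
hypothetical names. Entries from newer sections win, and a visited set guards
against files whose /Prev entries form a loop.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model (not part of PDFBox).
final class XrefChain {
    // One parsed xref section together with the /Prev value of its trailer.
    static final class Section {
        final Map<Integer, Long> offsets; // object number -> byte offset
        final long prev;                  // offset of the previous section, -1 if none

        Section(Map<Integer, Long> offsets, long prev) {
            this.offsets = offsets;
            this.prev = prev;
        }
    }

    // Walks from the newest section back through the chain and merges the
    // entries; references that were already seen are skipped.
    static Map<Integer, Long> resolve(Map<Long, Section> sectionsByOffset, long startXref) {
        Map<Integer, Long> merged = new LinkedHashMap<>();
        Set<Long> visited = new HashSet<>();
        long pos = startXref;
        while (pos >= 0 && visited.add(pos)) {
            Section section = sectionsByOffset.get(pos);
            if (section == null) {
                break; // dangling /Prev, stop with what we have
            }
            for (Map.Entry<Integer, Long> e : section.offsets.entrySet()) {
                merged.putIfAbsent(e.getKey(), e.getValue());
            }
            pos = section.prev;
        }
        return merged;
    }
}
```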

3) A good parser takes time to write, and we should keep the old
implementation in case the new one can't parse all of a user's documents. The
next issue is seek time. I can't imagine that parsing a document lazily is
quicker than parsing it completely from the beginning. If the parser needs to
jump between objects, that costs a lot of time on a hard disk. The last
question is how much of the document the user needs to parse to get the
information he needs to work with it. If he needs to read 50% or 70% of it,
we may as well parse the whole document.

a) That idea is good. We can grab the minimal information without parsing the
document completely, and on the first request for e.g. a page, parse the
information needed.
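
As a rough sketch of that "parse on first request" idea in Java: objects are
looked up through the xref offsets and parsed only when somebody asks for
them, then cached. LazyObjectPool and its parseAt callback are hypothetical
names for illustration, not PDFBox classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

// Hypothetical helper (not part of PDFBox).
final class LazyObjectPool<T> {
    private final Map<Integer, Long> xref;    // object number -> byte offset
    private final LongFunction<T> parseAt;    // parses the object at an offset
    private final Map<Integer, T> cache = new HashMap<>();

    LazyObjectPool(Map<Integer, Long> xref, LongFunction<T> parseAt) {
        this.xref = xref;
        this.parseAt = parseAt;
    }

    // Returns the object, parsing it on the first request only.
    // Returns null if the xref table has no entry for objNum.
    T get(int objNum) {
        T cached = cache.get(objNum);
        if (cached != null) {
            return cached;
        }
        Long offset = xref.get(objNum);
        if (offset == null) {
            return null;
        }
        T parsed = parseAt.apply(offset);
        cache.put(objNum, parsed);
        return parsed;
    }
}
```

Whether this beats parsing everything up front depends on how many objects
are actually requested, which is the seek-time question from point 3.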

My plan is to take a look after work, debug some documents, show which
documents may fail, and fix that.

> Conforming parser
> -----------------
>
>                 Key: PDFBOX-1000
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1000
>             Project: PDFBox
>          Issue Type: New Feature
>          Components: Parsing
>            Reporter: Adam Nichols
>            Assignee: Adam Nichols
>         Attachments: ConformingPDDocument.java, ConformingPDFParser.java, 
> ConformingPDFParserTest.java, XrefEntry.java, gdb-refcard.pdf
>
>
> A conforming parser will start at the end of the file and read backward until 
> it has read the EOF marker, the xref location, and trailer[1].  Once this is 
> read, it will read in the xref table so it can locate other objects and 
> revisions.  This also allows skipping objects which have been rendered 
> obsolete (per the xref table)[2].  It also allows the minimum amount of 
> information to be read when the file is loaded, and then subsequent 
> information will be loaded if and when it is requested.  This is all laid out 
> in the official PDF specification, ISO 32000-1:2008.
> Existing code will be re-used where possible, but this will require new 
> classes in order to accommodate the lazy reading which is a very different 
> paradigm from the existing parser.  Using separate classes will also 
> eliminate the possibility of regression bugs from making their way into the 
> PDDocument or BaseParser classes.  Changes to existing classes will be kept 
> to a minimum in order to prevent regression bugs.
> [1] Section 7.5.5 "Conforming readers should read a PDF file from its end"
> [2] Section 7.5.4 "the entire file need not be read to locate any particular 
> object"

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
