[jira] [Commented] (PDFBOX-1987) Provide a PDF Lexer as a base for PDF parsing

2014-03-18 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939630#comment-13939630
 ] 

John Hewson commented on PDFBOX-1987:
-------------------------------------

Thanks, it's a tricky problem to solve.

> Provide a PDF Lexer as a base for PDF parsing
> ---------------------------------------------
>
> Key: PDFBOX-1987
> URL: https://issues.apache.org/jira/browse/PDFBOX-1987
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing
> Reporter: Maruan Sahyoun
> Priority: Minor
> Fix For: 2.0.0
>
> Attachments: src.zip
>
>
> In order to enhance the parsing process and as a foundation for a combination 
> of the different parsers a PDF lexer should be provided.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (PDFBOX-1987) Provide a PDF Lexer as a base for PDF parsing

2014-03-18 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939008#comment-13939008
 ] 

Maruan Sahyoun commented on PDFBOX-1987:
----------------------------------------

PDFBOX-276 describes such a file. PDF.js has some files with invalid hex 
strings. There are some files which have missing CR and/or LF at the end of a 
stream ...
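
As an illustration of what relaxed handling of an invalid hex string could look like, here is a minimal sketch. It is not the attached lexer: the class name `LenientHex` and the policy of silently skipping non-hex characters are assumptions made for this example. Padding an odd number of digits with a trailing 0 follows the PDF specification's rule for hex strings.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical helper, for illustration only; not part of PDFBox. */
final class LenientHex {

    /** Returns the value of a hex digit, or -1 if c is not a hex digit. */
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** Decodes the text between '<' and '>' of a hex string. */
    static byte[] decode(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high nibble, or -1 if none
        for (int i = 0; i < body.length(); i++) {
            int digit = hexValue(body.charAt(i));
            if (digit < 0) {
                continue; // whitespace or invalid character: skipped (assumed policy)
            }
            if (high < 0) {
                high = digit;
            } else {
                out.write((high << 4) | digit);
                high = -1;
            }
        }
        if (high >= 0) {
            out.write(high << 4); // odd digit count: last digit is padded with 0
        }
        return out.toByteArray();
    }
}
```

A strict mode would report the skipped character instead of ignoring it; whether skipping is the right recovery for real-world files is an open question.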



[jira] [Commented] (PDFBOX-1987) Provide a PDF Lexer as a base for PDF parsing

2014-03-17 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938560#comment-13938560
 ] 

John Hewson commented on PDFBOX-1987:
-------------------------------------

{quote}
An area which I left out is how to handle malformed tokens such as strings which
have an unbalanced number of parentheses.
{quote}

Do you have any sample PDF files with this problem?



[jira] [Commented] (PDFBOX-1987) Provide a PDF Lexer as a base for PDF parsing

2014-03-14 Thread Maruan Sahyoun (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934744#comment-13934744
 ] 

Maruan Sahyoun commented on PDFBOX-1987:
----------------------------------------

I attached a version of a PDF lexer together with a set of tests and some 
helper classes which extend RandomAccessRead to be able to read test data from 
strings for easier testing.

The purpose is that people who are interested - and have a better programming 
background - can inspect and comment on the code. 

An area which I left out is how to handle malformed tokens such as strings which
have an unbalanced number of parentheses. In relaxed processing such errors
should be fixed. In strict processing they should be reported, and potentially
fixed, since processing shouldn’t stop at the first error.
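
To make the two modes concrete, here is a rough sketch of scanning a literal string while tracking parenthesis depth. It is not taken from the attached code: `LiteralStringScanner`, `Result` and the `strict` flag are names made up for this example, and the relaxed recovery (treating end of input as the end of the string) is only one possible assumption.

```java
/** Hypothetical scanner, for illustration only; not part of PDFBox. */
final class LiteralStringScanner {

    /** Outcome of scanning one literal string. */
    static final class Result {
        final String value;     // raw text between the outer parentheses, escapes not decoded
        final int end;          // index just past the last character consumed
        final boolean balanced; // false if the input ended before the closing ')'

        Result(String value, int end, boolean balanced) {
            this.value = value;
            this.end = end;
            this.balanced = balanced;
        }
    }

    /** Scans the literal string that starts with '(' at index start. */
    static Result scan(String input, int start, boolean strict) {
        if (start >= input.length() || input.charAt(start) != '(') {
            throw new IllegalArgumentException("expected '(' at " + start);
        }
        StringBuilder raw = new StringBuilder();
        int depth = 1;
        int i = start + 1;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\' && i + 1 < input.length()) {
                // An escaped character never changes the nesting depth.
                raw.append(c).append(input.charAt(i + 1));
                i += 2;
                continue;
            }
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    return new Result(raw.toString(), i + 1, true);
                }
            }
            raw.append(c);
            i++;
        }
        if (strict) {
            throw new IllegalStateException(
                    "unbalanced parentheses in string starting at " + start);
        }
        // Relaxed mode (assumed policy): close the string at end of input.
        return new Result(raw.toString(), i, false);
    }
}
```

Throwing in strict mode stops at the first error, which is exactly what the paragraph above wants to avoid, so this sketch only shows where the decision point sits.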

My current idea is that the lexer fires events in such cases, which a parser
could listen for and react to. Again, I'm looking for comments and ideas on
this.
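
A minimal sketch of that event idea, to have something concrete to comment on. Everything here is hypothetical: `MalformedTokenListener`, `ReportingLexer` and `countLiteralStrings` are names invented for this example and are not in the attached code. The lexer reports a problem to its listeners and keeps going, so the listener decides whether to collect, log or abort.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical callback; name and signature are illustrative only. */
interface MalformedTokenListener {
    void malformedToken(int offset, String message);
}

/** Hypothetical lexer fragment that reports problems instead of throwing. */
final class ReportingLexer {
    private final List<MalformedTokenListener> listeners = new ArrayList<>();

    void addListener(MalformedTokenListener listener) {
        listeners.add(listener);
    }

    private void report(int offset, String message) {
        for (MalformedTokenListener listener : listeners) {
            listener.malformedToken(offset, message);
        }
    }

    /**
     * Counts the top-level literal strings in input. An unterminated string
     * is reported to the listeners and still counted.
     */
    int countLiteralStrings(String input) {
        int count = 0;
        int depth = 0;
        int open = -1; // offset of the '(' that opened the current string
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (depth > 0 && c == '\\') {
                i++; // skip the escaped character
            } else if (c == '(') {
                if (depth == 0) {
                    open = i;
                }
                depth++;
            } else if (c == ')' && depth > 0) {
                depth--;
                if (depth == 0) {
                    count++;
                }
            } else if (c == ')') {
                report(i, "unexpected ')'");
            }
        }
        if (depth > 0) {
            report(open, "literal string not closed");
            count++;
        }
        return count;
    }
}
```

A strict parser could register a listener that collects every report and fails at the end, for example `lexer.addListener((offset, message) -> errors.add(offset + ": " + message));` with `errors` being a list owned by the parser, while a relaxed parser could register none.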
