[jira] [Commented] (PDFBOX-1987) Provide a PDF Lexer as a base for PDF parsing
[ https://issues.apache.org/jira/browse/PDFBOX-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939630#comment-13939630 ]

John Hewson commented on PDFBOX-1987:
-------------------------------------

Thanks, it's a tricky problem to solve.

> Provide a PDF Lexer as a base for PDF parsing
> ---------------------------------------------
>
>                 Key: PDFBOX-1987
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1987
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing
>            Reporter: Maruan Sahyoun
>            Priority: Minor
>             Fix For: 2.0.0
>
>         Attachments: src.zip
>
>
> In order to enhance the parsing process and as a foundation for a combination
> of the different parsers a PDF lexer should be provided.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (PDFBOX-1987) Provide a PDF Lexer as a base for PDF parsing
[ https://issues.apache.org/jira/browse/PDFBOX-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13939008#comment-13939008 ]

Maruan Sahyoun commented on PDFBOX-1987:
----------------------------------------

PDFBOX-276 describes such a file. PDF.js has some files with invalid hex strings. There are some files which have missing CR and/or LF at the end of a stream ...
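For the invalid hex strings mentioned above, one lenient approach could look like the following Java sketch. `HexStringSketch` and its `decode` method are hypothetical names, not PDFBox API and not code from src.zip. What exactly is wrong with the PDF.js files is not stated here, so the sketch simply assumes stray non-hex characters and an odd number of digits. Padding a missing final digit with 0 follows the PDF specification's rule for hex strings.

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

/** Hypothetical sketch: lenient decoding of the text between '<' and '>' of a PDF hex string. */
final class HexStringSketch {

    /** Returns the value of a hex digit, or -1 if c is not one. */
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    /** PDF white-space characters: NUL, HT, LF, FF, CR, SP. */
    private static boolean isPdfWhitespace(char c) {
        return c == 0 || c == '\t' || c == '\n' || c == '\f' || c == '\r' || c == ' ';
    }

    /**
     * Decodes the body of a hex string. White space is ignored.
     * Assumption of this sketch: non-hex characters are skipped and recorded
     * in problems instead of aborting, so that a caller can decide what to do.
     */
    static byte[] decode(String body, List<String> problems) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int high = -1; // pending high nibble, or -1 if none
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (isPdfWhitespace(c)) {
                continue;
            }
            int digit = hexValue(c);
            if (digit < 0) {
                problems.add("non-hex character '" + c + "' at offset " + i);
                continue;
            }
            if (high < 0) {
                high = digit;
            } else {
                out.write((high << 4) | digit);
                high = -1;
            }
        }
        if (high >= 0) {
            // Odd number of digits: the missing final digit is assumed to be 0.
            out.write(high << 4);
        }
        return out.toByteArray();
    }
}
```

A strict caller could treat a non-empty `problems` list as a failure, while a relaxed caller could log it and continue with the decoded bytes.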
[jira] [Commented] (PDFBOX-1987) Provide a PDF Lexer as a base for PDF parsing
[ https://issues.apache.org/jira/browse/PDFBOX-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938560#comment-13938560 ]

John Hewson commented on PDFBOX-1987:
-------------------------------------

{quote}
An area which I left out is how to handle malformed tokens such as strings with unbalanced parentheses.
{quote}

Do you have any sample PDF files with this problem?
[jira] [Commented] (PDFBOX-1987) Provide a PDF Lexer as a base for PDF parsing
[ https://issues.apache.org/jira/browse/PDFBOX-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13934744#comment-13934744 ]

Maruan Sahyoun commented on PDFBOX-1987:
----------------------------------------

I attached a version of a PDF lexer together with a set of tests and some helper classes which extend RandomAccessRead so that test data can be read from strings for easier testing. The purpose is that people who are interested, and have a stronger programming background, can inspect and comment on the code.

An area which I left out is how to handle malformed tokens, such as strings with unbalanced parentheses. For relaxed processing, such errors should be fixed. For strict processing, such errors should be reported and potentially fixed, as processing shouldn’t stop at the first error. The current idea I have in mind is that the lexer raises events in such cases, which a parser could listen for and react to. Again, I'm looking for comments and ideas on this.
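To make the event idea more concrete, here is a minimal Java sketch of a lexer routine that reports an unbalanced literal string to a listener instead of stopping. `LexerErrorListener` and `LiteralStringSketch` are hypothetical names chosen for illustration. They are not taken from src.zip or from PDFBox, and the sketch reads from a plain `String` rather than from RandomAccessRead. Escape sequences are only skipped over, not decoded.

```java
/** Hypothetical callback a parser could register with the lexer. */
interface LexerErrorListener {
    void onError(int offset, String message);
}

/** Hypothetical sketch: reading one PDF literal string such as "(a(b)c)". */
final class LiteralStringSketch {

    /**
     * Reads the literal string whose opening '(' is at position start.
     * Balanced nested parentheses belong to the string content.
     * If the input ends before the matching ')', the listener is notified
     * and the text read so far is returned, so processing can continue.
     */
    static String read(String input, int start, LexerErrorListener listener) {
        if (start >= input.length() || input.charAt(start) != '(') {
            throw new IllegalArgumentException("no '(' at offset " + start);
        }
        StringBuilder out = new StringBuilder();
        int depth = 1; // number of currently open parentheses
        int i = start + 1;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\\' && i + 1 < input.length()) {
                // Escaped character: copy it verbatim and do not count it.
                // Decoding of \n, octal codes etc. is omitted in this sketch.
                out.append(input.charAt(i + 1));
                i += 2;
                continue;
            }
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    return out.toString();
                }
            }
            out.append(c);
            i++;
        }
        listener.onError(start, "literal string not closed, " + depth + " unmatched '('");
        return out.toString();
    }
}
```

With this shape, a relaxed parser could pass a listener that ignores or logs the event and keep the recovered text, while a strict parser could collect every event, for example with `(offset, message) -> errors.add(message)`, and report them all at the end.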