Maruan,

> Now let’s assume there is a situation where an object is not at a certain 
> location, or a specific string is missing …. what if we throw an exception 
> where one could register a handler. We pass some kind of context e.g. lexer, 
> file position, token …. and the user can handle the exception and „enrich“ 
> the content or pass the correct information.

The idea sounds reasonable in theory, but the more I reflect on it, the more I 
think that we should assume that the user is making use of PDFBox because they 
don’t want to have to parse the PDF file themselves. I can’t think of an 
example where the knowledge of how to correct some invalid PDF wouldn’t be 
better off existing within PDFBox itself, rather than in user code.

From a technical standpoint, exposing the internal parser context to the user 
seems particularly problematic: the internal implementation details which are 
part of the context now become part of PDFBox’s public API which needs to be 
kept stable between major releases. How is the user to resolve a non-trivial 
exception and allow parsing to continue in a manner which leaves the internals 
of the parser in a consistent state? If we don’t know how users are resolving 
exceptions out in the real world, how can we be sure that changes we make to 
the parser later won’t break their code?
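
To make the consistency concern concrete, here is a minimal sketch in Java. 
Everything in it is hypothetical: TinyLexer, seek and the token names are made 
up for illustration and are not PDFBox classes. It assumes a lexer that caches 
one token of lookahead and a seek method of the kind a handler context might 
expose.

```java
// Hypothetical illustration only; TinyLexer is not part of PDFBox.
class TinyLexer {
    private final String[] tokens;
    private int index = 0;
    private String lookahead; // internal cache that a handler cannot see

    TinyLexer(String... tokens) {
        this.tokens = tokens;
    }

    // Returns the next token without consuming it, caching the result.
    String peek() {
        if (lookahead == null) {
            lookahead = tokens[index];
        }
        return lookahead;
    }

    // Consumes and returns the next token.
    String next() {
        String token = peek();
        lookahead = null;
        index++;
        return token;
    }

    // The kind of method a handler context would expose (an assumption).
    // It moves the read position but knows nothing about the cache.
    void seek(int newIndex) {
        index = newIndex;
    }
}
```

If the parser has already peeked at a bad token and a handler then calls 
seek(2) to skip past it, the following next() still returns the stale cached 
token, and the token at index 2 is silently lost. A handler author could not 
know this without reading the parser source, and we could not change the 
caching later without risking their code.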

> In addition to that we are able to extend from a strictly conformant parsing 
> to a relaxed parsing by using the same mechanism thus having the workarounds 
> not in the ‚core‘ parser.


My suggestion would be to either subclass the core parser or pass it a 
“conformance level” argument, e.g. PDF_1_5 or PDF_X. I don’t think any external 
error handling/recovery mechanism is going to work in practice, especially if 
that means generating thousands of exceptions when given a bad content stream.
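
A rough sketch of the "conformance level" variant, using the %%EOF case you 
mentioned. The names ConformanceLevel, SketchParser and locateEof are 
assumptions for illustration, not existing PDFBox API, and STRICT/LENIENT 
simply stand in for whatever levels we would settle on.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical names; this is not the existing PDFBox parser API.
enum ConformanceLevel { STRICT, LENIENT }

class SketchParser {
    private static final String EOF_MARKER = "%%EOF";

    private final ConformanceLevel level;
    private final List<String> warnings = new ArrayList<>();

    SketchParser(ConformanceLevel level) {
        this.level = level;
    }

    // Returns the offset of the last %%EOF marker in the file contents.
    int locateEof(String contents) throws IOException {
        int pos = contents.lastIndexOf(EOF_MARKER);
        if (pos < 0) {
            throw new IOException("Missing %%EOF marker");
        }
        // Anything other than whitespace after the marker is non-conformant.
        String trailing = contents.substring(pos + EOF_MARKER.length()).trim();
        if (!trailing.isEmpty()) {
            if (level == ConformanceLevel.STRICT) {
                throw new IOException("%%EOF is not the last entry in the file");
            }
            // The workaround stays inside the parser: record it, do not throw.
            warnings.add("Ignored " + trailing.length()
                    + " trailing characters after %%EOF");
        }
        return pos;
    }

    List<String> getWarnings() {
        return warnings;
    }
}
```

With new SketchParser(ConformanceLevel.LENIENT) a file with junk after %%EOF 
parses and leaves a warning; with STRICT the same file fails with an 
IOException. The recovery logic never leaves the parser, and a bad file costs 
a list of warnings rather than a stream of exceptions.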

-- John

On 13 Feb 2014, at 03:24, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:

> Hi John,
> 
> currently pdfbox mostly throws IOExceptions where the user of the lib is not 
> able to do something about it. 
> 
> Some of these exceptions could occur because a file was not found etc. So 
> that’s ok. Others might occur because objects are not at a certain position. 
> There are workarounds for some of these in pdfbox e.g. if %%EOF is not the 
> last entry in a PDF. Thus users are dependent on us putting in the 
> workarounds to handle such situations. 
> 
> Now let’s assume there is a situation where an object is not at a certain 
> location, or a specific string is missing …. what if we throw an exception 
> where one could register a handler. We pass some kind of context e.g. lexer, 
> file position, token …. and the user can handle the exception and „enrich“ 
> the content or pass the correct information. The exception is then resolved 
> and the process can continue.
> 
> In addition to that we are able to extend from a strictly conformant parsing 
> to a relaxed parsing by using the same mechanism thus having the workarounds 
> not in the ‚core‘ parser.
> 
> BR
> Maruan Sahyoun
> 
> Am 13.02.2014 um 09:44 schrieb John Hewson <j...@jahewson.com>:
> 
>> I'm not sure I understand what you mean, the Camel examples are very 
>> complex indeed. A quick concrete example of what you're after would help 
>> greatly.
>> 
>> -- John
>> 
>>> On 13 Feb 2014, at 00:20, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
>>> 
>>> Hi,
>>> 
>>> what do you think of having an exception handling in pdfbox where people 
>>> could define their own handlers. Something similar to
>>> 
>>> https://camel.apache.org/exception-clause.html
>>> 
>>> The benefit would be that we could pass the context e.g. during PDF parsing 
>>> and the handler could return something which is then taken as the input. In 
>>> addition to that maybe we can think about having some additional types of 
>>> exceptions instead of mostly IOException to support that.  
>>> 
>>> BR
>>> Maruan Sahyoun
>>> 
> 
