Re: [DISCUSS] PDFBox and Exception handling

John Hewson Fri, 14 Feb 2014 03:50:32 -0800

Maruan

> On 14 Feb 2014, at 00:43, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
> 
> John,
> 
> 
>> Am 13.02.2014 um 21:57 schrieb John Hewson <j...@jahewson.com>:
>> 
>> Maruan
>> 
>> I’ve given this some considerable thought and realised that while it may 
>> technically work, it won’t be maintainable.
>> 
>>> the user is making use of PDFBox because they don’t want to have to parse 
>>> the PDF file themselves. 
>>>> 
>>>> But in case a file is invalid for whatever reason the only options are to 
>>>> either wait until we include a workaround or put it in themselves. And if 
>>>> you don’t want to write your handler you are not enforced to do so.
>> 
>> It seems to me that any user who knows enough about the internals of the 
>> PDFBox parser to write an error handler should know how to make those 
>> changes in PDFBox and submit a patch. If their patch is good and comes with 
>> an example PDF then it will be in the latest snapshot within a day or so. 
>> Users who don’t know what they’re doing shouldn’t be writing error handlers, 
>> and users who know what they’re doing don’t need to.
>> 
>>>> From a technical standpoint, exposing the internal parser context to the 
>>>> user seems particularly problematic: the internal implementation details 
>>>> which are part of the context now become part of PDFBox’s public API which 
>>>> needs to be kept stable between major releases. How is the user to resolve 
>>>> a non-trivial exception and allow parsing to continue in a manner which 
>>>> leaves the internals of the parser in a consistent state? If we don’t know 
>>>> how users are resolving exceptions out in the real world, how can we be 
>>>> sure that changes we make to the parser later won’t break their code?
>>> 
>>> One can only assume that a documented API is stable. As long as this is the 
>>> case why should it break their code. Of course if a different file is 
>>> causing a similar exception which will be dealt with by the exception 
>>> handler and the code is not able to deal with it ...
>> 
>> 
>> You’ve missed the major point here: any “context” would necessarily expose 
>> the *internal* implementation details of the parser, which breaks the 
>> principle of encapsulation and will make PDFBox harder to maintain without 
>> introducing breaking changes. This is because every internal detail of a 
>> context becomes a fixed part of the public API, as does when and where it is 
>> called, i.e. the *behaviour* is part of the public API, not just the type 
>> signature. Changes to when and where error contexts are used and exactly 
>> how, will be part of the public behaviour and thus must be kept stable. The 
>> end result is that it will be incredibly hard to make internal changes to 
>> the parser without breaking major version compatibility. 
>> 
>> Imagine a scenario where a user has a number of PDFs some of which are known 
>> to contain errors which require manual review of the files after they are 
>> processed. The user discovers that all such erroneous PDFs trigger one of 
>> the parser's error handlers for invalid date parsing, and so they write a 
>> handler callback which correctly parses the date and flags the PDF in their 
>> system as being in need of manual review. Imagine that at some later point, 
>> a helpful person comes along and fixes the underling date parsing issue in 
>> the core PDFBox parser, looks good, right? Boom, that was a breaking change 
>> because it altered the public behaviour of the parser. The invalid date 
>> handler in our user’s code will no longer be called because it is now 
>> handled internally by PDFBox, so the PDF won’t be flagged for review, their 
>> customers receive broken PDFs, and our user’s boss is very unhappy.
>> 
>> Here’s another scenario: a user adds a handler to recover from a parsing 
>> error reading a corrupted stream. They have access to lots of example PDF 
>> files and do an amazing job and handle all sorts of strange edge cases. Then 
>> a PDFBox committer comes along and fixes the same issue in the core parser 
>> but they did not have as many example files which contain all the weird edge 
>> cases handled by our user. The committer’s change doesn’t alter any type 
>> signatures on the public API so it’s not a breaking change right? Wrong, it 
>> changed the public behaviour of the class. The user’s handler will now never 
>> be called, and because we didn’t fix the same edge cases that they did they 
>> will now silently start getting corrupt streams.
> 
> 
> It’s about giving people a way to deal with exceptions caused by malformed 
> PDFs using information and state from within PDFBox. And yes, if we fix a bug 
> and now an exception is no longer raised because it shouldn’t have been 
> raised if a correct implementation would have existed and an application was 
> dependent on our misbehavior you are right. But that would mean that every 
> improvement or bugfix which changes the result breaks the contract. E.g. lets 
> say that we extract additional text, or we no longer extract text that should 
> not have been extracted or we render a PDF differently …. 
> 
> So I do get and understand you point - I don’t share your view though.


That's an overly-literal interpretation of a contract, fixing a parsing bug is 
not going to violate the contract of e.g. PDDocument.load() because that 
contract is "the method either returns a PDDocument object or throws an 
IOException", that it may or may not do so for a given file does not matter. 

Under your error handling scheme the contract changes to "the method either 
returns a PDDocument or calls an error handler providing the error context". 
The problem here is that the error context is likely to change as bug fixes are 
made, *even if they don't fix the original bug that causes the user's handler 
to be called*. Imagine that some unrelated bug fix causes the parser's internal 
state to be different from what the user expected it to be when their error 
handler is called. Their handler might silently make an incorrect repair, 
because PDFBox violated its contract and started passing different context *for 
the same root cause*.

There's just no way to violate the principle of encapsulation and not pay a 
heavy price for it by surprising your users with silent failures and public 
behaviour changes which can't be fathomed without examining the PDFBox source 
code. It's is the opposite of sound OO design.

Who wants to use an error handler that runs the risk of silently causing damage 
in the future? It creates more problems than it ever fixes.

>> 
>>> It’s not about supporting different standards […] It’s about having a core 
>>> which handles conformant files and an extension which handles workarounds 
>>> for nonconformant files.
>> 
>> A commonly used approach to parsing programming languages is to have a core 
>> language which is small, easily parsed and with an AST which is easy to 
>> manipulate. On top of that is another parser which handles all of the 
>> syntactic sugar of the language, transforming a complex concrete AST into a 
>> simple core AST. Perhaps PDFBox could take a similar approach with 
>> ConformingParser having a NonConformingParser subclass which is capable of 
>> pre-processing bad PDF files before they reach the core parser. The actual 
>> implementation may be more subtle than this, perhaps with some 
>> back-and-forth between the conforming and non-conforming parsers, so that 
>> when the conforming parser encounters an error it can call a protected 
>> method which in ConformingParser would throw an error but in 
>> NonConformingParser would perform a recovery, as you proposed. But by using 
>> protected methods we avoid the maintainability problem caused by making the 
>> error recovery mechanism public.
>> 
>> What do you think?
> 
> This is a good and valid approach, but doesn’t address the intention I had.

I'm not sure the intention matters, we should be trying to find the best 
solution to a problem. Let's start over: what is the problem?

> 
>> 
>> -- John
>> 
>>> On 13 Feb 2014, at 10:57, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
>>> 
>>> John
>>> 
>>>> Am 13.02.2014 um 18:50 schrieb John Hewson <johnahew...@yahoo.co.uk>:
>>>> 
>>>> Maruan,
>>>> 
>>>>> Now let’s assume there is a situation where an object is not at a certain 
>>>>> location, or a specific string is missing …. what if we throw an 
>>>>> exception where one could register a handler. We pass some kind of 
>>>>> context e.g. lexer, file position, token …. and the user can handle the 
>>>>> exception and „enrich“ the content or pass the correct information.
>>>> 
>>>> The idea sounds reasonable in theory, but the more I reflect on in the 
>>>> more I think that we should assume that the user is making use of PDFBox 
>>>> because they don’t want to have to parse the PDF file themselves. I can’t 
>>>> think of an example where the knowledge of how to correct some invalid PDF 
>>>> would’t be better off existing within PDFBox itself, rather than in user 
>>>> code.
>>> 
>>> Of course they don’t want to parse it themselves. They can expect that 
>>> PDFBox can handle a valid PDF file. But in case a file is invalid for 
>>> whatever reason the only options are to either wait until we include a 
>>> workaround or put it in themselves. The idea is to have an entry point. 
>>> What’s the benefit of an exception when one can’t do anything about it.  
>>> And if you don’t want to write your handler you are not enforced to do so. 
>>> 
>>>> 
>>>> From a technical standpoint, exposing the internal parser context to the 
>>>> user seems particularly problematic: the internal implementation details 
>>>> which are part of the context now become part of PDFBox’s public API which 
>>>> needs to be kept stable between major releases. How is the user to resolve 
>>>> a non-trivial exception and allow parsing to continue in a manner which 
>>>> leaves the internals of the parser in a consistent state? If we don’t know 
>>>> how users are resolving exceptions out in the real world, how can we be 
>>>> sure that changes we make to the parser later won’t break their code?
>>> 
>>> One can only assume that a documented API is stable. As long as this is the 
>>> case why should it break their code. Of course if a different file is 
>>> causing a similar exception which will be dealt with by the exception 
>>> handler and the code is not able to deal with it ...
>>> 
>>>> 
>>>>> In addition to that we are able to extend from a strictly conformant 
>>>>> parsing to a relaxed parsing by using the same mechanism thus having the 
>>>>> workarounds not in the ‚core‘ parser.
>>>> 
>>>> 
>>>> My suggestion would be to either subclass the core parser or pass it a 
>>>> “conformance level” argument, e.g. PDF_1_5 or PDF_X. I don’t think any 
>>>> external error handling/recovery mechanism is going to work in practice, 
>>>> especially if that means generating thousands of exceptions when given a 
>>>> bad content stream.
>>> 
>>> It’s not about supporting different standards - that’s different thing 
>>> (currently PDFBox doesn’t have concept of applying standards or versions - 
>>> functions are either available or not, regardless of when they became part 
>>> of the PDF spec). It’s about having a core which handles conformant files 
>>> and an extension which handles workarounds for nonconformant files. 
>>> Currently that’s all within the code - sometimes marked, sometimes not - 
>>> which makes it difficult to rewrite the parser. As you already found out 
>>> sometimes a fix was made to handle a single occurrence of a file and the 
>>> file itself might no longer exist.
>>> 
>>> 
>>>> -- John
>>>> 
>>>>> On 13 Feb 2014, at 03:24, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
>>>>> 
>>>>> Hi John,
>>>>> 
>>>>> currently pdfbox mostly throws IOExceptions where the user of the lib is 
>>>>> not able to do something about it. 
>>>>> 
>>>>> Some of these exceptions could occur because a file was not found etc. So 
>>>>> that’s ok. Others might occur because objects are not at a certain 
>>>>> position. There are workarounds for some of these in pdfbox e.g. if %%EOF 
>>>>> ist not the last entry in a PDF. Thus users are dependent on us putting 
>>>>> in the workarounds to handle such situations. 
>>>>> 
>>>>> Now let’s assume there is a situation where an object is not at a certain 
>>>>> location, or a specific string is missing …. what if we throw an 
>>>>> exception where one could register a handler. We pass some kind of 
>>>>> context e.g. lexer, file position, token …. and the user can handle the 
>>>>> exception and „enrich“ the content or pass the correct information. The 
>>>>> exception is than resolved and the process can continue.
>>>>> 
>>>>> In addition to that we are able to extend from a strictly conformant 
>>>>> parsing to a relaxed parsing by using the same mechanism thus having the 
>>>>> workarounds not in the ‚core‘ parser.
>>>>> 
>>>>> BR
>>>>> Maruan Sahyoun
>>>>> 
>>>>>> Am 13.02.2014 um 09:44 schrieb John Hewson <j...@jahewson.com>:
>>>>>> 
>>>>>> I'm not sure in understand what you mean, the Camel examples are very 
>>>>>> complex indeed. A quick concrete example of what you're after would help 
>>>>>> greatly.
>>>>>> 
>>>>>> -- John
>>>>>> 
>>>>>>> On 13 Feb 2014, at 00:20, Maruan Sahyoun <sahy...@fileaffairs.de> wrote:
>>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> what do you think of having an exception handling in pdfbox where 
>>>>>>> people could define their own handlers. Something similar to
>>>>>>> 
>>>>>>> https://camel.apache.org/exception-clause.html
>>>>>>> 
>>>>>>> The benefit would be that we could pass the context e.g. during PDF 
>>>>>>> parsing and the handler could return something which is than taken as 
>>>>>>> the input. In addition to that maybe we can think about having some 
>>>>>>> additional types of exceptions instead of mostly IOException to support 
>>>>>>> that.  
>>>>>>> 
>>>>>>> BR
>>>>>>> Maruan Sahyoun
>

Re: [DISCUSS] PDFBox and Exception handling

Reply via email to