[jira] [Updated] (PDFBOX-1256) [PATCH] Split PDFStreamEngine, moving functionality to simpler stream processor base class

Craig Ringer (Updated) (JIRA) Mon, 19 Mar 2012 06:22:09 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Craig Ringer updated PDFBOX-1256:
---------------------------------

    Description: 
The attached patch restructures PDFStreamEngine to move the basic functionality 
of invoking callbacks for each operator in a stream into a parent class. The 
parent class knows nothing about the meaning of operators, it just invokes 
handlers with accumulated arguments whenever it encounters an operator. 
PDFStreamEngine retains all the "knowledge" of what those operators mean, the 
state of the graphics state stack, etc.

The purpose of the change is to make it simpler and easier to use PDFBox's PDF 
stream processor/parser code without dealing with the full features of 
PDFStreamEngine with its built-in operator handlers, awareness of the graphics 
stack, etc when that functionality isn't required. Specifically, I needed to 
write a tool that copies a PDF stream, renaming resource references as it goes 
but otherwise leaving it unchanged. I wanted to handle all operators including 
future or unknown ones, and only needed to special-case a couple of them. 
PDFStreamEngine was poorly suited to that because it doesn't support a default 
handler fallback, tries to "understand" the stream, etc. Rather than write a 
new class that duplicated much of PDFStreamEngine I thought I'd try to factor 
the required functionality out, so others could use it too.

The changes should be backward compatible with existing code that uses 
PDFStreamEngine. No changes in any PDFStreamEngine clients in PDFBox were 
required for the test suite to pass, text extraction tool to work, etc. 
Nonetheless, it's possible you'll only consider these changes for inclusion in 
PDFBox 2.0, in which case they can be cleaned up to remove some of the backward 
compatibility crap that's currently in them. Let me know.

In terms of open issues or TODOs, the class naming could probably use work. I 
can't rename PDFStreamEngine or OperatorProcessor for backward compatibility 
reasons, so I've had to come up with more contrived names than I'd like.

The logic of the changes is:

- Move content stream argument accumulation and operator callback functionality 
into new PDFStreamProcessor class

- Add support for a default (fallback) handler to PDFStreamProcessor so 
operators not explicitly matched may be handled

- Modify PDFStreamEngine to extend PDFStreamProcessor, retaining all its 
existing methods though some are now inherited.

- Deprecate the properties-map based configuration of PDFStreamEngine because 
it'll be fragile whenever more than one classloader is in use. Add 
PDFStreamProcessor.replaceOperatorProcessors(...) for equivalent functionality 
using a type-safe, multi-classloader-safe HashMap of operator names to handler 
instances. This isn't added as a ctor override because operator handler 
registration/unregistration methods are not final (to preserve compatibility 
with PDFStreamEngine) and if overridden, they might use data from a 
not-yet-initialized derived class. If a ctor override is required then 
registerOperatorProcessor must be made final, breaking BC with PDFStreamEngine.

- Deprecate OperatorProcessor (the PDFStreamEngine operator handler class). 
Instances of this are bound to a particular PDFStreamEngine via the `context' 
property and they carry state when they don't have to. They're also an abstract 
class, so handlers can't extend any other class. OperatorProcessor based 
handlers continue to be supported just fine via a simple wrapper that's used 
automatically where required.

- Introduce new OperatorProcessorHandler interface to replace OperatorProcessor 
. It's a simple one-method interface that passes the PDFStreamProcessor as an 
argument, so application designers are free to choose whether to tie their 
OperationProcessorHandler implementations to PDFStreamProcessor instances or 
whether they want to re-use the same handler on many different processors. This 
change is useful for my app and removes unnecessary stateful API, but isn't 
strictly necessary and can be dropped while retaining the PDFStreamEngine / 
PDFStreamProcessor split. As part of the API change, new-interface handlers are 
passed the original arguments array rather than a copy; if they want a copy of 
the arguments array they have to take it themselves, so that resources aren't 
wasted copying the array when handlers don't actually need it copied.

- Add compatibility code to PDFStreamEngine to ensure that OperatorProcessor 
implementations are wrapped in a helper that translates 
OperatorProcessorHandler interface usage to the usage required by 
OperatorProcessor. All the wrapper does is set the context (which 
PDFStreamEngine seems to do before every handler call) then pass a copy of the 
arguments array.


I'm aware that this is a non-trivial change I'm proposing, but I think it 
significantly improves the API (especially once the BC stuff can be removed for 
PDFBox 2.0) and makes it easier to use this functionality.

Next patch in series: https://issues.apache.org/jira/browse/PDFBOX-1263

  was:
The attached patch restructures PDFStreamEngine to move the basic functionality 
of invoking callbacks for each operator in a stream into a parent class. The 
parent class knows nothing about the meaning of operators, it just invokes 
handlers with accumulated arguments whenever it encounters an operator. 
PDFStreamEngine retains all the "knowledge" of what those operators mean, the 
state of the graphics state stack, etc.

The purpose of the change is to make it simpler and easier to use PDFBox's PDF 
stream processor/parser code without dealing with the full features of 
PDFStreamEngine with its built-in operator handlers, awareness of the graphics 
stack, etc when that functionality isn't required. Specifically, I needed to 
write a tool that copies a PDF stream, renaming resource references as it goes 
but otherwise leaving it unchanged. I wanted to handle all operators including 
future or unknown ones, and only needed to special-case a couple of them. 
PDFStreamEngine was poorly suited to that because it doesn't support a default 
handler fallback, tries to "understand" the stream, etc. Rather than write a 
new class that duplicated much of PDFStreamEngine I thought I'd try to factor 
the required functionality out, so others could use it too.

The changes should be backward compatible with existing code that uses 
PDFStreamEngine. No changes in any PDFStreamEngine clients in PDFBox were 
required for the test suite to pass, text extraction tool to work, etc. 
Nonetheless, it's possible you'll only consider these changes for inclusion in 
PDFBox 2.0, in which case they can be cleaned up to remove some of the backward 
compatibility crap that's currently in them. Let me know.

In terms of open issues or TODOs, the class naming could probably use work. I 
can't rename PDFStreamEngine or OperatorProcessor for backward compatibility 
reasons, so I've had to come up with more contrived names than I'd like.

The logic of the changes is:

- Move content stream argument accumulation and operator callback functionality 
into new PDFStreamProcessor class

- Add support for a default (fallback) handler to PDFStreamProcessor so 
operators not explicitly matched may be handled

- Modify PDFStreamEngine to extend PDFStreamProcessor, retaining all its 
existing methods though some are now inherited.

- Deprecate the properties-map based configuration of PDFStreamEngine because 
it'll be fragile whenever more than one classloader is in use. Add 
PDFStreamProcessor.replaceOperatorProcessors(...) for equivalent functionality 
using a type-safe, multi-classloader-safe HashMap of operator names to handler 
instances. This isn't added as a ctor override because operator handler 
registration/unregistration methods are not final (to preserve compatibility 
with PDFStreamEngine) and if overridden, they might use data from a 
not-yet-initialized derived class. If a ctor override is required then 
registerOperatorProcessor must be made final, breaking BC with PDFStreamEngine.

- Deprecate OperatorProcessor (the PDFStreamEngine operator handler class). 
Instances of this are bound to a particular PDFStreamEngine via the `context' 
property and they carry state when they don't have to. They're also an abstract 
class, so handlers can't extend any other class. OperatorProcessor based 
handlers continue to be supported just fine via a simple wrapper that's used 
automatically where required.

- Introduce new OperatorProcessorHandler interface to replace OperatorProcessor 
. It's a simple one-method interface that passes the PDFStreamProcessor as an 
argument, so application designers are free to choose whether to tie their 
OperationProcessorHandler implementations to PDFStreamProcessor instances or 
whether they want to re-use the same handler on many different processors. This 
change is useful for my app and removes unnecessary stateful API, but isn't 
strictly necessary and can be dropped while retaining the PDFStreamEngine / 
PDFStreamProcessor split. As part of the API change, new-interface handlers are 
passed the original arguments array rather than a copy; if they want a copy of 
the arguments array they have to take it themselves, so that resources aren't 
wasted copying the array when handlers don't actually need it copied.

- Add compatibility code to PDFStreamEngine to ensure that OperatorProcessor 
implementations are wrapped in a helper that translates 
OperatorProcessorHandler interface usage to the usage required by 
OperatorProcessor. All the wrapper does is set the context (which 
PDFStreamEngine seems to do before every handler call) then pass a copy of the 
arguments array.


I'm aware that this is a non-trivial change I'm proposing, but I think it 
significantly improves the API (especially once the BC stuff can be removed for 
PDFBox 2.0) and makes it easier to use this functionality.

    
> [PATCH] Split PDFStreamEngine, moving functionality to simpler stream 
> processor base class
> ------------------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1256
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1256
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Parsing, Text extraction
>    Affects Versions: 1.7.0, 2.0.0
>         Environment: N/A
>            Reporter: Craig Ringer
>            Priority: Minor
>              Labels: api, refactoring
>         Attachments: 
> 0002-New-PDFStreamProcessor-base-of-PDFStreamEngine-adds-.patch
>
>
> The attached patch restructures PDFStreamEngine to move the basic 
> functionality of invoking callbacks for each operator in a stream into a 
> parent class. The parent class knows nothing about the meaning of operators, 
> it just invokes handlers with accumulated arguments whenever it encounters an 
> operator. PDFStreamEngine retains all the "knowledge" of what those operators 
> mean, the state of the graphics state stack, etc.
> The purpose of the change is to make it simpler and easier to use PDFBox's 
> PDF stream processor/parser code without dealing with the full features of 
> PDFStreamEngine with its built-in operator handlers, awareness of the 
> graphics stack, etc when that functionality isn't required. Specifically, I 
> needed to write a tool that copies a PDF stream, renaming resource references 
> as it goes but otherwise leaving it unchanged. I wanted to handle all 
> operators including future or unknown ones, and only needed to special-case a 
> couple of them. PDFStreamEngine was poorly suited to that because it doesn't 
> support a default handler fallback, tries to "understand" the stream, etc. 
> Rather than write a new class that duplicated much of PDFStreamEngine I 
> thought I'd try to factor the required functionality out, so others could use 
> it too.
> The changes should be backward compatible with existing code that uses 
> PDFStreamEngine. No changes in any PDFStreamEngine clients in PDFBox were 
> required for the test suite to pass, text extraction tool to work, etc. 
> Nonetheless, it's possible you'll only consider these changes for inclusion 
> in PDFBox 2.0, in which case they can be cleaned up to remove some of the 
> backward compatibility crap that's currently in them. Let me know.
> In terms of open issues or TODOs, the class naming could probably use work. I 
> can't rename PDFStreamEngine or OperatorProcessor for backward compatibility 
> reasons, so I've had to come up with more contrived names than I'd like.
> The logic of the changes is:
> - Move content stream argument accumulation and operator callback 
> functionality into new PDFStreamProcessor class
> - Add support for a default (fallback) handler to PDFStreamProcessor so 
> operators not explicitly matched may be handled
> - Modify PDFStreamEngine to extend PDFStreamProcessor, retaining all its 
> existing methods though some are now inherited.
> - Deprecate the properties-map based configuration of PDFStreamEngine because 
> it'll be fragile whenever more than one classloader is in use. Add 
> PDFStreamProcessor.replaceOperatorProcessors(...) for equivalent 
> functionality using a type-safe, multi-classloader-safe HashMap of operator 
> names to handler instances. This isn't added as a ctor override because 
> operator handler registration/unregistration methods are not final (to 
> preserve compatibility with PDFStreamEngine) and if overridden, they might 
> use data from a not-yet-initialized derived class. If a ctor override is 
> required then registerOperatorProcessor must be made final, breaking BC with 
> PDFStreamEngine.
> - Deprecate OperatorProcessor (the PDFStreamEngine operator handler class). 
> Instances of this are bound to a particular PDFStreamEngine via the `context' 
> property and they carry state when they don't have to. They're also an 
> abstract class, so handlers can't extend any other class. OperatorProcessor 
> based handlers continue to be supported just fine via a simple wrapper that's 
> used automatically where required.
> - Introduce new OperatorProcessorHandler interface to replace 
> OperatorProcessor . It's a simple one-method interface that passes the 
> PDFStreamProcessor as an argument, so application designers are free to 
> choose whether to tie their OperationProcessorHandler implementations to 
> PDFStreamProcessor instances or whether they want to re-use the same handler 
> on many different processors. This change is useful for my app and removes 
> unnecessary stateful API, but isn't strictly necessary and can be dropped 
> while retaining the PDFStreamEngine / PDFStreamProcessor split. As part of 
> the API change, new-interface handlers are passed the original arguments 
> array rather than a copy; if they want a copy of the arguments array they 
> have to take it themselves, so that resources aren't wasted copying the array 
> when handlers don't actually need it copied.
> - Add compatibility code to PDFStreamEngine to ensure that OperatorProcessor 
> implementations are wrapped in a helper that translates 
> OperatorProcessorHandler interface usage to the usage required by 
> OperatorProcessor. All the wrapper does is set the context (which 
> PDFStreamEngine seems to do before every handler call) then pass a copy of 
> the arguments array.
> I'm aware that this is a non-trivial change I'm proposing, but I think it 
> significantly improves the API (especially once the BC stuff can be removed 
> for PDFBox 2.0) and makes it easier to use this functionality.
> Next patch in series: https://issues.apache.org/jira/browse/PDFBOX-1263

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Updated] (PDFBOX-1256) [PATCH] Split PDFStreamEngine, moving functionality to simpler stream processor base class

Reply via email to