[
https://issues.apache.org/jira/browse/PDFBOX-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Craig Ringer updated PDFBOX-1256:
---------------------------------
Description:
The attached patch restructures PDFStreamEngine to move the basic functionality
of invoking callbacks for each operator in a stream into a parent class. The
parent class knows nothing about the meaning of operators, it just invokes
handlers with accumulated arguments whenever it encounters an operator.
PDFStreamEngine retains all the "knowledge" of what those operators mean, the
state of the graphics state stack, etc.
The purpose of the change is to make it simpler and easier to use PDFBox's PDF
stream processor/parser code without dealing with the full features of
PDFStreamEngine with its built-in operator handlers, awareness of the graphics
stack, etc when that functionality isn't required. Specifically, I needed to
write a tool that copies a PDF stream, renaming resource references as it goes
but otherwise leaving it unchanged. I wanted to handle all operators including
future or unknown ones, and only needed to special-case a couple of them.
PDFStreamEngine was poorly suited to that because it doesn't support a default
handler fallback, tries to "understand" the stream, etc. Rather than write a
new class that duplicated much of PDFStreamEngine I thought I'd try to factor
the required functionality out, so others could use it too.
The changes should be backward compatible with existing code that uses
PDFStreamEngine. No changes in any PDFStreamEngine clients in PDFBox were
required for the test suite to pass, text extraction tool to work, etc.
Nonetheless, it's possible you'll only consider these changes for inclusion in
PDFBox 2.0, in which case they can be cleaned up to remove some of the backward
compatibility crap that's currently in them. Let me know.
In terms of open issues or TODOs, the class naming could probably use work. I
can't rename PDFStreamEngine or OperatorProcessor for backward compatibility
reasons, so I've had to come up with more contrived names than I'd like.
The logic of the changes is:
- Move content stream argument accumulation and operator callback functionality
into new PDFStreamProcessor class
- Add support for a default (fallback) handler to PDFStreamProcessor so
operators not explicitly matched may be handled
- Modify PDFStreamEngine to extend PDFStreamProcessor, retaining all its
existing methods though some are now inherited.
- Deprecate the properties-map based configuration of PDFStreamEngine because
it'll be fragile whenever more than one classloader is in use. Add
PDFStreamProcessor.replaceOperatorProcessors(...) for equivalent functionality
using a type-safe, multi-classloader-safe HashMap of operator names to handler
instances. This isn't added as a ctor override because operator handler
registration/unregistration methods are not final (to preserve compatibility
with PDFStreamEngine) and if overridden, they might use data from a
not-yet-initialized derived class. If a ctor override is required then
registerOperatorProcessor must be made final, breaking BC with PDFStreamEngine.
- Deprecate OperatorProcessor (the PDFStreamEngine operator handler class).
Instances of this are bound to a particular PDFStreamEngine via the `context'
property and they carry state when they don't have to. They're also an abstract
class, so handlers can't extend any other class. OperatorProcessor based
handlers continue to be supported just fine via a simple wrapper that's used
automatically where required.
- Introduce new OperatorProcessorHandler interface to replace OperatorProcessor
. It's a simple one-method interface that passes the PDFStreamProcessor as an
argument, so application designers are free to choose whether to tie their
OperationProcessorHandler implementations to PDFStreamProcessor instances or
whether they want to re-use the same handler on many different processors. This
change is useful for my app and removes unnecessary stateful API, but isn't
strictly necessary and can be dropped while retaining the PDFStreamEngine /
PDFStreamProcessor split. As part of the API change, new-interface handlers are
passed the original arguments array rather than a copy; if they want a copy of
the arguments array they have to take it themselves, so that resources aren't
wasted copying the array when handlers don't actually need it copied.
- Add compatibility code to PDFStreamEngine to ensure that OperatorProcessor
implementations are wrapped in a helper that translates
OperatorProcessorHandler interface usage to the usage required by
OperatorProcessor. All the wrapper does is set the context (which
PDFStreamEngine seems to do before every handler call) then pass a copy of the
arguments array.
I'm aware that this is a non-trivial change I'm proposing, but I think it
significantly improves the API (especially once the BC stuff can be removed for
PDFBox 2.0) and makes it easier to use this functionality.
Next patch in series: https://issues.apache.org/jira/browse/PDFBOX-1263
was:
The attached patch restructures PDFStreamEngine to move the basic functionality
of invoking callbacks for each operator in a stream into a parent class. The
parent class knows nothing about the meaning of operators, it just invokes
handlers with accumulated arguments whenever it encounters an operator.
PDFStreamEngine retains all the "knowledge" of what those operators mean, the
state of the graphics state stack, etc.
The purpose of the change is to make it simpler and easier to use PDFBox's PDF
stream processor/parser code without dealing with the full features of
PDFStreamEngine with its built-in operator handlers, awareness of the graphics
stack, etc when that functionality isn't required. Specifically, I needed to
write a tool that copies a PDF stream, renaming resource references as it goes
but otherwise leaving it unchanged. I wanted to handle all operators including
future or unknown ones, and only needed to special-case a couple of them.
PDFStreamEngine was poorly suited to that because it doesn't support a default
handler fallback, tries to "understand" the stream, etc. Rather than write a
new class that duplicated much of PDFStreamEngine I thought I'd try to factor
the required functionality out, so others could use it too.
The changes should be backward compatible with existing code that uses
PDFStreamEngine. No changes in any PDFStreamEngine clients in PDFBox were
required for the test suite to pass, text extraction tool to work, etc.
Nonetheless, it's possible you'll only consider these changes for inclusion in
PDFBox 2.0, in which case they can be cleaned up to remove some of the backward
compatibility crap that's currently in them. Let me know.
In terms of open issues or TODOs, the class naming could probably use work. I
can't rename PDFStreamEngine or OperatorProcessor for backward compatibility
reasons, so I've had to come up with more contrived names than I'd like.
The logic of the changes is:
- Move content stream argument accumulation and operator callback functionality
into new PDFStreamProcessor class
- Add support for a default (fallback) handler to PDFStreamProcessor so
operators not explicitly matched may be handled
- Modify PDFStreamEngine to extend PDFStreamProcessor, retaining all its
existing methods though some are now inherited.
- Deprecate the properties-map based configuration of PDFStreamEngine because
it'll be fragile whenever more than one classloader is in use. Add
PDFStreamProcessor.replaceOperatorProcessors(...) for equivalent functionality
using a type-safe, multi-classloader-safe HashMap of operator names to handler
instances. This isn't added as a ctor override because operator handler
registration/unregistration methods are not final (to preserve compatibility
with PDFStreamEngine) and if overridden, they might use data from a
not-yet-initialized derived class. If a ctor override is required then
registerOperatorProcessor must be made final, breaking BC with PDFStreamEngine.
- Deprecate OperatorProcessor (the PDFStreamEngine operator handler class).
Instances of this are bound to a particular PDFStreamEngine via the `context'
property and they carry state when they don't have to. They're also an abstract
class, so handlers can't extend any other class. OperatorProcessor based
handlers continue to be supported just fine via a simple wrapper that's used
automatically where required.
- Introduce new OperatorProcessorHandler interface to replace OperatorProcessor
. It's a simple one-method interface that passes the PDFStreamProcessor as an
argument, so application designers are free to choose whether to tie their
OperationProcessorHandler implementations to PDFStreamProcessor instances or
whether they want to re-use the same handler on many different processors. This
change is useful for my app and removes unnecessary stateful API, but isn't
strictly necessary and can be dropped while retaining the PDFStreamEngine /
PDFStreamProcessor split. As part of the API change, new-interface handlers are
passed the original arguments array rather than a copy; if they want a copy of
the arguments array they have to take it themselves, so that resources aren't
wasted copying the array when handlers don't actually need it copied.
- Add compatibility code to PDFStreamEngine to ensure that OperatorProcessor
implementations are wrapped in a helper that translates
OperatorProcessorHandler interface usage to the usage required by
OperatorProcessor. All the wrapper does is set the context (which
PDFStreamEngine seems to do before every handler call) then pass a copy of the
arguments array.
I'm aware that this is a non-trivial change I'm proposing, but I think it
significantly improves the API (especially once the BC stuff can be removed for
PDFBox 2.0) and makes it easier to use this functionality.
> [PATCH] Split PDFStreamEngine, moving functionality to simpler stream
> processor base class
> ------------------------------------------------------------------------------------------
>
> Key: PDFBOX-1256
> URL: https://issues.apache.org/jira/browse/PDFBOX-1256
> Project: PDFBox
> Issue Type: Improvement
> Components: Parsing, Text extraction
> Affects Versions: 1.7.0, 2.0.0
> Environment: N/A
> Reporter: Craig Ringer
> Priority: Minor
> Labels: api, refactoring
> Attachments:
> 0002-New-PDFStreamProcessor-base-of-PDFStreamEngine-adds-.patch
>
>
> The attached patch restructures PDFStreamEngine to move the basic
> functionality of invoking callbacks for each operator in a stream into a
> parent class. The parent class knows nothing about the meaning of operators,
> it just invokes handlers with accumulated arguments whenever it encounters an
> operator. PDFStreamEngine retains all the "knowledge" of what those operators
> mean, the state of the graphics state stack, etc.
> The purpose of the change is to make it simpler and easier to use PDFBox's
> PDF stream processor/parser code without dealing with the full features of
> PDFStreamEngine with its built-in operator handlers, awareness of the
> graphics stack, etc when that functionality isn't required. Specifically, I
> needed to write a tool that copies a PDF stream, renaming resource references
> as it goes but otherwise leaving it unchanged. I wanted to handle all
> operators including future or unknown ones, and only needed to special-case a
> couple of them. PDFStreamEngine was poorly suited to that because it doesn't
> support a default handler fallback, tries to "understand" the stream, etc.
> Rather than write a new class that duplicated much of PDFStreamEngine I
> thought I'd try to factor the required functionality out, so others could use
> it too.
> The changes should be backward compatible with existing code that uses
> PDFStreamEngine. No changes in any PDFStreamEngine clients in PDFBox were
> required for the test suite to pass, text extraction tool to work, etc.
> Nonetheless, it's possible you'll only consider these changes for inclusion
> in PDFBox 2.0, in which case they can be cleaned up to remove some of the
> backward compatibility crap that's currently in them. Let me know.
> In terms of open issues or TODOs, the class naming could probably use work. I
> can't rename PDFStreamEngine or OperatorProcessor for backward compatibility
> reasons, so I've had to come up with more contrived names than I'd like.
> The logic of the changes is:
> - Move content stream argument accumulation and operator callback
> functionality into new PDFStreamProcessor class
> - Add support for a default (fallback) handler to PDFStreamProcessor so
> operators not explicitly matched may be handled
> - Modify PDFStreamEngine to extend PDFStreamProcessor, retaining all its
> existing methods though some are now inherited.
> - Deprecate the properties-map based configuration of PDFStreamEngine because
> it'll be fragile whenever more than one classloader is in use. Add
> PDFStreamProcessor.replaceOperatorProcessors(...) for equivalent
> functionality using a type-safe, multi-classloader-safe HashMap of operator
> names to handler instances. This isn't added as a ctor override because
> operator handler registration/unregistration methods are not final (to
> preserve compatibility with PDFStreamEngine) and if overridden, they might
> use data from a not-yet-initialized derived class. If a ctor override is
> required then registerOperatorProcessor must be made final, breaking BC with
> PDFStreamEngine.
> - Deprecate OperatorProcessor (the PDFStreamEngine operator handler class).
> Instances of this are bound to a particular PDFStreamEngine via the `context'
> property and they carry state when they don't have to. They're also an
> abstract class, so handlers can't extend any other class. OperatorProcessor
> based handlers continue to be supported just fine via a simple wrapper that's
> used automatically where required.
> - Introduce new OperatorProcessorHandler interface to replace
> OperatorProcessor . It's a simple one-method interface that passes the
> PDFStreamProcessor as an argument, so application designers are free to
> choose whether to tie their OperationProcessorHandler implementations to
> PDFStreamProcessor instances or whether they want to re-use the same handler
> on many different processors. This change is useful for my app and removes
> unnecessary stateful API, but isn't strictly necessary and can be dropped
> while retaining the PDFStreamEngine / PDFStreamProcessor split. As part of
> the API change, new-interface handlers are passed the original arguments
> array rather than a copy; if they want a copy of the arguments array they
> have to take it themselves, so that resources aren't wasted copying the array
> when handlers don't actually need it copied.
> - Add compatibility code to PDFStreamEngine to ensure that OperatorProcessor
> implementations are wrapped in a helper that translates
> OperatorProcessorHandler interface usage to the usage required by
> OperatorProcessor. All the wrapper does is set the context (which
> PDFStreamEngine seems to do before every handler call) then pass a copy of
> the arguments array.
> I'm aware that this is a non-trivial change I'm proposing, but I think it
> significantly improves the API (especially once the BC stuff can be removed
> for PDFBox 2.0) and makes it easier to use this functionality.
> Next patch in series: https://issues.apache.org/jira/browse/PDFBOX-1263
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira