[ https://issues.apache.org/jira/browse/PDFBOX-1256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
John Hewson closed PDFBOX-1256. ------------------------------- Resolution: Won't Fix Fix Version/s: (was: 2.0.0) Closing, because the purpose of PDFStreamEngine is to execute custom operator handlers. If you just want the raw stream, use PDFStreamParser. > [PATCH] Split PDFStreamEngine, moving functionality to simpler stream > processor base class > ------------------------------------------------------------------------------------------ > > Key: PDFBOX-1256 > URL: https://issues.apache.org/jira/browse/PDFBOX-1256 > Project: PDFBox > Issue Type: Improvement > Components: PDModel > Affects Versions: 1.7.0, 2.0.0 > Environment: N/A > Reporter: Craig Ringer > Labels: api, refactoring, streams > Attachments: > 0002-New-PDFStreamProcessor-base-of-PDFStreamEngine-adds-.patch > > > The attached patch restructures PDFStreamEngine to move the basic > functionality of invoking callbacks for each operator in a stream into a > parent class. The parent class knows nothing about the meaning of operators, > it just invokes handlers with accumulated arguments whenever it encounters an > operator. PDFStreamEngine retains all the "knowledge" of what those operators > mean, the state of the graphics state stack, etc. > The purpose of the change is to make it simpler and easier to use PDFBox's > PDF stream processor/parser code without dealing with the full features of > PDFStreamEngine with its built-in operator handlers, awareness of the > graphics stack, etc when that functionality isn't required. Specifically, I > needed to write a tool that copies a PDF stream, renaming resource references > as it goes but otherwise leaving it unchanged. I wanted to handle all > operators including future or unknown ones, and only needed to special-case a > couple of them. PDFStreamEngine was poorly suited to that because it doesn't > support a default handler fallback, tries to "understand" the stream, etc. > Rather than write a new class that duplicated much of PDFStreamEngine I > thought I'd try to factor the required functionality out, so others could use > it too. > The changes should be backward compatible with existing code that uses > PDFStreamEngine. No changes in any PDFStreamEngine clients in PDFBox were > required for the test suite to pass, text extraction tool to work, etc. > Nonetheless, it's possible you'll only consider these changes for inclusion > in PDFBox 2.0, in which case they can be cleaned up to remove some of the > backward compatibility crap that's currently in them. Let me know. > In terms of open issues or TODOs, the class naming could probably use work. I > can't rename PDFStreamEngine or OperatorProcessor for backward compatibility > reasons, so I've had to come up with more contrived names than I'd like. > The logic of the changes is: > - Move content stream argument accumulation and operator callback > functionality into new PDFStreamProcessor class > - Add support for a default (fallback) handler to PDFStreamProcessor so > operators not explicitly matched may be handled > - Modify PDFStreamEngine to extend PDFStreamProcessor, retaining all its > existing methods though some are now inherited. > - Deprecate the properties-map based configuration of PDFStreamEngine because > it'll be fragile whenever more than one classloader is in use. Add > PDFStreamProcessor.replaceOperatorProcessors(...) for equivalent > functionality using a type-safe, multi-classloader-safe HashMap of operator > names to handler instances. This isn't added as a ctor override because > operator handler registration/unregistration methods are not final (to > preserve compatibility with PDFStreamEngine) and if overridden, they might > use data from a not-yet-initialized derived class. If a ctor override is > required then registerOperatorProcessor must be made final, breaking BC with > PDFStreamEngine. > - Deprecate OperatorProcessor (the PDFStreamEngine operator handler class). > Instances of this are bound to a particular PDFStreamEngine via the `context' > property and they carry state when they don't have to. They're also an > abstract class, so handlers can't extend any other class. OperatorProcessor > based handlers continue to be supported just fine via a simple wrapper that's > used automatically where required. > - Introduce new PDFStreamProcessor.OperatorHandler interface to replace > OperatorProcessor . It's a simple one-method interface that passes the > PDFStreamProcessor as an argument, so application designers are free to > choose whether to tie their OperationProcessorHandler implementations to > PDFStreamProcessor instances or whether they want to re-use the same handler > on many different processors. This change is useful for my app and removes > unnecessary stateful API, but isn't strictly necessary and can be dropped > while retaining the PDFStreamEngine / PDFStreamProcessor split. As part of > the API change, new-interface handlers are passed the original arguments > array rather than a copy; if they want a copy of the arguments array they > have to take it themselves, so that resources aren't wasted copying the array > when handlers don't actually need it copied. > - Add compatibility code to PDFStreamEngine to ensure that OperatorProcessor > implementations are wrapped in a helper that translates > OperatorProcessorHandler interface usage to the usage required by > OperatorProcessor. All the wrapper does is set the context (which > PDFStreamEngine seems to do before every handler call) then pass a copy of > the arguments array. > I'm aware that this is a non-trivial change I'm proposing, but I think it > significantly improves the API (especially once the BC stuff can be removed > for PDFBox 2.0) and makes it easier to use this functionality. > Prior patch in series (should be independent of this one): > https://issues.apache.org/jira/browse/PDFBOX-1255 > Next patch in series: https://issues.apache.org/jira/browse/PDFBOX-1263 -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org For additional commands, e-mail: dev-h...@pdfbox.apache.org