Axel Howind created PDFBOX-6002:
-----------------------------------
Summary: change parse methods to take CharSequence argument
Key: PDFBOX-6002
URL: https://issues.apache.org/jira/browse/PDFBOX-6002
Project: PDFBox
Issue Type: Improvement
Reporter: Axel Howind
Attachments: image-2025-05-02-07-00-52-161.png
PDFBox parsing works on Strings in almost all places. Often, StringBuilder
instances are created to prepare a fragment to parse, and then another parse
method is called using the result of calling toString() on the StringBuilder.
If the parse methods were changed to take CharSequence instead, the
StringBuilder instance could be passed on without creating a temporary String
instance. This would reduce memory consumption and load on the GC.
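To illustrate the idea, here is a minimal sketch. It is not PDFBox code: the class CharSequenceParseSketch and the method parseUnsignedLong() are hypothetical names chosen for this example, and overflow handling is left out. It shows that a parser written against CharSequence accepts a StringBuilder directly, so the caller needs no intermediate String.

```java
// Hypothetical sketch, not taken from PDFBox.
final class CharSequenceParseSketch {

    // Parses a non-empty run of ASCII digits. Because the parameter type is
    // CharSequence, callers can pass a String, a StringBuilder or a CharBuffer.
    // Overflow is deliberately not checked in this sketch.
    static long parseUnsignedLong(CharSequence s) {
        if (s.length() == 0) {
            throw new NumberFormatException("empty input");
        }
        long value = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                throw new NumberFormatException("not a digit at index " + i);
            }
            value = value * 10 + (c - '0');
        }
        return value;
    }

    public static void main(String[] args) {
        StringBuilder fragment = new StringBuilder();
        fragment.append("12").append("345");

        // With a String-only API the caller would have to copy the buffer first:
        long viaString = parseUnsignedLong(fragment.toString());

        // With a CharSequence parameter the builder is passed as-is, no copy:
        long direct = parseUnsignedLong(fragment);

        System.out.println(viaString == direct);
    }
}
```

Both calls return the same value; only the first one allocates a temporary String.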
I did some profiling with async-profiler. In BaseParser.parseCOSNumber(), for
example, about 25% of the runtime is spent in StringBuilder.toString(). That
cost would be eliminated completely if the parse methods worked on
CharSequences instead of Strings (see image):
!image-2025-05-02-07-00-52-161.png!
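A consequence would be that user code needs to be recompiled against the new
version, because the method signatures change. No code changes should be needed
on the user side, since every String is a CharSequence, but already compiled
classes would still reference the old String signatures.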
An alternative approach is to introduce new methods with the suffix CS, like
parseCOSNumberCS(), and to have parseCOSNumber() delegate to the new method.
This would be a PDFBox 3 compatible change.
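A minimal sketch of this delegation pattern follows. The class DelegationSketch and the methods parseIntValue() and parseIntValueCS() are hypothetical stand-ins, not the real BaseParser methods, and overflow is not checked. The existing String method keeps its signature and forwards to the new CharSequence method, so already compiled callers continue to link.

```java
// Hypothetical sketch, not taken from PDFBox.
final class DelegationSketch {

    // Existing public signature stays unchanged, so code compiled against
    // the old version still finds this method at link time.
    static int parseIntValue(String s) {
        return parseIntValueCS(s);
    }

    // New method doing the actual work on any CharSequence.
    // Accepts an optional leading '-'; overflow is not checked in this sketch.
    static int parseIntValueCS(CharSequence s) {
        int start = (s.length() > 0 && s.charAt(0) == '-') ? 1 : 0;
        if (start == s.length()) {
            throw new NumberFormatException("no digits in input");
        }
        int value = 0;
        for (int i = start; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '0' || c > '9') {
                throw new NumberFormatException("not a digit at index " + i);
            }
            value = value * 10 + (c - '0');
        }
        return start == 1 ? -value : value;
    }

    public static void main(String[] args) {
        // Old callers keep using the String method.
        System.out.println(parseIntValue("-42"));
        // New callers can hand over a StringBuilder without toString().
        System.out.println(parseIntValueCS(new StringBuilder("17")));
    }
}
```

The cost of this variant is a second method name per parse method; the benefit is that nothing breaks for existing users.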
Please let me know whether you would accept a patch and, if so, which of the
two variants you prefer. I would then create incremental patches to provide
this functionality.