[ https://issues.apache.org/jira/browse/PDFBOX-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved PDFBOX-604.
----------------------------------
Resolution: Fixed
Fix Version/s: 1.0.0
Assignee: Jukka Zitting
See revisions 899802, 899804, 899806, 899807 and 899810 for the improvements I
made. This covers pretty much all of the remaining immediate simple bottlenecks
I could find through profiling, so I'm resolving this issue as fixed.
The biggest higher-level performance bottleneck is the way
o.a.p.util.PDFStreamEngine.processEncodedText() processes each glyph
separately. We would likely see major performance improvements if we refactored
things so that the entire string of encoded glyphs is first decoded in a single
operation and any graphics transformations are then applied to that whole block
before the individual characters are processed. That, however, is best handled
as a separate issue.
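
To make the idea concrete, here is a rough sketch of the two approaches on a
toy model. None of this is PDFBox code: the class, the one-byte decode() table,
the fixed per-glyph advance and the {a, b, c, d, e, f} matrix layout are all
assumptions made up for illustration. It only shows that the decoding and the
matrix product can be hoisted out of the per-glyph loop without changing the
result.

```java
/** Toy model only: none of these names come from PDFBox. */
final class GlyphBatchSketch {

    /** Hypothetical one-byte encoding that maps a code to a letter A..Z. */
    static char decode(byte code) {
        return (char) ('A' + (code & 0xFF) % 26);
    }

    /** Multiplies two 2D affine matrices stored as {a, b, c, d, e, f}. */
    static double[] multiply(double[] m, double[] n) {
        return new double[] {
            m[0] * n[0] + m[1] * n[2],
            m[0] * n[1] + m[1] * n[3],
            m[2] * n[0] + m[3] * n[2],
            m[2] * n[1] + m[3] * n[3],
            m[4] * n[0] + m[5] * n[2] + n[4],
            m[4] * n[1] + m[5] * n[3] + n[5]
        };
    }

    /** Per-glyph style: decode and rebuild the combined matrix for every code. */
    static double[] perGlyph(byte[] codes, double[] textMatrix, double[] ctm,
                             double advance, StringBuilder out) {
        double[] xs = new double[codes.length];
        for (int i = 0; i < codes.length; i++) {
            out.append(decode(codes[i]));
            // The same product is recomputed for each glyph.
            double[] combined = multiply(textMatrix, ctm);
            xs[i] = i * advance * combined[0] + combined[4];
        }
        return xs;
    }

    /** Batched style: decode the whole string once, combine matrices once. */
    static double[] batched(byte[] codes, double[] textMatrix, double[] ctm,
                            double advance, StringBuilder out) {
        char[] chars = new char[codes.length];
        for (int i = 0; i < codes.length; i++) {
            chars[i] = decode(codes[i]);
        }
        out.append(chars);

        // One matrix product for the whole block of glyphs.
        double[] combined = multiply(textMatrix, ctm);
        double[] xs = new double[codes.length];
        for (int i = 0; i < codes.length; i++) {
            xs[i] = i * advance * combined[0] + combined[4];
        }
        return xs;
    }
}
```

Both methods return the same x positions and append the same text. The real
text state is of course more involved (font widths, character and word spacing,
multi-byte encodings), so this should be read as an outline of the refactoring
rather than a drop-in change.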
> Various text extraction performance improvements
> ------------------------------------------------
>
> Key: PDFBOX-604
> URL: https://issues.apache.org/jira/browse/PDFBOX-604
> Project: PDFBox
> Issue Type: Improvement
> Components: Text extraction
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Fix For: 1.0.0
>
>
> Even after Mel's recent patches I've found a number of small performance
> bottlenecks that we could get rid of.