[jira] Resolved: (PDFBOX-604) Various text extraction performance improvements

Jukka Zitting (JIRA) Mon, 18 Jan 2010 02:44:20 -0800

     [ https://issues.apache.org/jira/browse/PDFBOX-604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jukka Zitting resolved PDFBOX-604.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.0.0
         Assignee: Jukka Zitting

See revisions 899802, 899804, 899806, 899807 and 899810 for the improvements I 
made. This covers pretty much all of the remaining immediate simple bottlenecks 
I could find through profiling, so I'm resolving this issue as fixed.

The biggest higher-level performance bottleneck is the way 
o.a.p.util.PDFStreamEngine.processEncodedText() processes each glyph 
separately. We would likely see major performance improvements if we 
refactored things so that the entire string of encoded glyphs is first 
decoded in a single operation, and any graphics transformations are then 
applied to that whole block before the individual characters are processed. 
That, however, is best handled as a separate issue.
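
To illustrate the kind of refactoring meant here, below is a rough, 
self-contained sketch. It is not PDFBox code: the class BatchDecodeSketch, 
its methods, the single-byte ISO-8859-1 encoding and the fixed per-glyph 
advance are all assumptions made up for the example. It only contrasts 
decoding and transforming one glyph at a time with decoding the whole 
string once and computing the combined matrix once.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; none of these names exist in PDFBox.
class BatchDecodeSketch {

    /** Decoded text plus the transformed x position of each glyph. */
    static final class Result {
        final String text;
        final double[] xs;
        Result(String text, double[] xs) { this.text = text; this.xs = xs; }
    }

    /** Multiplies two 3x3 matrices (row-vector convention, as used in PDF). */
    static double[][] multiply(double[][] a, double[][] b) {
        double[][] r = new double[3][3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                for (int k = 0; k < 3; k++)
                    r[i][j] += a[i][k] * b[k][j];
        return r;
    }

    /** Per-glyph approach: decode and rebuild the combined matrix for every glyph. */
    static Result perGlyph(byte[] encoded, double[][] textMatrix,
                           double[][] ctm, double advance) {
        StringBuilder text = new StringBuilder();
        double[] xs = new double[encoded.length];
        double x = 0;
        for (int i = 0; i < encoded.length; i++) {
            // Assumption: one byte per glyph, decoded as ISO-8859-1.
            text.append(new String(encoded, i, 1, StandardCharsets.ISO_8859_1));
            double[][] m = multiply(textMatrix, ctm); // repeated work per glyph
            xs[i] = x * m[0][0] + m[2][0];            // x' for the point (x, 0)
            x += advance;
        }
        return new Result(text.toString(), xs);
    }

    /** Batched approach: decode the whole string and combine the matrices once. */
    static Result batched(byte[] encoded, double[][] textMatrix,
                          double[][] ctm, double advance) {
        String text = new String(encoded, StandardCharsets.ISO_8859_1);
        double[][] m = multiply(textMatrix, ctm);     // computed a single time
        double[] xs = new double[encoded.length];
        double x = 0;
        for (int i = 0; i < xs.length; i++) {
            xs[i] = x * m[0][0] + m[2][0];
            x += advance;
        }
        return new Result(text, xs);
    }

    public static void main(String[] args) {
        double[][] scale = {{2, 0, 0}, {0, 2, 0}, {0, 0, 1}};
        double[][] shift = {{1, 0, 0}, {0, 1, 0}, {10, 5, 1}};
        byte[] encoded = "Hi!".getBytes(StandardCharsets.ISO_8859_1);
        Result r = batched(encoded, scale, shift, 1.0);
        System.out.println(r.text + " " + Arrays.toString(r.xs));
    }
}
```

Both methods return the same text and positions; the batched one just 
hoists the decoding and the matrix multiplication out of the per-glyph 
loop. Real fonts have multi-byte encodings and per-glyph widths, so an 
actual change to PDFStreamEngine would be more involved than this.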

> Various text extraction performance improvements
> ------------------------------------------------
>
>                 Key: PDFBOX-604
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-604
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Text extraction
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>             Fix For: 1.0.0
>
>
> Even after Mel's recent patches I've found a number of small performance 
> bottlenecks that we could get rid of.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
