Jonathan Prates created PDFBOX-5823:
---------------------------------------

             Summary: StringUtil.PATTERN_SPACE memory optimisation
                 Key: PDFBOX-5823
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5823
             Project: PDFBox
          Issue Type: Improvement
          Components: PDModel
    Affects Versions: 3.0.3 PDFBox
            Reporter: Jonathan Prates
         Attachments: Screenshot 2024-05-19 at 22.39.10.png, Screenshot 
2024-05-19 at 22.40.17.png

PDAbstractContentStream uses the StringUtil.PATTERN_SPACE regexp to check 
whether a word has a space in it 
([https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDAbstractContentStream.java#L1624])

For large documents (~800 pages) made up of short strings (such as ordinary 
words), this causes memory overhead (see the attached screenshots) because of 
the extra allocations made on each check. I replaced the regexp with 
word.contains checks for space and \t. Since contains is an O(n) operation 
that requires no extra allocations, memory usage dropped.

What would be the implications of replacing this block with contains()?

Since \s is [ \t\n\x0B\f\r], I believe a simplified check over those 
characters would allocate less memory.
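
To make the comparison concrete, here is a minimal sketch of the two 
approaches. It is not the PDFBox code: the class SpaceCheck and both method 
names are hypothetical, the pattern is assumed to be "\\s", and the regexp 
variant assumes "contains a whitespace character" semantics (find()), which 
may differ from how PDAbstractContentStream actually applies the pattern.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
final class SpaceCheck {
    // Assumption: mirrors StringUtil.PATTERN_SPACE as a plain "\\s" pattern.
    static final Pattern PATTERN_SPACE = Pattern.compile("\\s");

    // Regexp variant: every call creates a new Matcher object.
    static boolean hasWhitespaceRegex(String word) {
        return PATTERN_SPACE.matcher(word).find();
    }

    // Scan variant: checks the six characters that \s covers by default
    // ([ \t\n\x0B\f\r]) without allocating any objects.
    static boolean hasWhitespaceScan(String word) {
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (c == ' ' || c == '\t' || c == '\n'
                    || c == 0x0B || c == '\f' || c == '\r') {
                return true;
            }
        }
        return false;
    }
}
```

Both methods should agree for any input as long as the pattern is compiled 
without Pattern.UNICODE_CHARACTER_CLASS. Checking only space and \t, as in my 
change, is a narrower test than \s, so words containing \n, \x0B, \f or \r 
would be classified differently.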

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

