Jonathan Prates created PDFBOX-5823:
---------------------------------------

             Summary: StringUtil.PATTERN_SPACE memory optmisation
                 Key: PDFBOX-5823
                 URL: https://issues.apache.org/jira/browse/PDFBOX-5823
             Project: PDFBox
          Issue Type: Improvement
          Components: PDModel
    Affects Versions: 3.0.3 PDFBox
            Reporter: Jonathan Prates
         Attachments: Screenshot 2024-05-19 at 22.39.10.png, Screenshot 2024-05-19 at 22.40.17.png


PDAbstractContentStream uses the StringUtil.PATTERN_SPACE regex to check whether a word has a space in it ([https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDAbstractContentStream.java#L1624]).

For large documents (~800 pages) made up of short strings (such as regular words), this causes memory overhead (see attached) because of the extra allocations made on each check.

I replaced the regex with word.contains checks for space and \t. Since contains is an O(n) operation that does not require extra allocations, memory usage went down.

What would be the implications of replacing this block with contains()? Since \s is [ \t\n\x0B\f\r], I believe a simplified check could cover the same characters while allocating less memory.
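
For illustration only, here is a minimal sketch of the two approaches side by side. It is not the PDFBox code: the class and method names are hypothetical, the pattern is assumed to be a plain "\\s", and the regex variant is assumed to use find() to test whether the word contains whitespace. The scan covers all six characters listed for \s above rather than only space and \t.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, not part of PDFBox.
final class WhitespaceCheck {
    // Assumed to mirror StringUtil.PATTERN_SPACE; \s is [ \t\n\x0B\f\r] by default.
    private static final Pattern PATTERN_SPACE = Pattern.compile("\\s");

    // Regex variant: every call creates a new Matcher object.
    static boolean containsWhitespaceRegex(String word) {
        return PATTERN_SPACE.matcher(word).find();
    }

    // Scan variant: one O(n) pass over the characters, no extra objects.
    static boolean containsWhitespaceScan(String word) {
        for (int i = 0; i < word.length(); i++) {
            switch (word.charAt(i)) {
                case ' ':
                case '\t':
                case '\n':
                case '\u000B': // vertical tab
                case '\f':
                case '\r':
                    return true;
                default:
                    break;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {"word", "two words", "tab\there", "", "line\nbreak"};
        for (String s : samples) {
            // Both variants should agree on every sample.
            if (containsWhitespaceRegex(s) != containsWhitespaceScan(s)) {
                throw new AssertionError("mismatch for sample of length " + s.length());
            }
        }
        System.out.println("regex and scan agree on all samples");
    }
}
```

A check limited to space and \t would differ from the regex only for words containing \n, \x0B, \f or \r, so whether that matters depends on what text reaches this code path.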