[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860122#comment-17860122 ]
Jonathan Prates commented on PDFBOX-5823:
-----------------------------------------

Hi [~lehmi], is there an estimated release date for 3.0.3?

> StringUtil.PATTERN_SPACE memory optmisation
> -------------------------------------------
>
>                 Key: PDFBOX-5823
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5823
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel
>    Affects Versions: 3.0.3 PDFBox
>            Reporter: Jonathan Prates
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>             Fix For: 3.0.3 PDFBox, 4.0.0
>
>         Attachments: Main-1.java, Main.java, Screenshot 2024-05-19 at
> 22.39.10.png, Screenshot 2024-05-19 at 22.40.17.png, Screenshot 2024-05-21 at
> 20.21.43.png
>
>
> PDAbstractContentStream uses the StringUtil.PATTERN_SPACE regexp to check
> whether a word contains whitespace
> ([https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDAbstractContentStream.java#L1624]).
> For large documents (~800 pages) made up of short strings (such as ordinary
> words), this causes memory overhead (see the attached screenshots) because of
> the extra allocations made for each match. I replaced the regexp with
> word.contains checks for space and \t. For short words this is effectively a
> constant-time scan that requires no extra allocations, so memory usage
> dropped.
> What would be the implications of replacing this block with contains()?
> Since \s is [ \t\n\x0B\f\r], I believe a simplified check would allocate
> less memory.
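
For readers following the proposal, here is a minimal sketch of the two approaches being compared. It is not the PDFBox code and not the attached Main.java: the class name WhitespaceCheck and both helper methods are hypothetical, and the local PATTERN_SPACE is assumed to be a compiled \s pattern. The scan covers the full [ \t\n\x0B\f\r] set quoted in the issue, whereas the change described above only checks for space and \t.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; names are not taken from PDFBox.
final class WhitespaceCheck {
    // Assumption: a precompiled pattern equivalent to \s, i.e. [ \t\n\x0B\f\r].
    private static final Pattern PATTERN_SPACE = Pattern.compile("\\s");

    // Regex variant: every call creates a new Matcher object.
    static boolean hasWhitespaceRegex(String word) {
        return PATTERN_SPACE.matcher(word).find();
    }

    // Scan variant: walks the characters once and allocates nothing.
    static boolean hasWhitespaceScan(String word) {
        for (int i = 0; i < word.length(); i++) {
            switch (word.charAt(i)) {
                case ' ':
                case '\t':
                case '\n':
                case '\u000B': // vertical tab
                case '\f':
                case '\r':
                    return true;
                default:
                    break;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {"word", "two words", "tab\there", ""};
        for (String s : samples) {
            // Both variants should agree on every input.
            System.out.println(hasWhitespaceRegex(s) == hasWhitespaceScan(s));
        }
    }
}
```

The scan is linear in the length of the word, so for the short words described in the issue its cost is small and it avoids the per-call Matcher allocation. Whether dropping the less common whitespace characters is acceptable is the open question raised in the issue.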