[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848357#comment-17848357 ]
Jonathan Prates edited comment on PDFBOX-5823 at 5/21/24 7:25 PM:
------------------------------------------------------------------

I've attached a profiler screenshot, and it seems that the predicate (even when static and created only once) is not a good option. Do you think you could compare on your side as well?

Please, if you don't mind, have a look at Main-1.java and Screenshot 2024-05-21 at 20.21.43.png. Perhaps I'm missing something.

> StringUtil.PATTERN_SPACE memory optmisation
> -------------------------------------------
>
>                 Key: PDFBOX-5823
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5823
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel
>    Affects Versions: 3.0.3 PDFBox
>            Reporter: Jonathan Prates
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>             Fix For: 3.0.3 PDFBox, 4.0.0
>
>         Attachments: Main-1.java, Main.java,
>                      Screenshot 2024-05-19 at 22.39.10.png,
>                      Screenshot 2024-05-19 at 22.40.17.png,
>                      Screenshot 2024-05-21 at 20.21.43.png
>
>
> PDAbstractContentStream uses the StringUtil.PATTERN_SPACE regexp to check
> whether a word has a space in it
> ([https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDAbstractContentStream.java#L1624]).
> For large documents (~800 pages) and small string sequences (like a regular
> word), this causes a memory overhead (see attached) due to the several extra
> allocations. I've replaced the regexp with checks for space and \t using
> word.contains, and since that is a simple scan that does not require extra
> allocations, memory usage has been reduced.
> What would be the implications of replacing this block with contains()?
> Since \s is [ \t\n\x0B\f\r], I believe we can use a simplified version that
> allocates less memory.
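
For reference, the trade-off described in the issue can be sketched roughly as follows. This is only an illustration, not the PDFBox code: the class WhitespaceCheck and both method names are hypothetical, and the pattern is assumed to be a plain "\\s", which may differ from the actual definition of StringUtil.PATTERN_SPACE. The regex variant creates a Matcher on every call, while the scan variant only reads characters and covers the same set [ \t\n\x0B\f\r].

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: contrasts a regex-based
// whitespace check with a plain character scan.
final class WhitespaceCheck {

    // Assumption: stands in for StringUtil.PATTERN_SPACE; the real
    // definition in PDFBox may differ.
    private static final Pattern PATTERN_SPACE = Pattern.compile("\\s");

    // Regex variant: every call allocates a new Matcher object.
    static boolean containsWhitespaceRegex(String word) {
        return PATTERN_SPACE.matcher(word).find();
    }

    // Scan variant: walks the characters once and allocates nothing.
    // The cases mirror Java's default \s class: [ \t\n\x0B\f\r].
    static boolean containsWhitespaceScan(String word) {
        for (int i = 0; i < word.length(); i++) {
            switch (word.charAt(i)) {
                case ' ':
                case '\t':
                case '\n':
                case '\u000B':
                case '\f':
                case '\r':
                    return true;
                default:
                    break;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String[] samples = {"word", "two words", "tab\there", ""};
        for (String s : samples) {
            System.out.println(containsWhitespaceRegex(s) + " " + containsWhitespaceScan(s));
        }
    }
}
```

Checking only ' ' and '\t' via word.contains, as proposed in the issue, would behave differently from \s for words containing \n, \x0B, \f or \r, so whether that narrower check is acceptable depends on what input the content stream can actually receive.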