[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860122#comment-17860122 ]

Jonathan Prates commented on PDFBOX-5823:
-----------------------------------------

Hi [~lehmi], is there an estimated date for 3.0.3 to go live?

> StringUtil.PATTERN_SPACE memory optmisation
> -------------------------------------------
>
>                 Key: PDFBOX-5823
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5823
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: PDModel
>    Affects Versions: 3.0.3 PDFBox
>            Reporter: Jonathan Prates
>            Assignee: Andreas Lehmkühler
>            Priority: Minor
>             Fix For: 3.0.3 PDFBox, 4.0.0
>
>         Attachments: Main-1.java, Main.java, Screenshot 2024-05-19 at 22.39.10.png, Screenshot 2024-05-19 at 22.40.17.png, Screenshot 2024-05-21 at 20.21.43.png
>
>
> PDAbstractContentStream uses the StringUtil.PATTERN_SPACE regexp to check whether a word contains whitespace
> ([https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDAbstractContentStream.java#L1624]).
> For large documents (~800 pages) made up of short strings (such as regular words), this causes memory overhead (see attached) because of the many extra allocations. I've replaced the regexp with word.contains checks for space and \t; since that is a cheap scan on such short strings and requires no extra allocations, memory usage has gone down.
> What would be the implications of replacing this block with contains()?
> Since \s is [ \t\n\x0B\f\r], I believe a simplified version would allocate less memory.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848978#comment-17848978 ]

Jonathan Prates commented on PDFBOX-5823:
-----------------------------------------

[~lehmi] thanks! this alternative solves the memory issue.
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848464#comment-17848464 ]

Andreas Lehmkühler commented on PDFBOX-5823:
--------------------------------------------

I didn't have to dig too deep to find out that I was wrong. Every use of the predicate function created a new Matcher object. I've followed [~msahyoun]'s proposal and replaced the predicate with a simplified version of StringUtils.isBlank from commons-lang.
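A minimal sketch of what a Java 8 compatible blank check along the lines of commons-lang's StringUtils.isBlank might look like. The class name BlankCheck is hypothetical, and this is not the code from the PDFBox commit; it only illustrates that a plain character loop needs no Matcher or other per-call objects.

```java
// Hypothetical helper for illustration; not taken from PDFBox.
final class BlankCheck {
    private BlankCheck() {
    }

    /** Returns true if s is null, empty, or consists only of whitespace characters. */
    static boolean isBlank(CharSequence s) {
        if (s == null) {
            return true;
        }
        // Plain loop over the characters: nothing is allocated per call.
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isWhitespace(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isBlank(" \t"));
        System.out.println(isBlank("word"));
    }
}
```

Like the commons-lang method, this sketch treats null and the empty string as blank; a caller that only wants to detect single whitespace tokens would additionally check the length.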
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848463#comment-17848463 ]

ASF subversion and git services commented on PDFBOX-5823:
---------------------------------------------------------

Commit 1917878 from le...@apache.org in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1917878 ]

PDFBOX-5823: replace Predicate to avoid creating new objects with every call
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848392#comment-17848392 ]

Andreas Lehmkühler commented on PDFBOX-5823:
--------------------------------------------

Looks like I'm missing something. I'm going to have a deeper look
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848357#comment-17848357 ]

Jonathan Prates commented on PDFBOX-5823:
-----------------------------------------

I've attached a profiler screenshot, and it seems the predicate (even when static and created only once) is not a good option. Could you compare on your side as well?
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848344#comment-17848344 ]

Andreas Lehmkühler commented on PDFBOX-5823:
--------------------------------------------

[~thumbox] yes, but the matcher is static and created only once
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848329#comment-17848329 ]

Jonathan Prates commented on PDFBOX-5823:
-----------------------------------------

[~lehmi] I believe asPredicate() will instantiate a Matcher on each call, which could cause the same high memory utilisation.
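To illustrate the concern about asPredicate(): the Pattern Javadoc describes the returned predicate as behaving like s -> matcher(s).find(), so keeping the predicate in a static field saves creating the predicate, not the Matcher used for each test. The sketch below is an assumption-laden illustration: the class name PredicateDemo is hypothetical, and the pattern "\\s" is assumed from the issue description rather than copied from PDFBox.

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is assumed to be equivalent to StringUtil.PATTERN_SPACE.
final class PredicateDemo {
    static final Pattern SPACE = Pattern.compile("\\s");

    // Created once, but documented to behave like the lambda below.
    static final Predicate<String> VIA_AS_PREDICATE = SPACE.asPredicate();

    // Explicit form: a new Matcher is created on every call to test().
    static final Predicate<String> EXPLICIT = s -> SPACE.matcher(s).find();

    private PredicateDemo() {
    }

    public static void main(String[] args) {
        for (String word : new String[] {"word", " ", "two words"}) {
            System.out.println(VIA_AS_PREDICATE.test(word) == EXPLICIT.test(word));
        }
    }
}
```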
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848320#comment-17848320 ]

ASF subversion and git services commented on PDFBOX-5823:
---------------------------------------------------------

Commit 1917862 from le...@apache.org in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1917862 ]

PDFBOX-5823: simplify pattern matching to optimize memory consumption based on a proposal by Jonathan Prates
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848319#comment-17848319 ]

Andreas Lehmkühler commented on PDFBOX-5823:
--------------------------------------------

Thanks for the proposals but I've found another solution
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848277#comment-17848277 ]

Jonathan Prates commented on PDFBOX-5823:
-----------------------------------------

Agreed, we could copy the StringUtils.isBlank() code
[https://github.com/apache/commons-lang/blob/master/src/main/java/org/apache/commons/lang3/StringUtils.java#L3623C1-L3634C6]
or use something like
{code:java}
// Note: returns false for null, whereas commons-lang's isBlank(null) returns true.
public boolean isBlank(String s) {
    return s != null && s.chars().allMatch(Character::isWhitespace);
}
{code}
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848274#comment-17848274 ]

Maruan Sahyoun commented on PDFBOX-5823:
----------------------------------------

What about using Apache Commons Lang's StringUtils.isBlank(), or copying the code?
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848271#comment-17848271 ]

Andreas Lehmkühler commented on PDFBOX-5823:
--------------------------------------------

[~thumbox] we need to find another solution for 3.x as String.isBlank() isn't available in java8
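For the single-character check discussed in this thread (word.length() == 1 && word.isBlank()), one possible Java 8 compatible equivalent is sketched below. It relies on String.isBlank() being defined in terms of Character.isWhitespace, so for a one-character token the two should agree. The class and method names are hypothetical and this is not necessarily what was committed to the 3.0 branch.

```java
// Hypothetical helper for illustration; compiles on Java 8.
final class SingleSpaceToken {
    private SingleSpaceToken() {
    }

    /** Returns true if word is exactly one whitespace character. */
    static boolean isSingleWhitespace(String word) {
        return word.length() == 1 && Character.isWhitespace(word.charAt(0));
    }

    public static void main(String[] args) {
        System.out.println(isSingleWhitespace(" "));
        System.out.println(isSingleWhitespace("word"));
    }
}
```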
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848268#comment-17848268 ]

ASF subversion and git services commented on PDFBOX-5823:
---------------------------------------------------------

Commit 1917858 from le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1917858 ]

PDFBOX-5823: simplify pattern matching to optimize memory consumption as proposed by Jonathan Prates
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847912#comment-17847912 ]

Jonathan Prates commented on PDFBOX-5823:
-----------------------------------------

[~lehmi] I tested it locally and indeed it is way better if \x0B can be ignored
{code:java}
word.length() == 1 && word.isBlank();
{code}
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847901#comment-17847901 ]

Andreas Lehmkühler commented on PDFBOX-5823:
--------------------------------------------

Those tokens either don't contain any of those chars or contain exactly one of them. That said, it might be a good idea to check only tokens with a length of 1 for "spaces".
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847855#comment-17847855 ]

Jonathan Prates commented on PDFBOX-5823:
-----------------------------------------

Sure, I mean, contains() is slower for big strings, but not for small ones. My suggestion is to use a set, in order to avoid memory allocation and resolve the lookup in O(1) time.

`var SPACES_SET = Set.of(" ", "\t", "\n", "\r", "\f", "\u000B");`

I've attached a simple benchmark: [^Main.java]
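The set-based idea could be sketched roughly as follows, assuming tokens are compared as whole strings (so only single-character whitespace tokens match). The class name SpaceSet is hypothetical. Note that Set.of requires Java 9 or later, so a Java 8 code base would need a HashSet instead, and that the vertical tab has to be written as "\u000B" because Java string literals have no \x escape.

```java
import java.util.Set;

// Hypothetical sketch of the set lookup; requires Java 9+ because of Set.of.
final class SpaceSet {
    // The six characters matched by the regex class \s, each as a one-character string.
    static final Set<String> SPACES = Set.of(" ", "\t", "\n", "\r", "\f", "\u000B");

    private SpaceSet() {
    }

    /** Returns true if the whole token is one of the six whitespace strings. */
    static boolean isSpaceToken(String token) {
        return SPACES.contains(token);
    }

    public static void main(String[] args) {
        System.out.println(isSpaceToken(" "));
        System.out.println(isSpaceToken("word"));
    }
}
```

Unlike the regex, this lookup does not detect whitespace inside a longer token; whether that matters depends on how the tokens are produced.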
[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation
[ https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847844#comment-17847844 ]

Tilman Hausherr commented on PDFBOX-5823:
-----------------------------------------

Isn't your solution slower? It would have to go through the whole string several times. Re memory, isn't this cleaned in garbage collection if new memory is needed?
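On the point about walking the string several times: one call to contains() per whitespace character does scan the string repeatedly, but the same check can be written as a single pass. The sketch below is only an illustration under that assumption; the class name SinglePassScan is hypothetical, and the six characters are the ones listed for \s in the issue description.

```java
// Hypothetical single-pass alternative to chaining several contains() calls.
final class SinglePassScan {
    private SinglePassScan() {
    }

    /** Returns true if s contains any of the six characters matched by \s. */
    static boolean containsWhitespace(String s) {
        for (int i = 0; i < s.length(); i++) {
            switch (s.charAt(i)) {
                case ' ':
                case '\t':
                case '\n':
                case '\u000B':
                case '\f':
                case '\r':
                    return true; // stop at the first whitespace character
                default:
                    break;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(containsWhitespace("two words"));
        System.out.println(containsWhitespace("word"));
    }
}
```

Each character is inspected at most once and no objects are allocated, so the cost is linear in the token length.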