[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-06-26 Thread Jonathan Prates (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17860122#comment-17860122
 ] 

Jonathan Prates commented on PDFBOX-5823:
-

Hi [~lehmi], is there an estimated date for the 3.0.3 release to go live?

> StringUtil.PATTERN_SPACE memory optmisation
> ---
>
>              Key: PDFBOX-5823
>              URL: https://issues.apache.org/jira/browse/PDFBOX-5823
>          Project: PDFBox
>       Issue Type: Improvement
>       Components: PDModel
> Affects Versions: 3.0.3 PDFBox
>         Reporter: Jonathan Prates
>         Assignee: Andreas Lehmkühler
>         Priority: Minor
>          Fix For: 3.0.3 PDFBox, 4.0.0
>
> Attachments: Main-1.java, Main.java, Screenshot 2024-05-19 at 
> 22.39.10.png, Screenshot 2024-05-19 at 22.40.17.png, Screenshot 2024-05-21 at 
> 20.21.43.png
>
>
> PDAbstractContentStream uses the StringUtil.PATTERN_SPACE regexp to check 
> whether a word has a space in it 
> ([https://github.com/apache/pdfbox/blob/trunk/pdfbox/src/main/java/org/apache/pdfbox/pdmodel/PDAbstractContentStream.java#L1624]).
> For large documents (~800 pages) and small string sequences (like a regular 
> word), it causes a memory overhead (see attached) due to the several extra 
> allocations. I've replaced the regexp with word.contains checks for space and 
> \t, and since that is effectively an O(1) operation for short words and does 
> not require extra allocations, memory usage has been reduced.
> What would be the implications of replacing this block with contains()?
> Since \s is [ \t\n\x0B\f\r], I believe we can use a simplified version that 
> allocates less memory.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: dev-unsubscr...@pdfbox.apache.org
For additional commands, e-mail: dev-h...@pdfbox.apache.org



[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-23 Thread Jonathan Prates (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848978#comment-17848978
 ] 

Jonathan Prates commented on PDFBOX-5823:
-

[~lehmi] thanks! This alternative solves the memory issue.




[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-21 Thread Jira


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848464#comment-17848464
 ] 

Andreas Lehmkühler commented on PDFBOX-5823:


I didn't have to dig too deep to find out that I was wrong. Every use of the 
predicate function created a new Matcher object. I've followed [~msahyoun]'s 
proposal and replaced the predicate with a simplified version of 
StringUtils.isBlank from commons-lang.
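
For illustration, a minimal sketch of what such a simplified, allocation-free blank check could look like on Java 8. The class and method names are hypothetical, and this is not the code from the commit:

```java
// Hypothetical helper for illustration only; names are not taken from PDFBox.
final class BlankCheckSketch
{
    private BlankCheckSketch()
    {
    }

    // Returns true if s is null, empty, or consists only of whitespace
    // characters. The loop reads chars in place and creates no objects.
    static boolean isBlank(CharSequence s)
    {
        if (s == null)
        {
            return true;
        }
        for (int i = 0; i < s.length(); i++)
        {
            if (!Character.isWhitespace(s.charAt(i)))
            {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args)
    {
        System.out.println(isBlank(" \t"));
        System.out.println(isBlank("word"));
    }
}
```

Note that Character.isWhitespace() is not identical to the regexp class \s; for example, it also accepts some Unicode space separators, so the two checks can differ on unusual input.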




[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-21 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848463#comment-17848463
 ] 

ASF subversion and git services commented on PDFBOX-5823:
-

Commit 1917878 from le...@apache.org in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1917878 ]

PDFBOX-5823: replace Predicate to avoid creating new objects with every call




[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-21 Thread Jira


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848392#comment-17848392
 ] 

Andreas Lehmkühler commented on PDFBOX-5823:


Looks like I'm missing something. I'm going to have a deeper look.




[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-21 Thread Jonathan Prates (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848357#comment-17848357
 ] 

Jonathan Prates commented on PDFBOX-5823:
-

I've attached a profiler screenshot, and it seems the predicate (even when 
static and created only once) is not a good option. Could you compare on your 
side as well?




[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-21 Thread Jira


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848344#comment-17848344
 ] 

Andreas Lehmkühler commented on PDFBOX-5823:


[~thumbox] yes, but the matcher is static and created only once




[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-21 Thread Jonathan Prates (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848329#comment-17848329
 ] 

Jonathan Prates commented on PDFBOX-5823:
-

[~lehmi] I believe asPredicate() will instantiate a Matcher, which could cause 
the same high memory utilisation. 
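
To make the concern concrete, here is a small sketch. It assumes the pattern is compiled from "\\s" (the field and class names are hypothetical). In OpenJDK, Pattern.asPredicate() returns a lambda equivalent to s -> matcher(s).find(), so a static predicate avoids recreating the predicate itself but still allocates a Matcher for every tested string:

```java
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical names; the pattern "\\s" is an assumption for this sketch.
final class PredicateAllocationSketch
{
    static final Pattern PATTERN_SPACE = Pattern.compile("\\s");

    // Created once, but each test() call still runs matcher(s).find() inside.
    static final Predicate<String> HAS_SPACE = PATTERN_SPACE.asPredicate();

    private PredicateAllocationSketch()
    {
    }

    static boolean viaPredicate(String word)
    {
        return HAS_SPACE.test(word);
    }

    // The explicit form of the same check: one new Matcher per call.
    static boolean viaMatcher(String word)
    {
        return PATTERN_SPACE.matcher(word).find();
    }

    public static void main(String[] args)
    {
        System.out.println(viaPredicate("two words") == viaMatcher("two words"));
    }
}
```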




[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-21 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848320#comment-17848320
 ] 

ASF subversion and git services commented on PDFBOX-5823:
-

Commit 1917862 from le...@apache.org in branch 'pdfbox/branches/3.0'
[ https://svn.apache.org/r1917862 ]

PDFBOX-5823: simplify pattern matching to optimize memory consumption based on 
a proposal by Jonathan Prates




[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-21 Thread Jira


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848319#comment-17848319
 ] 

Andreas Lehmkühler commented on PDFBOX-5823:


Thanks for the proposals, but I've found another solution.




[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-21 Thread Jonathan Prates (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848277#comment-17848277
 ] 

Jonathan Prates commented on PDFBOX-5823:
-

Agreed, we could copy the StringUtils.isBlank() code 
[https://github.com/apache/commons-lang/blob/master/src/main/java/org/apache/commons/lang3/StringUtils.java#L3623C1-L3634C6]
 or use something like:
{code:java}
public boolean isBlank(String s)
{
    return s != null && s.chars().allMatch(Character::isWhitespace);
}{code}




[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-21 Thread Maruan Sahyoun (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848274#comment-17848274
 ] 

Maruan Sahyoun commented on PDFBOX-5823:


What about using Apache Commons Lang StringUtils.isBlank() or copying the code?




[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-21 Thread Jira


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848271#comment-17848271
 ] 

Andreas Lehmkühler commented on PDFBOX-5823:


[~thumbox] we need to find another solution for 3.x, as String.isBlank() isn't 
available in Java 8.




[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-21 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17848268#comment-17848268
 ] 

ASF subversion and git services commented on PDFBOX-5823:
-

Commit 1917858 from le...@apache.org in branch 'pdfbox/trunk'
[ https://svn.apache.org/r1917858 ]

PDFBOX-5823: simplify pattern matching to optimize memory consumption as 
proposed by Jonathan Prates




[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-20 Thread Jonathan Prates (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847912#comment-17847912
 ] 

Jonathan Prates commented on PDFBOX-5823:
-

[~lehmi] I tested it locally, and it is indeed much better if \x0B can be ignored:
{code:java}
word.length() == 1 && word.isBlank(); {code}




[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-20 Thread Jira


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847901#comment-17847901
 ] 

Andreas Lehmkühler commented on PDFBOX-5823:


Those tokens either don't contain any of those chars or contain exactly one of 
them. That said, it might be a good idea to check only tokens with a length of 
1 for "spaces".
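
A rough sketch of that idea, assuming the whitespace characters of interest are the six covered by \s. The class and method names are hypothetical, and the check works on Java 8 without allocating anything:

```java
// Hypothetical helper for illustration; not PDFBox code.
final class SingleCharSpaceSketch
{
    private SingleCharSpaceSketch()
    {
    }

    // True only for tokens that are exactly one of the six \s characters.
    static boolean isSingleSpaceToken(String token)
    {
        if (token == null || token.length() != 1)
        {
            return false;
        }
        switch (token.charAt(0))
        {
            case ' ':
            case '\t':
            case '\n':
            case '\u000B':
            case '\f':
            case '\r':
                return true;
            default:
                return false;
        }
    }

    public static void main(String[] args)
    {
        System.out.println(isSingleSpaceToken(" "));
        System.out.println(isSingleSpaceToken("word"));
    }
}
```

Longer tokens are rejected by the length test alone, so ordinary words never reach the character comparison.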




[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-20 Thread Jonathan Prates (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847855#comment-17847855
 ] 

Jonathan Prates commented on PDFBOX-5823:
-

Sure, I mean, contains() is slower for big strings, but not for small ones. My 
suggestion is to use a set, in order to avoid memory allocation and resolve the 
check in O(1) time.

 
`var SPACES_SET = Set.of(" ", "\t", "\n", "\r", "\f", "\u000B");`
 
Attached I've provided a simple benchmark:
 
[^Main.java]
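
As a sketch of how such a set could be used (this is not the attached benchmark; the class and method names are hypothetical, and Set.of requires Java 9 or later):

```java
import java.util.Set;

// Hypothetical illustration of the set-based lookup; not the attached Main.java.
final class SpaceSetSketch
{
    // The vertical tab is written as a Unicode escape because Java string
    // literals have no hexadecimal \x escape.
    static final Set<String> SPACES_SET = Set.of(" ", "\t", "\n", "\r", "\f", "\u000B");

    private SpaceSetSketch()
    {
    }

    // Set.of(...) rejects null in contains(), hence the explicit null guard.
    static boolean isSpaceToken(String token)
    {
        return token != null && SPACES_SET.contains(token);
    }

    public static void main(String[] args)
    {
        System.out.println(isSpaceToken("\t"));
        System.out.println(isSpaceToken("word"));
    }
}
```

This lookup only matches tokens that consist of exactly one whitespace character; a word that merely contains a space is not detected.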




[jira] [Commented] (PDFBOX-5823) StringUtil.PATTERN_SPACE memory optmisation

2024-05-20 Thread Tilman Hausherr (Jira)


[ 
https://issues.apache.org/jira/browse/PDFBOX-5823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847844#comment-17847844
 ] 

Tilman Hausherr commented on PDFBOX-5823:
-

Isn't your solution slower? It would have to go through the whole string 
several times. Regarding memory, isn't this cleaned up by garbage collection 
when new memory is needed?
