[jira] [Commented] (PDFBOX-5606) PDFTextStripper runs out of memory in 2.0.28 but not in 2.0.27 same code

Jira Mon, 22 May 2023 23:41:06 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725245#comment-17725245
 ]


Andreas Lehmkühler commented on PDFBOX-5606:
--------------------------------------------

The changes from PDFBOX-5534 are OK, but they reveal an issue with the content 
stream parser: it didn't close the underlying resources once reading was done. 
I've adopted the behaviour from the trunk, where the resources are closed once 
the last token is read. The resources are now also closed if a fatal error 
occurs. That part was missing in 3.0.0 too, so I've added it to the trunk.

I've also introduced a public method for closing the resources, so that anyone 
who uses the parser but doesn't read the stream to the end can close them 
explicitly.
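
The closing behaviour described above can be illustrated with a minimal, 
self-contained sketch. It is not the PDFBox parser: the class 
SelfClosingTokenReader, its methods, and the whitespace-based tokenizing are 
hypothetical names and simplifications chosen only to show the three cases 
(close after the last token, close on a fatal error, explicit close by the 
caller).

```java
import java.io.ByteArrayInputStream;
import java.io.Closeable;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical whitespace tokenizer; a stand-in for a content stream parser. */
final class SelfClosingTokenReader implements Closeable {
    private final InputStream source;
    private boolean closed;

    SelfClosingTokenReader(InputStream source) {
        this.source = source;
    }

    /** Returns the next token, or null once the input is exhausted or closed. */
    String nextToken() throws IOException {
        if (closed) {
            return null;
        }
        try {
            StringBuilder token = new StringBuilder();
            int b;
            while ((b = source.read()) != -1) {
                if (Character.isWhitespace(b)) {
                    if (token.length() > 0) {
                        return token.toString();
                    }
                } else {
                    token.append((char) b);
                }
            }
            // Case 1: end of input reached, so release the source right away.
            close();
            return token.length() > 0 ? token.toString() : null;
        } catch (IOException e) {
            // Case 2: fatal read error, release the source before rethrowing.
            try {
                close();
            } catch (IOException suppressed) {
                e.addSuppressed(suppressed);
            }
            throw e;
        }
    }

    boolean isClosed() {
        return closed;
    }

    /** Case 3: lets a caller that stops reading early release the source. */
    @Override
    public void close() throws IOException {
        if (!closed) {
            closed = true;
            source.close();
        }
    }
}

final class TokenReaderDemo {
    public static void main(String[] args) throws IOException {
        byte[] data = "q BT ET Q".getBytes(StandardCharsets.UTF_8);
        SelfClosingTokenReader reader =
                new SelfClosingTokenReader(new ByteArrayInputStream(data));
        // The caller only wants the first token, so it closes explicitly.
        System.out.println(reader.nextToken());
        reader.close();
        System.out.println(reader.isClosed());
    }
}
```

A caller that reads to the end never needs to call close(); one that stops 
early is expected to, for example via try-with-resources, since the sketch 
implements Closeable.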



> PDFTextStripper runs out of memory in 2.0.28 but not in 2.0.27 same code
> ------------------------------------------------------------------------
>
>                 Key: PDFBOX-5606
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5606
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.28
>            Reporter: Joe Li
>            Assignee: Andreas Lehmkühler
>            Priority: Major
>              Labels: memory-bug
>             Fix For: 2.0.29
>
>         Attachments: 590031dc-2131-4a00-a936-d1175b7b926c.pdf, 
> pdfbox-2.0.27.png, pdfbox-2.0.28.png, screenshot-1.png, screenshot-2.png
>
>
> Given the following simplified Groovy code (for succinctness over Java):
>  
> {code:java}
> // Groovy 4.0.12
> import org.apache.pdfbox.pdmodel.PDDocument
> import org.apache.pdfbox.pdmodel.PDPage
> import org.apache.pdfbox.text.PDFTextStripperByArea
> import java.awt.geom.Rectangle2D
> int GRID_WIDTH = 10
> int GRID_HEIGHT = 10
> PDDocument.load(new File('./test.pdf')).withCloseable { doc ->
>     doc.pages.eachWithIndex { PDPage page, int pageIndex ->
>         int rows = Math.ceil((page.mediaBox.height as int) /GRID_HEIGHT)
>         int columns = Math.ceil((page.mediaBox.width as int) /GRID_WIDTH)
>         println "processing page $pageIndex, rows = $rows, columns = $columns"
>         def rectangles = [:]
>         (0..<rows).each {rowIndex ->
>             (0..<columns).each { colIndex ->
>                 rectangles["${rowIndex * columns + colIndex}"] = new Rectangle2D.Float(colIndex * GRID_WIDTH, rowIndex * GRID_HEIGHT, GRID_WIDTH, GRID_HEIGHT)
>             }
>         }
>         rectangles.each { key, rect ->
>             PDFTextStripperByArea textStripper = new PDFTextStripperByArea()
>             textStripper.addRegion(key, rect)
>             textStripper.extractRegions(page)
>         }
>     }
> }{code}
>  
>  
> With PDFBox 2.0.28 memory usage grows continuously while this code runs; 
> with 2.0.27 it does not. 
> The test.pdf file I am using can be downloaded from Apple's SEC filings page 
> (an `8-K` from [https://investor.apple.com/sec-filings/default.aspx]), but 
> any PDF with 10+ pages and a lot of text will work. 
> I have attached profiler screenshots of the difference. 
> Thanks in advance for your help. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
