Joe Li created PDFBOX-5606:
------------------------------
Summary: PDFTextStripper runs out of memory in 2.0.28 but not in
2.0.27 with the same code
Key: PDFBOX-5606
URL: https://issues.apache.org/jira/browse/PDFBOX-5606
Project: PDFBox
Issue Type: Bug
Affects Versions: 2.0.28
Reporter: Joe Li
Attachments: pdfbox-2.0.27.png, pdfbox-2.0.28.png
Given the following simplified Groovy code (for succinctness over Java)
{code:java}
// Groovy 4.0.12
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.pdmodel.PDPage
import org.apache.pdfbox.text.PDFTextStripperByArea
import java.awt.geom.Rectangle2D
int GRID_WIDTH = 10
int GRID_HEIGHT = 10
PDDocument.load(new File('./test.pdf')).withCloseable { doc ->
    doc.pages.eachWithIndex { PDPage page, int pageIndex ->
        int rows = Math.ceil((page.mediaBox.height as int) / GRID_HEIGHT)
        int columns = Math.ceil((page.mediaBox.width as int) / GRID_WIDTH)
        println "processing page $pageIndex, rows = $rows, columns = $columns"
        def rectangles = [:]
        (0..<rows).each { rowIndex ->
            (0..<columns).each { colIndex ->
                rectangles["${rowIndex * columns + colIndex}"] = new Rectangle2D.Float(
                    colIndex * GRID_WIDTH, rowIndex * GRID_HEIGHT, GRID_WIDTH, GRID_HEIGHT)
            }
        }
        rectangles.each { key, rect ->
            PDFTextStripperByArea textStripper = new PDFTextStripperByArea()
            textStripper.addRegion(key, rect)
            textStripper.extractRegions(page)
        }
    }
}{code}
PDFBox version 2.0.28 uses ever-increasing memory, but version 2.0.27 does not.
The test.pdf file I am using is an `8-K` that can be downloaded from the Apple
SEC filings page ([here|https://investor.apple.com/sec-filings/default.aspx]), but
any PDF with 10+ pages and a lot of text will work.
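To give a sense of scale: the grid in the snippet above produces several thousand regions per page, and each region gets its own PDFTextStripperByArea and its own extractRegions call. Below is a minimal Java sketch of the same grid arithmetic. The page size (US Letter, 612 x 792 points) and the names GridSize and buildGrid are assumptions for illustration only; they are not taken from test.pdf.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

class GridSize {
    static final int GRID_WIDTH = 10;
    static final int GRID_HEIGHT = 10;

    // Builds the same key -> rectangle map as the Groovy snippet,
    // for a page of the given size in points.
    static Map<String, Rectangle2D> buildGrid(float pageWidth, float pageHeight) {
        // Truncate to int first, then divide as a decimal and round up,
        // mirroring Math.ceil((height as int) / GRID_HEIGHT) in Groovy.
        int rows = (int) Math.ceil((int) pageHeight / (double) GRID_HEIGHT);
        int columns = (int) Math.ceil((int) pageWidth / (double) GRID_WIDTH);
        Map<String, Rectangle2D> rectangles = new LinkedHashMap<>();
        for (int rowIndex = 0; rowIndex < rows; rowIndex++) {
            for (int colIndex = 0; colIndex < columns; colIndex++) {
                rectangles.put(
                    String.valueOf(rowIndex * columns + colIndex),
                    new Rectangle2D.Float(colIndex * GRID_WIDTH, rowIndex * GRID_HEIGHT,
                                          GRID_WIDTH, GRID_HEIGHT));
            }
        }
        return rectangles;
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter, 612 x 792 points.
        Map<String, Rectangle2D> grid = buildGrid(612f, 792f);
        System.out.println("regions per page = " + grid.size());
    }
}
```

Under that assumed page size this gives 80 rows x 62 columns = 4960 regions, so a 10-page document would go through roughly 50,000 stripper instances.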
I have attached profiler screenshots of the difference.
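For anyone reproducing this without a profiler, here is a rough sketch of how heap use could be logged after each page using only java.lang.Runtime. HeapProbe and usedHeapMb are hypothetical names, and the reading is approximate because the gc() call is only a hint to the JVM.

```java
class HeapProbe {
    // Approximate used heap in MiB: currently allocated heap minus its free part.
    static long usedHeapMb() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // a hint only; the JVM may ignore it
        return (rt.totalMemory() - rt.freeMemory()) / (1024L * 1024L);
    }

    public static void main(String[] args) {
        // In the real script this would be called at the end of each page iteration.
        System.out.println("used heap (MiB) = " + usedHeapMb());
    }
}
```

Calling usedHeapMb() at the end of each eachWithIndex iteration should make the trend visible in the console: if the behaviour matches the screenshots, I would expect the number to keep climbing on 2.0.28 and stay roughly flat on 2.0.27.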
Thanks in advance for your help.