[ https://issues.apache.org/jira/browse/PDFBOX-5606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17725037#comment-17725037 ]
Joe Li commented on PDFBOX-5606:
--------------------------------

[~tilman] Below is the Java code. Please change the PDF file path to the actual location before running it. Thanks!
{code:java}
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;

import java.awt.geom.Rectangle2D;
import java.io.File;

public class App {
    private static final int GRID_WIDTH = 10;
    private static final int GRID_HEIGHT = 10;

    public static void main(String[] args) {
        try {
            PDDocument doc = PDDocument.load(new File("/590031dc-2131-4a00-a936-d1175b7b926c.pdf"));
            for (int pageIndex = 0; pageIndex < doc.getNumberOfPages(); pageIndex++) {
                PDPage page = doc.getPage(pageIndex);
                // Split the media box into a grid of GRID_WIDTH x GRID_HEIGHT cells.
                int rows = (int) Math.ceil(page.getMediaBox().getHeight() / GRID_HEIGHT);
                int columns = (int) Math.ceil(page.getMediaBox().getWidth() / GRID_WIDTH);
                System.out.println("processing page " + (pageIndex + 1) + ", rows = " + rows + ", columns = " + columns);
                for (int rowIndex = 0; rowIndex < rows; rowIndex++) {
                    for (int colIndex = 0; colIndex < columns; colIndex++) {
                        // A new stripper is created and run for every single grid cell.
                        PDFTextStripperByArea textStripper = new PDFTextStripperByArea();
                        textStripper.addRegion(Integer.toString(rowIndex * columns + colIndex),
                                new Rectangle2D.Float(colIndex * GRID_WIDTH, rowIndex * GRID_HEIGHT, GRID_WIDTH, GRID_HEIGHT));
                        textStripper.extractRegions(page);
                    }
                }
            }
            doc.close();
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}
{code}

> PDFTextStripper runs out of memory in 2.0.28 but not in 2.0.27 same code
> ------------------------------------------------------------------------
>
>                 Key: PDFBOX-5606
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-5606
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 2.0.28
>            Reporter: Joe Li
>            Priority: Major
>              Labels: memory-bug
>         Attachments: 590031dc-2131-4a00-a936-d1175b7b926c.pdf, pdfbox-2.0.27.png, pdfbox-2.0.28.png
>
>
> Given the following simplified Groovy code (for succinctness over Java)
>
> {code:java}
> // Groovy 4.0.12
> import org.apache.pdfbox.pdmodel.PDDocument
> import org.apache.pdfbox.pdmodel.PDPage
> import org.apache.pdfbox.text.PDFTextStripperByArea
> import java.awt.geom.Rectangle2D
>
> int GRID_WIDTH = 10
> int GRID_HEIGHT = 10
>
> PDDocument.load(new File('./test.pdf')).withCloseable { doc ->
>     doc.pages.eachWithIndex { PDPage page, int pageIndex ->
>         int rows = Math.ceil((page.mediaBox.height as int) / GRID_HEIGHT)
>         int columns = Math.ceil((page.mediaBox.width as int) / GRID_WIDTH)
>         println "processing page $pageIndex, rows = $rows, columns = $columns"
>         def rectangles = [:]
>         (0..<rows).each { rowIndex ->
>             (0..<columns).each { colIndex ->
>                 rectangles["${rowIndex * columns + colIndex}"] = new Rectangle2D.Float(colIndex * GRID_WIDTH, rowIndex * GRID_HEIGHT, GRID_WIDTH, GRID_HEIGHT)
>             }
>         }
>         rectangles.each { key, rect ->
>             PDFTextStripperByArea textStripper = new PDFTextStripperByArea()
>             textStripper.addRegion(key, rect)
>             textStripper.extractRegions(page)
>         }
>     }
> }{code}
>
>
> PDFBox version 2.0.28 uses ever-increasing memory, but version 2.0.27 does not.
> The test.pdf file I am using can be downloaded from Apple's SEC filings page (the `8-K` from [https://investor.apple.com/sec-filings/default.aspx]), but any PDF with 10+ pages and a lot of text will work.
> I have attached profiler screenshots of the difference.
> Thanks in advance for your help.
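
For a sense of scale, the grid arithmetic used by the reproducer can be sketched on its own, without PDFBox. This is only an illustration: it assumes a US Letter media box of 612 x 792 points (the page size of the attached PDF is not stated in the thread), and the class name GridCount and its helper methods are hypothetical.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

public class GridCount {

    // Number of grid cells needed to cover the given extent, rounded up,
    // mirroring the Math.ceil(...) expressions in the reproducer.
    public static int cells(float extent, int cellSize) {
        return (int) Math.ceil(extent / cellSize);
    }

    // Builds the same rectangles the reproducer passes to addRegion(),
    // in row-major order.
    public static List<Rectangle2D> regions(float pageWidth, float pageHeight,
                                            int gridWidth, int gridHeight) {
        int rows = cells(pageHeight, gridHeight);
        int columns = cells(pageWidth, gridWidth);
        List<Rectangle2D> result = new ArrayList<>(rows * columns);
        for (int rowIndex = 0; rowIndex < rows; rowIndex++) {
            for (int colIndex = 0; colIndex < columns; colIndex++) {
                result.add(new Rectangle2D.Float(colIndex * gridWidth, rowIndex * gridHeight,
                        gridWidth, gridHeight));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumption: US Letter page, 612 x 792 points; 10 x 10 grid as in the reproducer.
        List<Rectangle2D> regions = regions(612f, 792f, 10, 10);
        System.out.println("rows = " + cells(792f, 10) + ", columns = " + cells(612f, 10));
        System.out.println("regions per page = " + regions.size()); // 80 * 62 = 4960
    }
}
```

Under that page-size assumption, the reproducer would create 4960 PDFTextStripperByArea instances per page and call extractRegions(page) once for each of them, so any per-call memory growth is multiplied accordingly.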