Copy Page from one Document to another: Page Content Stream Linked to Original
Document
---------------------------------------------------------------------------------------
Key: PDFBOX-1093
URL: https://issues.apache.org/jira/browse/PDFBOX-1093
Project: PDFBox
Issue Type: Bug
Components: Text extraction
Affects Versions: 1.6.0
Reporter: [email protected]
When a page is taken from one document and added to another (via addPage or
importPage), the Content Stream of the page retains the scratchFile and
unFiltered/FilteredStreams from its original document. This means that the page
always remains connected to its original document and never becomes wholly a
part of its new document.
This causes the following problem:
- When searching for text within a large (800,000 page) PDF file, performance
  can potentially be improved by splitting the file into single pages for
  incremental text extraction, so that each page is searched individually
  rather than the entire document at once. To achieve this, a new document is
  created and a single page from the original PDF is added to it.
- When searching through these one-page documents, the scratchFile of the
  original PDF is still used, and it grows as the text from each page is
  extracted. This eventually leads to an out-of-memory condition, which appears
  as a "SEVERE Stop reading corrupt stream" exception from doDecode() as the
  write buffer attempts to expand to a size greater than the maximum heap size.
A workaround for this problem is to create a new document, add the page to it,
save the document, close it, and then load it again. Unfortunately, the
performance cost of this workaround is prohibitive.
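
The save-and-reload workaround can be sketched as a generic round trip. This is
only an illustration, not PDFBox code: PageDetach, Saver, Loader and roundTrip
are hypothetical names, and serializing through an in-memory buffer instead of
a file is an assumption. The PDFBox calls in the trailing comment
(PDDocument.save(OutputStream), PDDocument.load(InputStream)) are written as
they are understood to exist in the 1.x API and have not been run against
1.6.0.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper: serializes a document and parses it back, so that the
// returned copy shares no in-memory state with the document it came from.
final class PageDetach {

    // Writes a document to a stream (for PDFBox this would wrap save()).
    interface Saver<D> {
        void save(D doc, OutputStream out) throws Exception;
    }

    // Reads a document from a stream (for PDFBox this would wrap load()).
    interface Loader<D> {
        D load(InputStream in) throws Exception;
    }

    // Saves doc into a byte buffer and loads a fresh copy from those bytes.
    // The caller remains responsible for closing the original document.
    static <D> D roundTrip(D doc, Saver<D> saver, Loader<D> loader) throws Exception {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        saver.save(doc, buffer);
        return loader.load(new ByteArrayInputStream(buffer.toByteArray()));
    }

    private PageDetach() {
    }
}

// Assumed usage with PDFBox 1.x (untested, requires the PDFBox jar):
//
//   PDDocument single = new PDDocument();
//   single.addPage(page);                      // page taken from the large PDF
//   PDDocument detached = PageDetach.roundTrip(
//           single,
//           (d, out) -> d.save(out),
//           in -> PDDocument.load(in));
//   single.close();
//   // ... extract text from detached, then detached.close()
```

Going through memory avoids disk I/O, but each page still pays for a full save
and a full parse, so the performance cost described above likely remains.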