[jira] [Updated] (PDFBOX-1109) Data corruption related to scratch file use

John Hewson (JIRA) Fri, 10 Oct 2014 12:48:08 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


John Hewson updated PDFBOX-1109:
--------------------------------
    Fix Version/s: 2.0.0

> Data corruption related to scratch file use
> -------------------------------------------
>
>                 Key: PDFBOX-1109
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1109
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.8.7, 2.0.0
>            Reporter: Stefan Mücke
>            Assignee: Andreas Lehmkühler
>            Priority: Critical
>             Fix For: 2.0.0
>
>         Attachments: COSDocument.java, PagedMultiRandomAccessFile.java, 
> PagedMultiRandomAccessFileTest.java
>
>
> PDFBox uses a scratch file to reduce memory consumption. However, there is no 
> mechanism that prevents two PDStreams from writing to the scratch file at the 
> same time. When this happens, the resulting PDF contains garbage in some 
> streams. I have run into this problem several times (e.g. when writing to an 
> image stream while constructing a page).
> Reproducing the bug
> *******************
> One can easily reproduce the bug. Open file AddImageToPDF.java and move the 
> following line:
>     PDPageContentStream contentStream =
>         new PDPageContentStream(doc, page, true, true);
> immediately after the line in which the PDPage object is fetched:
>     PDPage page =
>         (PDPage)doc.getDocumentCatalog().getAllPages().get( 0 );
>         
> With this modification, one will still get a PDF file, but Acrobat Reader 
> will report that the image could not be processed. BTW, the files 
> AddImageToPDF.java and ImageToPDF.java are almost identical. One of them 
> should be deleted.
> Bug-Fix
> *******
> The problem can be solved by using a scratch file that is divided into pages 
> (e.g. of 4 KB). Each PDStream in the scratch file is then associated with a 
> list of pages. This list grows as more data is written to the stream.
> The bug fix requires only minimal changes to the existing code, which the 
> RandomAccess interface made easy.
> Here is what needs to be changed:
>     - Add the attached "PagedMultiRandomAccessFile.java" to the I/O package
>     - Change COSDocument.getScratchFile() to return a RandomAccess
>       instance provided by PagedMultiRandomAccessFile:
>       private PagedMultiRandomAccessFile scratchFile = null;
>       [...]
>       public COSDocument(File scratchDir) throws IOException {
>               tmpFile = File.createTempFile("pdfbox", "tmp", scratchDir);
>               scratchFile = new PagedMultiRandomAccessFile(
>                       new RandomAccessFile(tmpFile, "rw"));
>       }
>       public COSDocument(RandomAccess file) {
>               // scratchFile = file;
>               throw new RuntimeException("Not yet implemented."); //$NON-NLS-1$
>       }
>       
>       [...]
>       /**
>        * Returns a new scratch file.
>        *
>        * @return the newly created scratch file
>        */
>       public RandomAccess getScratchFile() {
>               return scratchFile.getNewRandomAcess();
>       }
> One of the COSDocument constructors takes a RandomAccess file. This 
> constructor is only called in a single location, namely, in method 
> PDFParser.parse(). I am not sure if the RandomAccess parameter provided here 
> is really a scratch file. Someone will have to decide what to do with this 
> one.
> The code has been thoroughly tested and has been used in the production of 
> several books without any problems.
> The code is attached, together with a JUnit test that was used to debug it. 
> I have added an Apache license header and adopted PDFBox's code style. Feel 
> free to make any desired changes.
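
The paging idea described under "Bug-Fix" can be illustrated with a minimal sketch. This is not the attached PagedMultiRandomAccessFile.java: the class and method names below (PagedScratchSketch, ScratchStream, newStream, readAll) are hypothetical, an in-memory list of byte arrays stands in for the scratch file on disk, and the code is single-threaded. It only shows how giving each stream its own list of pages keeps interleaved writes from overwriting each other.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a page-based scratch store (not thread-safe). */
public class PagedScratchSketch {
    private final int pageSize;
    // Stand-in for the scratch file: page i would live at file offset i * pageSize.
    private final List<byte[]> pages = new ArrayList<>();

    public PagedScratchSketch(int pageSize) {
        this.pageSize = pageSize;
    }

    /** Hands out an independent stream backed by the shared page store. */
    public ScratchStream newStream() {
        return new ScratchStream();
    }

    // Appends a fresh page to the shared store and returns its index.
    private int allocatePage() {
        pages.add(new byte[pageSize]);
        return pages.size() - 1;
    }

    /** One logical stream: an ordered list of page indexes plus a length. */
    public class ScratchStream {
        private final List<Integer> pageIndexes = new ArrayList<>();
        private long length = 0;

        public void write(byte[] data) {
            for (byte b : data) {
                int offset = (int) (length % pageSize);
                if (offset == 0) {
                    // Current page is full (or none exists yet): grow the list.
                    pageIndexes.add(allocatePage());
                }
                int page = pageIndexes.get(pageIndexes.size() - 1);
                pages.get(page)[offset] = b;
                length++;
            }
        }

        public byte[] readAll() {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            long remaining = length;
            for (int page : pageIndexes) {
                // The last page may be only partially filled.
                int n = (int) Math.min(pageSize, remaining);
                out.write(pages.get(page), 0, n);
                remaining -= n;
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) {
        // A tiny page size makes the interleaving visible; the report suggests e.g. 4 KB.
        PagedScratchSketch scratch = new PagedScratchSketch(4);
        ScratchStream content = scratch.newStream();
        ScratchStream image = scratch.newStream();

        content.write("page ".getBytes(StandardCharsets.US_ASCII));
        image.write("image".getBytes(StandardCharsets.US_ASCII)); // interleaved write
        content.write("content".getBytes(StandardCharsets.US_ASCII));

        System.out.println(new String(content.readAll(), StandardCharsets.US_ASCII));
        System.out.println(new String(image.readAll(), StandardCharsets.US_ASCII));
    }
}
```

In this sketch the two streams end up owning pages 0, 1, 4 and 2, 3 respectively, so each reads back exactly what was written to it even though the writes were interleaved. A file-backed version would replace the in-memory list with seeks to pageIndex * pageSize in the scratch file.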



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
