[ 
https://issues.apache.org/jira/browse/PDFBOX-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Barrett updated PDFBOX-1515:
--------------------------------

    Attachment: The right to take risks.pdf

The exception can be reproduced with the attached PDF file, using straightforward code 
along the lines of:

        parser.parse();
        cosDoc = parser.getDocument();
        pdfStripper = new PDFTextStripper();
        pdDoc = new PDDocument(cosDoc);
        parsedText = pdfStripper.getText(pdDoc);

The exception occurs when getText is called.
                
> PDGraphicsState class receives null page argument leading to NullPointerException
> ---------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1515
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1515
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel, Utilities
>    Affects Versions: 1.7.1
>         Environment: all (os-x, ubuntu linux, win-32, win64)
>            Reporter: Tim Barrett
>            Priority: Critical
>         Attachments: The right to take risks.pdf
>
>
> Workaround: the PDGraphicsState constructor needs the changes reproduced below:
> public PDGraphicsState(PDRectangle page) {
>               /*
>                * TB - changes made here are a workaround which creates a default
>                * GeneralPath assigned to currentClippingPath if the constructor
>                * argument page is null. Probably a better remedy would be to ensure
>                * that the page argument is not null or use a dedicated constructor if
>                * page is null
>                */
>               if (page != null) {
>                       currentClippingPath = new GeneralPath(new Rectangle(page.createDimension()));
>                       if (page.getLowerLeftX() != 0 || page.getLowerLeftY() != 0) {
>                               // Compensate for offset
>                               this.currentTransformationMatrix = this.currentTransformationMatrix.multiply(
>                                               Matrix.getTranslatingInstance(-page.getLowerLeftX(), -page.getLowerLeftY()));
>                       }
>               } else {
>                       currentClippingPath = new GeneralPath();
>               }
>       }
> Also, as a side effect of the above workaround, I made the following change within
> PDFStreamEngine.processEncodedText:
> /*
>                * TB - needed to make change here, as we encounter here a knock-on
>                * effect of allowing null page arguments through in PDGraphicsState
>                * constructor which creates a default GeneralPath assigned to
>                * currentClippingPath. That workaround causes findMediaBox to return
>                * null, so in that case we assign default values to pageHeight and
>                * pageWidth here. Everything else seems to work as far as text
>                * extraction is concerned.
>                */
>               if (page.findMediaBox() != null) {
>                       pageHeight = page.findMediaBox().getHeight();
>                       pageWidth = page.findMediaBox().getWidth();
>               } else {
>                       pageHeight = 0;
>                       pageWidth = 0;
>               }
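
The description above suggests that a dedicated constructor for the null-page case might be a better remedy than a null check inside the existing one. A minimal sketch of that idea follows, using only java.awt types so it runs without PDFBox. The class ClipStateSketch, its methods, and the forPage factory are hypothetical names for illustration and are not part of the PDFBox API; the zero defaults for page width and height mirror the second workaround above.

```java
import java.awt.Dimension;
import java.awt.Rectangle;
import java.awt.geom.GeneralPath;

// Hypothetical stand-in for PDGraphicsState; none of these names are PDFBox API.
final class ClipStateSketch {
    private final GeneralPath currentClippingPath;
    private final float pageWidth;
    private final float pageHeight;

    // Dedicated constructor for the "no page" case: empty clipping path, zero size.
    ClipStateSketch() {
        this.currentClippingPath = new GeneralPath();
        this.pageWidth = 0f;
        this.pageHeight = 0f;
    }

    // Constructor for a known page size; fails fast on null instead of failing later.
    ClipStateSketch(Dimension pageSize) {
        if (pageSize == null) {
            throw new IllegalArgumentException("pageSize is null; use the no-argument constructor");
        }
        this.currentClippingPath = new GeneralPath(new Rectangle(pageSize));
        this.pageWidth = pageSize.width;
        this.pageHeight = pageSize.height;
    }

    // Callers that may or may not have a page choose the constructor here.
    static ClipStateSketch forPage(Dimension pageSizeOrNull) {
        return pageSizeOrNull == null
                ? new ClipStateSketch()
                : new ClipStateSketch(pageSizeOrNull);
    }

    GeneralPath getCurrentClippingPath() { return currentClippingPath; }
    float getPageWidth() { return pageWidth; }
    float getPageHeight() { return pageHeight; }

    public static void main(String[] args) {
        ClipStateSketch withPage = forPage(new Dimension(612, 792));
        ClipStateSketch withoutPage = forPage(null);
        System.out.println(withPage.getCurrentClippingPath().getBounds());
        System.out.println(withoutPage.getPageHeight());
    }
}
```

With this shape the decision about a missing page is made once, at construction, so later code such as the page size lookup does not need its own null check.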

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
