[jira] [Commented] (PDFBOX-1975) Improve TestImageIOUtils unit tests to check image resolution and compression

2014-03-17 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937518#comment-13937518
 ] 

Tilman Hausherr commented on PDFBOX-1975:
-

I added an error log output if writeImage() returns false in rev 1578259.

> Improve TestImageIOUtils unit tests to check image resolution and compression
> -
>
> Key: PDFBOX-1975
> URL: https://issues.apache.org/jira/browse/PDFBOX-1975
> Project: PDFBox
>  Issue Type: Task
>  Components: Utilities
>Affects Versions: 2.0.0
>Reporter: Tilman Hausherr
>Assignee: Tilman Hausherr
>Priority: Minor
>  Labels: imageio, test, tiff
> Fix For: 2.0.0
>
>
> Because of the problems with recent changes (see PDFBOX-1963), I will improve 
> the unit tests so that image resolution and compression are checked.
> I found out that JPEGs don't have a resolution, BMP had the wrong resolution. 
> The fault wasn't in the java TIFF writer as I thought before, it is in the 
> java PNG writer, which uses the PixelSize values wrongly, i.e. it interprets 
> them as "pixels per mm" instead of "mm per pixel" as per specification. The 
> JPEG writer throws an exception "JFIF APP0 must be first marker after SOI". 
> The BMP writer can set the resolution, but the BMP reader doesn't read it.
> (Some of this might be different depending on the version)
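
As a rough illustration of the unit mix-up described above, the following Java
sketch converts a resolution in DPI into the two competing values. The class
and method names are made up for this note and are not part of PDFBox or
ImageIO. The only outside fact assumed is that the standard javax.imageio
metadata format defines the PixelSize values as the size of one pixel in
millimeters.

```java
class PixelSizeSketch {
    static final double MM_PER_INCH = 25.4;

    // Value the specification asks for: millimeters covered by one pixel.
    static double mmPerPixel(double dpi) {
        return MM_PER_INCH / dpi;
    }

    // Value a writer expects if it reads the field as "pixels per mm".
    static double pixelsPerMm(double dpi) {
        return dpi / MM_PER_INCH;
    }

    public static void main(String[] args) {
        double dpi = 254;
        System.out.println("mm per pixel: " + mmPerPixel(dpi));
        System.out.println("pixels per mm: " + pixelsPerMm(dpi));
    }
}
```

The two values are reciprocals of each other, so a test that reads back the
resolution of a written file would catch a writer that uses the wrong one.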



--
This message was sent by Atlassian JIRA
(v6.2#6252)


Re: Visible signature image

2014-03-17 Thread Vakhtang koroghlishvili
I have just updated pdfbox and tested this feature. Everything works well.

On Sat, Mar 15, 2014 at 10:35 AM, Tilman Hausherr wrote:

> I believe that somebody mentioned somewhere that creating the signature
> image didn't work properly, but I just can't find out who it was. While
> working on a test for JPEGFactory (PDFBOX-1969) I noticed that
> JPEGFactory.createFromImage() was temporarily broken (now hopefully no
> more), and this method is only used by PDVisibleSigBuilder.
> createSignatureImage().
>
> I see now that this was created in PDFBOX-1766 by Thomas and Vakhtang -
> please test whether it still works.
>
> Tilman
>


[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts

2014-03-17 Thread Craig Strong (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13937894#comment-13937894
 ] 

Craig Strong commented on PDFBOX-1988:
--

Thank you John and Tilman.  That was very quick and effective work.

> PDFBox ExtractText issue of PDF with no embedded fonts
> --
>
> Key: PDFBOX-1988
> URL: https://issues.apache.org/jira/browse/PDFBOX-1988
> Project: PDFBox
>  Issue Type: Bug
>  Components: Rendering, Text extraction
>Affects Versions: 1.8.4
> Environment: Windows 7
> Also, PASE on IBM i
>Reporter: Craig Strong
>  Labels: patch
> Fix For: 1.8.5, 2.0.0
>
> Attachments: Test1.pdf
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I have been using PDFBox 1.8.4 to extract text from several different PDF 
> files fine.  I use the latest PDFBox app with ExtractText command line.  
> There is one PDF that PDFBox (and iText) fails to extract any text even 
> though I can extract the text with Adobe Reader and also pdftotext.exe part 
> of XPdf.  "java -jar pdfbox-app-1.8.4.jar ExtractText Test1.pdf Out.txt".  I 
> don't want to have to rely on using pdftotext.exe from a PC since this is 
> part of an automated application.  I think the error relates to an unknown 
> font type and having to use the few fonts installed in the jar file.  I tried 
> running the API classes and trying to force a font from a certain location 
> but I still got errors.  I thought I loaded the font with the loadTTF method 
> but I don't know if that did anything with the font.  I would really like to 
> have this working straight from the ExtractText class anyway.
> Here are the errors I am getting.  I tried this from both a Windows 7 PC and 
> our IBM i in the PASE environment but I get the same errors.  The section 
> starting processEncodedText and on repeats a few times so I just included the 
> first entries.
>  
> Mar 10, 2014 3:50:44 PM org.apache.pdfbox.pdmodel.font.PDFontFactory createFont
> WARNING: Substituting TrueType for unknown font subtype=
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> WARNING: java.lang.NullPointerException
> Throwable occurred: java.lang.NullPointerException
> at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:375)
> at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.ensureFontDescriptor(PDTrueTypeFont.java:221)
> at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:119)
> at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:121)
> at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:204)
> at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:604)
> at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:54)
> at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
> at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
> at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
> at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
> at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
> at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
> at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
> at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
> at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processEncodedText
> WARNING: java.lang.NullPointerException
> Throwable occurred: java.lang.NullPointerException
> at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
> at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
> at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)

[jira] [Created] (PDFBOX-1989) Save LZW and other encoded PDImageXObject resources

2014-03-17 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created PDFBOX-1989:
---

 Summary: Save LZW and other encoded PDImageXObject resources
 Key: PDFBOX-1989
 URL: https://issues.apache.org/jira/browse/PDFBOX-1989
 Project: PDFBox
  Issue Type: Improvement
  Components: PDModel
Affects Versions: 2.0.0
Reporter: Tilman Hausherr
Assignee: Tilman Hausherr
Priority: Minor
 Fix For: 2.0.0


The "logo" image of the file from PDFBOX-1147.png isn't extracted because 
PDImageXObject.getSuffix() returns null. Changing getSuffix() so that it 
returns "png" yields a correct file.

With some other images, e.g. the raw_image_demo.pdf file, getSuffix() throws an 
NPE when getPDStream().getFilters() returns null. This happens with images that 
are uncompressed. Returning "png" for this case also yields a correct image.
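
The behaviour described above can be sketched as a null-safe lookup. This is
only an illustration under stated assumptions: suffixFor() and its class are
hypothetical names, the method takes a plain list of filter names instead of a
PDStream, and the mapping of filter names to suffixes is an assumption for the
example, not a copy of the committed getSuffix().

```java
import java.util.Arrays;
import java.util.List;

class SuffixSketch {
    // Hypothetical helper: picks a file suffix from the stream's filter names.
    static String suffixFor(List<String> filters) {
        // Uncompressed images have no filter list at all; avoid the NPE.
        if (filters == null || filters.isEmpty()) {
            return "png";
        }
        if (filters.contains("DCTDecode")) {
            return "jpg";
        }
        if (filters.contains("JPXDecode")) {
            return "jpx";
        }
        if (filters.contains("CCITTFaxDecode")) {
            return "tiff";
        }
        // LZW, Flate and other lossless encodings: decode and save as PNG.
        return "png";
    }

    public static void main(String[] args) {
        System.out.println(suffixFor(null));
        System.out.println(suffixFor(Arrays.asList("LZWDecode")));
        System.out.println(suffixFor(Arrays.asList("DCTDecode")));
    }
}
```

The point of the sketch is that the fallback never returns null, so every
decodable image gets some usable suffix.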





[jira] [Resolved] (PDFBOX-1989) Save LZW and other encoded PDImageXObject resources

2014-03-17 Thread Tilman Hausherr (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tilman Hausherr resolved PDFBOX-1989.
-

Resolution: Fixed

Done in rev 1578481.



[jira] [Created] (PDFBOX-1990) Support creating PDF from lossless encoded images

2014-03-17 Thread Tilman Hausherr (JIRA)
Tilman Hausherr created PDFBOX-1990:
---

 Summary: Support creating PDF from lossless encoded images
 Key: PDFBOX-1990
 URL: https://issues.apache.org/jira/browse/PDFBOX-1990
 Project: PDFBox
  Issue Type: Improvement
Reporter: Tilman Hausherr
Priority: Minor


Currently we support the insertion of TIFF and JPEG into a PDF, but not PNG. We 
can pass a BufferedImage, but this one will be JPEG compressed which is not a 
good thing for graphics with sharp edges. I suggest that we support PNG as 
well. It is possible because the Flate Filter supports both directions.

My implementation (coming in a few minutes) is just an RGB based start that 
begs for improvement.
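
To illustrate why Flate is suitable for lossless image data, here is a minimal
round trip using the JDK's java.util.zip classes rather than PDFBox's own Flate
filter. The class name and the tiny 2x2 RGB sample are assumptions made for
this sketch; it is not the implementation that was committed.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class FlateRoundTrip {
    // Compress raw sample bytes with Deflate (zlib wrapper).
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Decompress; the result is byte-for-byte identical to the input of deflate().
    static byte[] inflate(byte[] packed) {
        Inflater inflater = new Inflater();
        inflater.setInput(packed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // 2x2 RGB image, 3 bytes per pixel: red, green, blue, white.
        byte[] rgb = {(byte) 255, 0, 0, 0, (byte) 255, 0, 0, 0, (byte) 255,
                      (byte) 255, (byte) 255, (byte) 255};
        System.out.println(Arrays.equals(rgb, inflate(deflate(rgb))));
    }
}
```

Unlike JPEG compression, the decoded samples match the originals exactly,
which is what matters for graphics with sharp edges.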





[jira] [Commented] (PDFBOX-1990) Support creating PDF from lossless encoded images

2014-03-17 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938188#comment-13938188
 ] 

Tilman Hausherr commented on PDFBOX-1990:
-

Done in rev 1578489 and 1578492 and 1578503. I also added a NullOutputStream.



[jira] [Comment Edited] (PDFBOX-1990) Support creating PDF from lossless encoded images

2014-03-17 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938188#comment-13938188
 ] 

Tilman Hausherr edited comment on PDFBOX-1990 at 3/17/14 6:43 PM:
--

Done in rev 1578489 and 1578492 and 1578503 and 1578505. I also added a 
NullOutputStream.




[jira] [Commented] (PDFBOX-1990) Support creating PDF from lossless encoded images

2014-03-17 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938192#comment-13938192
 ] 

Tilman Hausherr commented on PDFBOX-1990:
-

I also added gif, png and bmp to the ImageToPDF example in rev 1578507.



[jira] [Commented] (PDFBOX-1975) Improve TestImageIOUtils unit tests to check image resolution and compression

2014-03-17 Thread Tilman Hausherr (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938322#comment-13938322
 ] 

Tilman Hausherr commented on PDFBOX-1975:
-

I added a test to save PDImageXObject objects from PDF within TestImageIOUtils 
in rev 1578544.



PDFTextStripper.pageSeparator has no effect

2014-03-17 Thread Musall Maik
Hi,

I tried to use the parameter pageSeparator on PDFTextStripper and noticed that 
it has no effect. I checked the sources and discovered that in all versions up 
to the current trunk, the setting is simply not used anywhere.

The only method using a set separator is writePageSeperator(), which also 
includes a typo worth fixing, but this method isn’t called anywhere. It should 
probably be called in processPages(). However, and this is why I didn’t go 
ahead and submit a patch myself, what does happen is that the pageEnd marker is 
written, which is initialized to the value of pageSeparator. So if both get 
used, this will probably end up in the same marker emitted twice on each page 
break.

As a result, I’m unsure what to do about this and thought I’d leave it to the 
core team maintaining this, so I’m just reporting it here.

Regards
Maik
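
The situation reported above can be modelled with a toy class. This is not
PDFTextStripper code: ToyStripper and its members are hypothetical, and the
sketch only assumes what the message states, namely that the end-of-page
marker is copied from the separator once and the separator itself is never
written.

```java
class ToyStripper {
    private String pageSeparator = "\n";
    // Copied once at construction; later changes to pageSeparator are not seen.
    private String pageEnd = pageSeparator;

    void setPageSeparator(String separator) {
        this.pageSeparator = separator;
    }

    // Writes each page followed by pageEnd; pageSeparator is never consulted.
    String render(String... pages) {
        StringBuilder sb = new StringBuilder();
        for (String page : pages) {
            sb.append(page).append(pageEnd);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ToyStripper stripper = new ToyStripper();
        stripper.setPageSeparator("|");
        // The custom separator does not show up in the output.
        System.out.println(stripper.render("page1", "page2").contains("|"));
    }
}
```

Simply writing the separator in addition to pageEnd would emit two markers per
page break, which is the concern raised in the message.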



[jira] [Commented] (PDFBOX-1847) TSA Time Signature

2014-03-17 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938516#comment-13938516
 ] 

John Hewson commented on PDFBOX-1847:
-

[~v.koroghlishvili] Ok, I applied the changes discussed in revision 1578650. I 
made some significant changes to the patch so that the signing functionality 
can be moved into pdfbox proper, rather than being part of the examples. 
Currently the code remains part of the examples until we're sure it works. Can 
you test out the new code and see if signing is working as you expected?

*Technical Notes*
Revision 1578650 includes changes to various other files. 
COSStandardOutputStream assumed that the OutputStream was always a 
FileOutputStream, which is an unsafe assumption. In fact, output streams do 
not generally have a "position" at all, so I removed all code which broke that 
contract. COSWriter was treating its incremental update streams in a strange 
manner: it wanted the InputStream and OutputStream to be backed by the same 
underlying data, which is not generally possible. I therefore had to write new 
code to perform incremental writing in order not to break the Input/Output 
stream contract. This allows the incremental file to be written to a different 
stream from the one which was read. I also added some new loading and saving 
methods to PDDocument to make incremental updating easier, and to automatically 
keep track of File objects, when relevant.
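
One common way to know how many bytes have been written without assuming a
FileOutputStream is to count them in a wrapper. The sketch below shows that
general technique only; CountingOutputStream and positionAfterWriting() are
names invented for this note, and nothing here claims to match what revision
1578650 actually does.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

class CountingOutputStream extends FilterOutputStream {
    private long position;

    CountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        position++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        position += len;
    }

    // Number of bytes written through this wrapper so far.
    long getPosition() {
        return position;
    }

    // Convenience for the example: writes chunks to memory and reports the position.
    static long positionAfterWriting(byte[]... chunks) {
        try (CountingOutputStream cos = new CountingOutputStream(new ByteArrayOutputStream())) {
            for (byte[] chunk : chunks) {
                cos.write(chunk);
            }
            return cos.getPosition();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(positionAfterWriting(new byte[10], new byte[5]));
    }
}
```

Because the wrapper works with any OutputStream, the same writer code can
target a file, a socket or an in-memory buffer.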

> TSA Time Signature
> --
>
> Key: PDFBOX-1847
> URL: https://issues.apache.org/jira/browse/PDFBOX-1847
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Signing
>Affects Versions: 2.0.0
>Reporter: vakhtang koroghlishvili
>Assignee: John Hewson
> Fix For: 2.0.0
>
> Attachments: CreateSignature-updated.java.patch, 
> TSATimeSignature.patch, resultOfSigning.jpg
>
>
> When we signed a document, we used our local time. For more 
> security we can use a Time Stamp server. 
> "Trusted timestamping is the process of securely keeping track of the 
> creation and modification time of a document. Security here means that no one 
> — not even the owner of the document — should be able to change it once it 
> has been recorded provided that the timestamper's integrity is never 
> compromised."(wiki)





[jira] [Comment Edited] (PDFBOX-1847) TSA Time Signature

2014-03-17 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938516#comment-13938516
 ] 

John Hewson edited comment on PDFBOX-1847 at 3/17/14 10:55 PM:
---

[~v.koroghlishvili] Ok, I applied the changes discussed in revision 1578650. I 
made some significant changes to the patch so that the signing functionality 
can be moved into pdfbox proper, rather than being part of the examples. 
Currently the code remains part of the examples until we're sure it works. Can 
you test out the new code and see if signing is working as you expected?

I've added a command line flag to CreateSignature to allow passing a TSA server 
URL:

{code}
usage: java org.apache.pdfbox.examples.signature.CreateSignature 
  
options:
  -tsa sign timestamp using the given TSA server
{code}

*Technical Notes*
Revision 1578650 includes changes to various other files. 
COSStandardOutputStream assumed that the OutputStream was always a 
FileOutputStream, which is an unsafe assumption. In fact, output streams do 
not generally have a "position" at all, so I removed all code which broke that 
contract. COSWriter was treating its incremental update streams in a strange 
manner: it wanted the InputStream and OutputStream to be backed by the same 
underlying data, which is not generally possible. I therefore had to write new 
code to perform incremental writing in order not to break the Input/Output 
stream contract. This allows the incremental file to be written to a different 
stream from the one which was read. I also added some new loading and saving 
methods to PDDocument to make incremental updating easier, and to automatically 
keep track of File objects, when relevant.




[jira] [Commented] (PDFBOX-1983) Unable to add TIF images, CCITTFactory not working

2014-03-17 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938547#comment-13938547
 ] 

John Hewson commented on PDFBOX-1983:
-

Cool, it looks like PDMemoryStream is the weak link, it's not really doing what 
it says it is.

> Unable to add TIF images, CCITTFactory not working
> --
>
> Key: PDFBOX-1983
> URL: https://issues.apache.org/jira/browse/PDFBOX-1983
> Project: PDFBox
>  Issue Type: Bug
>  Components: PDModel
>Affects Versions: 2.0.0
>Reporter: Joel Kääpä
>Assignee: Tilman Hausherr
> Fix For: 2.0.0
>
> Attachments: G4.tif, huhu.pdf
>
>
> As used in the AddImageToPDF example, the following line generates an error 
> with a TIF image:
> PDImageXObject ximage = CCITTFactory.createFromRandomAccess(document, new 
> RandomAccessFile(new File(imagePath), "r"));
> java.io.IOException: Stream was not read
> at org.apache.pdfbox.cos.COSStream.getDecodeResult(COSStream.java:235)
> at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:80)
> at org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject.<init>(PDImageXObject.java:70)
> at org.apache.pdfbox.pdmodel.graphics.image.CCITTFactory.createFromRandomAccess(CCITTFactory.java:50)





[jira] [Commented] (PDFBOX-1987) Provide a PDF Lexer as a base for PDF parsing

2014-03-17 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938560#comment-13938560
 ] 

John Hewson commented on PDFBOX-1987:
-

{quote}
An area which I kept out is how to handle malformed tokens such as strings which 
have an unbalanced number of parentheses. 
{quote}

Do you have any sample PDF files with this problem?

> Provide a PDF Lexer as a base for PDF parsing
> -
>
> Key: PDFBOX-1987
> URL: https://issues.apache.org/jira/browse/PDFBOX-1987
> Project: PDFBox
>  Issue Type: Improvement
>  Components: Parsing
>Reporter: Maruan Sahyoun
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: src.zip
>
>
> In order to enhance the parsing process and as a foundation for a combination 
> of the different parsers a PDF lexer should be provided.





[jira] [Commented] (PDFBOX-1969) JPEGFactory bug

2014-03-17 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938571#comment-13938571
 ] 

John Hewson commented on PDFBOX-1969:
-

Ok, well if someone really wants support for JPEGs which use ARGB we can follow 
up on this, given that it has probably never worked (quite a bit of the 1.8 
image parsing code was like that).

> JPEGFactory bug
> ---
>
> Key: PDFBOX-1969
> URL: https://issues.apache.org/jira/browse/PDFBOX-1969
> Project: PDFBox
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Steven Burg
> Fix For: 2.0.0
>
>
> Attempted to run the RubberStampWithImage sample and received the following 
> errors:
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.pdfbox.pdmodel.graphics.image.JPEGFactory.createFromStream(JPEGFactory.java:72)
> at org.apache.pdfbox.examples.pdmodel.RubberStampWithImage.doIt(RubberStampWithImage.java:93)
> at org.apache.pdfbox.examples.pdmodel.RubberStampWithImage.main(RubberStampWithImage.java:185)
> This happens with any jpg I tested with.





[jira] [Commented] (PDFBOX-1969) JPEGFactory bug

2014-03-17 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938572#comment-13938572
 ] 

John Hewson commented on PDFBOX-1969:
-

Shall we close this issue?



[jira] [Commented] (PDFBOX-1594) Add support for AES256 Encryption

2014-03-17 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938576#comment-13938576
 ] 

John Hewson commented on PDFBOX-1594:
-

The problem is that this patch has been made against 1.8.4 rather than the 
trunk, and there are differences between the two. [~neon1] is it possible for 
you to make a new patch against the trunk?

> Add support for AES256 Encryption 
> --
>
> Key: PDFBOX-1594
> URL: https://issues.apache.org/jira/browse/PDFBOX-1594
> Project: PDFBox
>  Issue Type: Improvement
>Reporter: Maruan Sahyoun
> Fix For: 2.0.0
>
> Attachments: pdfbox-1.8.4-aes256.diff
>
>
> Adobe 9 added support for AES 256 encryption. Further information is 
> available at  
> http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/adobe_supplement_iso32000.pdf
>  (especially 3.5.1) or ISO 32000-2.





[jira] [Commented] (PDFBOX-1512) TextPositionComparator is not compatible with Java 7

2014-03-17 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938585#comment-13938585
 ] 

John Hewson commented on PDFBOX-1512:
-

Perhaps we should migrate away from using Collections.sort altogether and use 
some other sorting algorithm?

> TextPositionComparator is not compatible with Java 7
> 
>
> Key: PDFBOX-1512
> URL: https://issues.apache.org/jira/browse/PDFBOX-1512
> Project: PDFBox
>  Issue Type: Bug
>  Components: Text extraction
>Affects Versions: 1.7.1
> Environment: Java 7
>Reporter: Benjamin Papez
>Assignee: Andreas Lehmkühler
> Attachments: FOP-2252.pdf, TextPositionComparator.java, 
> WFI_PDFParser_TextPostionComparator.txt, immo-kurier_arsenal_93x62.pdf
>
>
> The TextPositionComparator causes the following exception running on Java 7: 
> Unexpected RuntimeException from 
> org.apache.tika.parser.ParserDecorator$1@9007fa2 Original cause: Comparison 
> method violates its general contract!
> I think the problem is with this check:
> if ( yDifference < .1 ||
> (pos2YBottom >= pos1YTop && pos2YBottom <= pos1YBottom) ||
> (pos1YBottom >= pos2YTop && pos1YBottom <= pos2YBottom))
> as it violates the contract requirement:
> The implementor must also ensure that the relation is transitive: 
> ((compare(x, y)>0) && (compare(y, z)>0)) implies compare(x, z)>0.
> Finally, the implementor must ensure that compare(x, y)==0 implies that 
> sgn(compare(x, z))==sgn(compare(y, z)) for all z.
> Java 7 now is strict and throws exceptions when the contract is violated.
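
The contract violation quoted above can be reproduced with plain numbers. The
sketch below is not the PDFBox comparator: FUZZY is a stand-in that treats
values closer than 0.1 as equal, similar in spirit to the "yDifference < .1"
check, and BUCKETED is just one possible way to get a consistent order,
offered as an assumption and not as the project's fix.

```java
import java.util.Comparator;

class ToleranceComparatorDemo {
    // Values within 0.1 of each other compare as equal: not transitive.
    static final Comparator<Double> FUZZY =
            (a, b) -> Math.abs(a - b) < 0.1 ? 0 : Double.compare(a, b);

    // Quantize to tenths first, then compare the buckets: a consistent order.
    static final Comparator<Double> BUCKETED =
            Comparator.comparingLong(v -> Math.round(v * 10));

    public static void main(String[] args) {
        double x = 0.0, y = 0.06, z = 0.12;
        // x equals y and y equals z, yet x is less than z.
        System.out.println(FUZZY.compare(x, y) + " " + FUZZY.compare(y, z)
                + " " + FUZZY.compare(x, z));
        System.out.println(BUCKETED.compare(x, y) + " " + BUCKETED.compare(y, z)
                + " " + BUCKETED.compare(x, z));
    }
}
```

With FUZZY, compare(x, y) == 0 but compare(x, z) and compare(y, z) have
different signs, which is exactly the case the Comparator contract forbids and
which the Java 7 sort may reject with "Comparison method violates its general
contract!". Bucketing changes which values count as equal, so it is a
trade-off and not a drop-in replacement.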





Re: Visible signature image

2014-03-17 Thread John Hewson
I just made a unit test for CreateSignature and I’ll add one for visible 
signatures soon.

-- John

On 17 Mar 2014, at 07:31, Vakhtang koroghlishvili wrote:

> I have just updated pdfbox and tested this feature. Everything works well.
> 
> On Sat, Mar 15, 2014 at 10:35 AM, Tilman Hausherr 
> wrote:
> 
>> I believe that somebody mentioned somewhere that creating the signature
>> image didn't work properly, but I just can't find out who it was. While
>> working on a test for JPEGFactory (PDFBOX-1969) I noticed that
>> JPEGFactory.createFromImage() was temporarily broken (now hopefully no
>> more), and this method is only used by PDVisibleSigBuilder.
>> createSignatureImage().
>> 
>> I see now that this was created in PDFBOX-1766 by Thomas and Vakhtang -
>> please test whether it still works.
>> 
>> Tilman
>> 



[jira] [Commented] (PDFBOX-1989) Save LZW and other encoded PDImageXObject resources

2014-03-17 Thread John Hewson (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938596#comment-13938596
 ] 

John Hewson commented on PDFBOX-1989:
-

+1



[jira] [Commented] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts

2014-03-17 Thread Craig Strong (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13938606#comment-13938606
 ] 

Craig Strong commented on PDFBOX-1988:
--

I tested the fix on the 2.0.0 build and it worked.  Thanks again.

> PDFBox ExtractText issue of PDF with no embedded fonts
> --
>
>                 Key: PDFBOX-1988
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1988
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Rendering, Text extraction
>    Affects Versions: 1.8.4
>         Environment: Windows 7
> Also, PASE on IBM i
>            Reporter: Craig Strong
>              Labels: patch
>             Fix For: 1.8.5, 2.0.0
>
>         Attachments: Test1.pdf
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I have been using PDFBox 1.8.4 to extract text from several different PDF
> files without problems, using the latest PDFBox app with the ExtractText
> command line. There is one PDF from which PDFBox (and iText) fails to
> extract any text, even though I can extract the text with Adobe Reader and
> with pdftotext.exe, part of Xpdf. The command is "java -jar
> pdfbox-app-1.8.4.jar ExtractText Test1.pdf Out.txt". I don't want to rely
> on running pdftotext.exe from a PC since this is part of an automated
> application. I think the error relates to an unknown font type and having
> to use the few fonts installed in the jar file. I tried running the API
> classes and forcing a font from a certain location, but I still got errors.
> I thought I loaded the font with the loadTTF method, but I don't know if
> that did anything with the font. I would really like to have this working
> straight from the ExtractText class anyway.
> Here are the errors I am getting. I tried this from both a Windows 7 PC and
> our IBM i in the PASE environment, but I get the same errors. The section
> starting at processEncodedText repeats a few times, so I included only the
> first entries.
>  
> Mar 10, 2014 3:50:44 PM org.apache.pdfbox.pdmodel.font.PDFontFactory createFont
> WARNING: Substituting TrueType for unknown font subtype=
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processOperator
> WARNING: java.lang.NullPointerException
> Throwable occurred: java.lang.NullPointerException
>     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:375)
>     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.ensureFontDescriptor(PDTrueTypeFont.java:221)
>     at org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.<init>(PDTrueTypeFont.java:119)
>     at org.apache.pdfbox.pdmodel.font.PDFontFactory.createFont(PDFontFactory.java:121)
>     at org.apache.pdfbox.pdmodel.PDResources.getFonts(PDResources.java:204)
>     at org.apache.pdfbox.util.PDFStreamEngine.getFonts(PDFStreamEngine.java:604)
>     at org.apache.pdfbox.util.operator.SetTextFont.process(SetTextFont.java:54)
>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:235)
>     at org.apache.pdfbox.util.PDFStreamEngine.processStream(PDFStreamEngine.java:215)
>     at org.apache.pdfbox.util.PDFTextStripper.processPage(PDFTextStripper.java:456)
>     at org.apache.pdfbox.util.PDFTextStripper.processPages(PDFTextStripper.java:381)
>     at org.apache.pdfbox.util.PDFTextStripper.writeText(PDFTextStripper.java:340)
>     at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:275)
>     at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
>     at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
> Mar 10, 2014 3:50:45 PM org.apache.pdfbox.util.PDFStreamEngine processEncodedText
> WARNING: java.lang.NullPointerException
> Throwable occurred: java.lang.NullPointerException
>     at org.apache.pdfbox.util.PDFStreamEngine.processEncodedText(PDFStreamEngine.java:355)
>     at org.apache.pdfbox.util.operator.ShowText.process(ShowText.java:45)
>     at org.apache.pdfbox.util.PDFStreamEngine.processOperator(PDFStreamEngine.java:554)
>     at org.apache.pdfbox.util.PDFStreamEngine.processSubStream(PDFStreamEngine.java:268)

[jira] [Closed] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts

2014-03-17 Thread Craig Strong (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Craig Strong closed PDFBOX-1988.



Closing the issue.


[jira] [Reopened] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts

2014-03-17 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson reopened PDFBOX-1988:
-


Reopening because we leave issues open until the version they were fixed in is 
released.


[jira] [Resolved] (PDFBOX-1988) PDFBox ExtractText issue of PDF with no embedded fonts

2014-03-17 Thread John Hewson (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Hewson resolved PDFBOX-1988.
-

Resolution: Fixed
