[jira] [Commented] (PDFBOX-1792) Different metadata extracted with NonSequentialPDFParser vs classic parser on some documents

2013-12-10 Thread Thomas Chojecki (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844070#comment-13844070
 ] 

Thomas Chojecki commented on PDFBOX-1792:
-

Some files cause parsing exceptions. At first I did not know whether my project 
was misconfigured. After checking the offsets at which the parser stops working, 
I saw that some files are broken. The first one has only garbage after %%EOF. I'm 
at work, so I can't give you much information about the exact files and 
stack traces.

Maybe we are not talking about the same test, or in your environment the test 
can't find any files? Can you check whether the file array contains at least one 
test file?

File dir = new File("src/test/resources/input");
for (File f : dir.listFiles()){
  if (f.getName().toLowerCase().endsWith(".pdf")){
    testSingleFileEquality(f);
  }
}
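
A minimal sketch of the check asked for above, assuming a hypothetical helper class PdfTestInputs (it is not part of PDFBox or its test suite). File.listFiles() returns null when the directory does not exist or cannot be read, so a wrong working directory would show up as a NullPointerException in the loop; the helper maps that case to an empty array so the number of files found can be inspected first.

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Locale;

// Hypothetical helper for illustration; not part of PDFBox.
final class PdfTestInputs {

    // True if the file name ends with ".pdf", ignoring case.
    static boolean isPdf(String name) {
        return name.toLowerCase(Locale.ROOT).endsWith(".pdf");
    }

    // Returns the PDF files directly inside dir, or an empty array if dir
    // is missing or unreadable (File.listFiles returns null in that case).
    static File[] listPdfFiles(File dir) {
        File[] files = dir.listFiles(new FilenameFilter() {
            @Override
            public boolean accept(File parent, String name) {
                return isPdf(name);
            }
        });
        return files != null ? files : new File[0];
    }
}
```

Checking PdfTestInputs.listPdfFiles(new File("src/test/resources/input")).length before the loop would show whether the test picked up any file at all.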

Additionally, I can't commit the three test files from the archive. See my mail 
on the dev mailing list.

> Different metadata extracted with NonSequentialPDFParser vs classic parser on 
> some documents
> 
>
>              Key: PDFBOX-1792
>              URL: https://issues.apache.org/jira/browse/PDFBOX-1792
>          Project: PDFBox
>       Issue Type: Bug
>       Components: PDModel
> Affects Versions: 1.8.3
>         Reporter: Tim Allison
>         Priority: Minor
>      Attachments: PDFBOX-1792.tar.gz, testPDF_acroForm2.pdf
>
>
> The traditional parser is able to extract metadata from a test document from 
> TIKA-738.  The NonSequentialPDFParser is not able to extract metadata from 
> that file.  Another file from the Tika test suite has metadata that can be 
> extracted by the NonSequentialPDFParser but not by classic. 



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (PDFBOX-1803) StringIndexOutOfBound on DateConverter.toCalendar

2013-12-10 Thread Eric Leleu (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844085#comment-13844085
 ] 

Eric Leleu commented on PDFBOX-1803:


Hi,

IMHO returning a "Dummy Calendar" if the parameter is null may be quite 
confusing, but if the JavaDoc describes all cases it could be sufficient.
Why do you want to avoid throwing an exception in case of an illegal argument?

BR,
Eric


> StringIndexOutOfBound on DateConverter.toCalendar
> -
>
>              Key: PDFBOX-1803
>              URL: https://issues.apache.org/jira/browse/PDFBOX-1803
>          Project: PDFBox
>       Issue Type: Bug
>       Components: PDModel, Utilities
> Affects Versions: 1.8.3
>         Reporter: Eric Leleu
>         Priority: Minor
>      Attachments: PDFBOX-DateConverter-1.8-fred.patch, 
>                   PDFBOX-DateConverter-Trunk-fred.patch, 
>                   PDFBox-DateConverter-Br18.patch, 
>                   PDFBox-DateConverter-Trunk.patch
>
>
> Some PDFs have an empty string as CreationDate & ModDate in the Information 
> Dictionary.
> According to the PDF specification, these two elements are optional.
> My first fix was to test for null and the empty string in the 
> toCalendar(String, String[]) method and to return null if either 
> condition holds.
> But according to a test case (TestDateUtil), a NullPointerException is 
> expected on a null value of text. Can you explain why this behaviour has 
> been adopted?
> To fix this unexpected exception in my execution path, I have added a test 
> for the empty string in the deprecated method toCalendar(String). (Patch in 
> attachment.)
> I'm waiting for your comments before committing this patch (or replacing it 
> with my first implementation).
> BR,
> Eric





[jira] [Commented] (PDFBOX-1792) Different metadata extracted with NonSequentialPDFParser vs classic parser on some documents

2013-12-10 Thread JIRA

[ 
https://issues.apache.org/jira/browse/PDFBOX-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844104#comment-13844104
 ] 

Andreas Lehmkühler commented on PDFBOX-1792:


The test case you are talking about wasn't there in the first place. You added 
it when "disabling" it. Have a look at revision 1458423, before your check-in:

http://svn.apache.org/viewvc/pdfbox/branches/1.8/pdfbox/src/test/java/org/apache/pdfbox/pdmodel/TestPDDocumentInformation.java?revision=1458423&view=markup

The issue only exists in your local environment. Otherwise the Jenkins build 
would have failed, but it didn't.

IMO you should revert your changes, and once the issue with parsing the other 
PDF is solved, we should (re)add the test case and the sample PDF as 
well. But let's do that in the trunk first.






Re: Problem commiting txt resources to svn

2013-12-10 Thread Andreas Lehmkühler
Hi,

> On 10 December 2013 at 00:15, Thomas Chojecki wrote:
>
>
> Hello,
> has anyone had similar problems with committing text files? In my case I'm 
> using a Linux box and the file is UTF-16 LE encoded. I configured my 
> SVN client as described in the beginner's guide and added the content 
> of http://www.apache.org/dev/svn-eol-style.txt to the config.
>
> My svn client throws this shortened error:
> svn: E29: Kann »svn:eol-style« nicht setzen: Datei 
> ».../testAnnotations.pdf-sorted.txt« hat die MIME-Typ Eigenschaft 
> »binär«
>
> In English it should be something like this:
> svn: E29: File '.../testAnnotations.pdf-sorted.txt' has binary 
> mime type property
Hmm, sounds like this is supposed to be a text file but it looks like a binary.
Are you sure that the file doesn't contain any suspicious data? Maybe the BOM
is missing, so that it can't be detected as UTF-16LE encoded?
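
One way to check the BOM question, as a small sketch (the class name BomCheck is made up for illustration): a UTF-16LE file that carries a byte order mark starts with the two bytes 0xFF 0xFE.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
final class BomCheck {

    // True if the data starts with the UTF-16LE byte order mark FF FE.
    // Note: a UTF-32LE BOM (FF FE 00 00) starts with the same two bytes.
    static boolean hasUtf16LeBom(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xFE;
    }

    // Convenience wrapper that reads the whole file; fine for small test resources.
    static boolean hasUtf16LeBom(Path file) throws IOException {
        return hasUtf16LeBom(Files.readAllBytes(file));
    }
}
```

Whether Subversion relies on the BOM when deciding between text and binary is not confirmed here; the sketch only answers whether the mark is present.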

> Do I need to add svn:mime-type=text/plain or something else to the 
> config for *.txt files?
I'm not an expert, but give it a try. I don't expect any unwanted side effects.

> Best regards
> Thomas

BR
Andreas Lehmkühler


Re: Problem commiting txt resources to svn

2013-12-10 Thread Andreas Lehmkühler
Hi,

> On 10 December 2013 at 00:15, Thomas Chojecki wrote:
>
>
> Hello,
> has anyone had similar problems with committing text files? In my case I'm 
> using a Linux box and the file is UTF-16 LE encoded. I configured my 
> SVN client as described in the beginner's guide and added the content 
> of http://www.apache.org/dev/svn-eol-style.txt to the config.
>
> My svn client throws this shortened error:
> svn: E29: Kann »svn:eol-style« nicht setzen: Datei 
> ».../testAnnotations.pdf-sorted.txt« hat die MIME-Typ Eigenschaft 
> »binär«
>
> In English it should be something like this:
> svn: E29: File '.../testAnnotations.pdf-sorted.txt' has binary 
> mime type property
>
> Do I need to add svn:mime-type=text/plain or something else to the 
> config for *.txt files?
I found some hints on the internet that Subversion didn't support UTF-16 as text
from the beginning and treated such files as binary in older versions.
So the question is: which version are you using? Maybe a version bump
could help here.

> Best regards
> Thomas


BR
Andreas Lehmkühler


[jira] [Assigned] (PDFBOX-1790) NPE during PDTrueTypeFont.loadTTF() on Mac TrueType font lacking Windows-platformID CMAPEncodingEntry

2013-12-10 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/PDFBOX-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler reassigned PDFBOX-1790:
--

Assignee: Andreas Lehmkühler

> NPE during PDTrueTypeFont.loadTTF() on Mac TrueType font lacking 
> Windows-platformID CMAPEncodingEntry
> -
>
>              Key: PDFBOX-1790
>              URL: https://issues.apache.org/jira/browse/PDFBOX-1790
>          Project: PDFBox
>       Issue Type: Bug
> Affects Versions: 1.8.2
>      Environment: Mac 10.7 / Java 6
>         Reporter: Andrew Thomas
>         Assignee: Andreas Lehmkühler
>         Priority: Critical
>
> I'm attempting to embed a TrueType font using PDFBox, on the Mac, using 
> PDTrueTypeFont.loadTTF( PDDocument, InputStream, Encoding ).
> For TrueType fonts originating from Windows (e.g., Tahoma) this works.
> For TrueType fonts originating from the Mac (e.g., Apple Chancery), a 
> NullPointerException is thrown.
> java.lang.NullPointerException
> at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadDescriptorDictionary(PDTrueTypeFont.java:409)
> at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:201)
> at 
> org.apache.pdfbox.pdmodel.font.PDTrueTypeFont.loadTTF(PDTrueTypeFont.java:177)
> I stepped through the code in a debugger. The method 
> PDTrueTypeFont.loadDescriptorDictionary() loops through the cmap table for 
> the font, looking for a cmap with platform ID 3 (Windows), and sets the 
> variable unimap only if one is found. After that loop, the variable unimap is 
> dereferenced without checking for null.
> Some Mac TrueType fonts have platform IDs 0 (Unicode) and 1 (Mac), but not 3 
> (Windows).
> At the least, a null check seems required. But more desirable would be 
> support for Mac TrueType fonts.
> Am I missing something, or should I enter a bug?
> Example problem font:
> * Apple Chancery (Included with OS)
> Details:
> * PDFBox version: 1.8.2 [Have not yet tested with 1.8.3, which was released a 
> few days ago]
> * Platform: Mac
> * Java 6
> * Font platform IDs: 0, 1
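
The null check and fallback asked for in the report could look roughly like the sketch below. CmapEntry and CmapSelector are hypothetical stand-ins, not the FontBox classes, and the fallback order (Windows, then Unicode, then Macintosh) is an assumption made for illustration, not what PDFBox implements.

```java
import java.util.List;

// Hypothetical stand-in for a cmap subtable header; not the FontBox class.
final class CmapEntry {
    static final int PLATFORM_UNICODE = 0;
    static final int PLATFORM_MACINTOSH = 1;
    static final int PLATFORM_WINDOWS = 3;

    final int platformId;

    CmapEntry(int platformId) {
        this.platformId = platformId;
    }
}

// Hypothetical selector illustrating a fallback instead of a Windows-only lookup.
final class CmapSelector {

    // Assumed preference order: Windows first, then Unicode, then Macintosh.
    private static final int[] PREFERRED = {
        CmapEntry.PLATFORM_WINDOWS,
        CmapEntry.PLATFORM_UNICODE,
        CmapEntry.PLATFORM_MACINTOSH
    };

    // Returns the first entry matching the preference order, or null if none matches.
    static CmapEntry select(List<CmapEntry> entries) {
        for (int platform : PREFERRED) {
            for (CmapEntry entry : entries) {
                if (entry.platformId == platform) {
                    return entry;
                }
            }
        }
        return null;
    }
}
```

A caller would still have to handle a null result, for example by reporting an unsupported font instead of dereferencing the value.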





[jira] [Commented] (PDFBOX-1803) StringIndexOutOfBound on DateConverter.toCalendar

2013-12-10 Thread Fred Hansen (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844288#comment-13844288
 ] 

Fred Hansen commented on PDFBOX-1803:
-

Before rewriting DateConverter I reviewed all the JIRA reports about it. 
Several, such as PDFBOX-977 and PDFBOX-145, were IOExceptions. This was the 
documented response to an invalid date, but the programs had been unprepared to 
deal with it.  Some of my thinking was along these lines:
- programmers cut corners by failing to read documentation
- dates are important, but if one is invalid the user penalty should not be an 
exception that halts processing of the PDF
- an exception is a heavyweight response while a date format error is a 
lightweight failure
- how should a program handle a bad date? at best it can leave the date blank 
or put in a dummy value
So I decided on the dummy value scheme. The program can be written paying no 
attention to errors and not failing. The user may see the dummy value, but he 
or she is unlikely to confuse that with a "real" creation date. There are no 
exceptions to trip up the programmer or prematurely terminate execution. And 
yet the bad date can easily be detected and processed, if desired.
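
The dummy-value scheme can be sketched as follows. This is a sketch under assumptions, not the DateConverter code from the attached patches: LenientDates, INVALID and isInvalid are hypothetical names, and only the plain D:yyyyMMddHHmmss form (read as UTC) is handled.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Hypothetical illustration of a sentinel ("dummy") return value for bad dates.
final class LenientDates {

    private static final TimeZone UTC = TimeZone.getTimeZone("UTC");

    // Sentinel returned for any input that cannot be parsed; compared by identity.
    // Calendar is mutable, so callers must not modify this instance.
    static final Calendar INVALID = new GregorianCalendar(UTC);
    static {
        INVALID.setTimeInMillis(0L);
    }

    // Never throws: null, empty, or malformed input yields INVALID.
    static Calendar toCalendar(String text) {
        if (text == null || text.trim().isEmpty()) {
            return INVALID;
        }
        String s = text.trim();
        if (s.startsWith("D:")) {
            s = s.substring(2);
        }
        if (s.length() < 14) {
            return INVALID;
        }
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMddHHmmss");
        format.setLenient(false);
        format.setTimeZone(UTC);
        try {
            Calendar result = new GregorianCalendar(UTC);
            result.setTime(format.parse(s.substring(0, 14)));
            return result;
        } catch (ParseException e) {
            return INVALID;
        }
    }

    // Lets callers detect the bad date if they care to.
    static boolean isInvalid(Calendar calendar) {
        return calendar == INVALID;
    }
}
```

A program that ignores errors simply displays whatever toCalendar returns, while one that cares can test the result with isInvalid before using it.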






[jira] [Created] (PDFBOX-1807) TextToPDF strips leading spaces from input file

2013-12-10 Thread Mark Mitchell (JIRA)
Mark Mitchell created PDFBOX-1807:
-

         Summary: TextToPDF strips leading spaces from input file
             Key: PDFBOX-1807
             URL: https://issues.apache.org/jira/browse/PDFBOX-1807
         Project: PDFBox
      Issue Type: Bug
      Components: Utilities
Affects Versions: 1.8.3
     Environment: Win7 64 bit
        Reporter: Mark Mitchell
        Priority: Minor


When using the TextToPDF utility on a text file that has spaces at the start of 
lines for formatting purposes, the leading spaces on each line are stripped, 
causing the report in the PDF to no longer look like it did in the text file.  

Was this the intended result?  Is there a way to turn off the stripping of the 
spaces?  If not, can it be added?
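
As an illustration only (this is not a statement about how TextToPDF is implemented): String.trim() removes whitespace at both ends of a line, whereas a routine that wants to keep indentation would strip only the trailing end. Indent and stripTrailing are hypothetical names.

```java
// Hypothetical helper for illustration; not part of PDFBox.
final class Indent {

    // Removes trailing spaces and tabs but keeps leading indentation.
    static String stripTrailing(String line) {
        int end = line.length();
        while (end > 0
                && (line.charAt(end - 1) == ' ' || line.charAt(end - 1) == '\t')) {
            end--;
        }
        return line.substring(0, end);
    }
}
```

For a line such as "   Total  ", trim() yields "Total" while stripTrailing keeps the three leading spaces.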





[jira] [Commented] (PDFBOX-15) Add ability to extract comments

2013-12-10 Thread Roy van Kaathoven (JIRA)

[ 
https://issues.apache.org/jira/browse/PDFBOX-15?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13844547#comment-13844547
 ] 

Roy van Kaathoven commented on PDFBOX-15:
-

I need this feature to read comments from tagged PDFs, any news on when this 
will be merged or what prevents this from being merged?

> Add ability to extract comments
> ---
>
>         Key: PDFBOX-15
>         URL: https://issues.apache.org/jira/browse/PDFBOX-15
>     Project: PDFBox
>  Issue Type: New Feature
>  Components: PDModel
> Attachments: PDFBOX.patch, R5542565B_S01101R_6217168_original.PDF
>
>
> [imported from SourceForge]
> http://sourceforge.net/tracker/index.php?group_id=78314&atid=552835&aid=1010721
> Originally submitted by benlitchfield on 2004-08-17 06:24.
> Create command line app to extract comments from a 
> document.
> Ben





[jira] [Issue Comment Deleted] (PDFBOX-15) Add ability to extract comments

2013-12-10 Thread Roy van Kaathoven (JIRA)

 [ 
https://issues.apache.org/jira/browse/PDFBOX-15?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Roy van Kaathoven updated PDFBOX-15:


Comment: was deleted

(was: I need this feature to read comments from tagged PDFs, any news on when 
this will be merged or what prevents this from being merged?)



