[ 
https://issues.apache.org/jira/browse/PDFBOX-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil McErlean updated PDFBOX-1011:
----------------------------------

    Description: 
I have a document which has Author metadata = "Jırg Boettger". That second 
character is not an 'i', it is a dotless lower case i.
It is also an encrypted pdf, with user password = "".

The problem is that if I load the document, decrypt it and then try to examine 
the document-level metadata (such as author) I see problems with the non-ASCII 
chars.
I will attach a testcase & the sample PDF that reproduces the problem for me.

A bit of detail that may be useful: COSObject 19 0 at the end of the PDF 
defines the Author. It is represented as byte[] = {-95, -118, -50, 122, -127, 
105, 53, 105, 50, 14, -27, 122, 120}
SecurityHandler.encryptData() (line 223), which decrypts the string, gives: 
J?rg Boettger. bytes = [74, -102, 114, 103, 32, 66, 111, 101, 116, 116, 103, 
101, 114]
Note the -102 in the second position.
The second character, whose byte value is -102, is not a displayable ASCII char 
(even as an unsigned value, 256 - 102 = 154), so it just gets dropped from the 
COSString and we get an author of "Jrg Boettger" from PDDocumentInformation.
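
To illustrate the byte arithmetic above, here is a minimal sketch (not PDFBox code) that decodes the decrypted bytes by hand. It assumes the string is in PDFDocEncoding, where, as far as I can tell from the encoding table in the PDF specification, code 154 (0x9A) is the dotless i (U+0131). The class AuthorDecodeSketch and its decode() helper are hypothetical names, and only the one non-ASCII mapping needed for this author value is handled.

```java
public class AuthorDecodeSketch {

    // Hypothetical helper, not part of PDFBox: decodes a byte string using
    // printable ASCII plus the single PDFDocEncoding entry relevant here.
    static String decode(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            // Java bytes are signed; masking yields the unsigned code 0..255,
            // so -102 becomes 154 (0x9A).
            int code = b & 0xFF;
            if (code == 0x9A) {
                // Assumption: PDFDocEncoding maps 0x9A to dotless i (U+0131).
                sb.append('\u0131');
            } else if (code >= 0x20 && code < 0x7F) {
                // Printable ASCII maps to itself.
                sb.append((char) code);
            } else {
                // Anything else is outside the scope of this sketch.
                sb.append('?');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The decrypted bytes quoted in this report.
        byte[] decrypted = {74, -102, 114, 103, 32, 66, 111, 101, 116, 116, 103, 101, 114};
        System.out.println(decode(decrypted));
    }
}
```

If that assumption holds, the byte is a valid character rather than garbage, which would explain why other viewers show the author correctly.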

I'm not sure what the requirements are for handling non-ASCII chars in this 
situation. But Adobe Reader 9, 10 & Mac OS X's Preview application all show the 
correct author value.

  was:
I have a document which has Author metadata = "Jırg Boettger". That second 
character is not an 'i', it is a dotless lower case i.
It is also an encrypted pdf, with user password = "".

The problem is that if I load the document, decrypt it and then try to examine 
the document-level metadata (such as author) I see problems with the non-ASCII 
chars.
I will attach a testcase & the sample PDF that reproduces the problem for me.

A bit of detail that may be useful: COSObject 19 0 at the end of the PDF 
defines the Author. It is represented as byte[] = {-95, -118, -50, 122, -127, 
105, 53, 105, 50, 14, -27, 122, 120}
SecurityHandler.encryptData() (line 223), which decrypts the string, gives: 
J?rg Boettger. bytes = [74, -102, 114, 103, 32, 66, 111, 101, 116, 116, 103, 
101, 114]
Note the -102 in the second position.
The second character, whose byte value is -102, is not a displayable ASCII char 
(even as an unsigned value, 256 - 102 = 154), so it just gets dropped from the 
COSString and we get an author of "Jrg Boettger" from PDDocumentInformation.


> Incorrect metadata for encrypted PDFs with non-ASCII characters
> ---------------------------------------------------------------
>
>                 Key: PDFBOX-1011
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1011
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.5.0
>         Environment: I'm on Mac OS X 10.6, but I would think it affects all.
>            Reporter: Neil McErlean
>            Priority: Minor
>         Attachments: MHH_Torque_convertor2.pdf, testcase.patch
>
>
> I have a document which has Author metadata = "Jırg Boettger". That second 
> character is not an 'i', it is a dotless lower case i.
> It is also an encrypted pdf, with user password = "".
> The problem is that if I load the document, decrypt it and then try to 
> examine the document-level metadata (such as author) I see problems with the 
> non-ASCII chars.
> I will attach a testcase & the sample PDF that reproduces the problem for me.
> A bit of detail that may be useful: COSObject 19 0 at the end of the PDF 
> defines the Author. It is represented as byte[] = {-95, -118, -50, 122, -127, 
> 105, 53, 105, 50, 14, -27, 122, 120}
> SecurityHandler.encryptData() (line 223), which decrypts the string, gives: 
> J?rg Boettger. bytes = [74, -102, 114, 103, 32, 66, 111, 101, 116, 116, 103, 
> 101, 114]
> Note the -102 in the second position.
> The second character, whose byte value is -102, is not a displayable ASCII 
> char (even as an unsigned value, 256 - 102 = 154), so it just gets dropped 
> from the COSString and we get an author of "Jrg Boettger" from 
> PDDocumentInformation.
> I'm not sure what the requirements are for handling non-ASCII chars in this 
> situation. But Adobe Reader 9, 10 & Mac OS X's Preview application all show 
> the correct author value.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
