[ https://issues.apache.org/jira/browse/PDFBOX-1011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Neil McErlean updated PDFBOX-1011:
----------------------------------

    Description: 
I have a document whose Author metadata is "Jırg Boettger". The second character is not an 'i'; it is a dotless lower-case i.
The PDF is also encrypted, with user password = "".

The problem is that if I load the document, decrypt it and then examine the document-level metadata (such as the author), the non-ASCII characters come out wrong.
I will attach a testcase and the sample PDF that reproduces the problem for me.

A bit of detail that may be useful: COSObject 19 0 at the end of the PDF defines the Author. It is represented as
byte[] = {-95, -118, -50, 122, -127, 105, 53, 105, 50, 14, -27, 122, 120}

SecurityHandler.encryptData() (line 223), which decrypts the string, gives: J?rg Boettger.
bytes = [74, -102, 114, 103, 32, 66, 111, 101, 116, 116, 103, 101, 114]

Note the -102 in the second position. That byte is not a displayable ASCII character (even read as an unsigned value, 256 - 102 = 154), so it is simply dropped from the COSString and PDDocumentInformation returns an author of "Jrg Boettger".

I'm not sure what the requirements are for handling non-ASCII chars in this situation. But Adobe Reader 9, 10 and Mac OS X's Preview application all show the correct author value.

  was:
The same description without the final paragraph (the question about requirements and the note on Adobe Reader and Preview).

> Incorrect metadata for encrypted PDFs with non-ASCII characters
> ---------------------------------------------------------------
>
>                 Key: PDFBOX-1011
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1011
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 1.5.0
>         Environment: I'm on Mac OS X 10.6, but I would think it affects all.
>            Reporter: Neil McErlean
>            Priority: Minor
>         Attachments: MHH_Torque_convertor2.pdf, testcase.patch
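The byte arithmetic in the report can be reproduced with a small standalone Java sketch. It is not PDFBox code: the class name `AuthorBytesDemo` and both helper methods are hypothetical, and the mapping of code 154 (0x9A) to the dotless i (U+0131) is an assumption taken from the PDFDocEncoding table in the PDF specification. The sketch only contrasts dropping high-bit bytes, which yields the lossy author string, with reading each byte as unsigned before mapping it to a character.

```java
public class AuthorBytesDemo {
    // Decrypted Author bytes as quoted in the report (Java bytes are signed).
    static final byte[] AUTHOR = {74, -102, 114, 103, 32, 66, 111, 101, 116, 116, 103, 101, 114};

    // Assumption: in PDFDocEncoding, code 154 (0x9A) is the dotless i.
    static final int DOTLESS_I_CODE = 0x9A;

    /** Keeps only 7-bit bytes, mimicking the lossy result described in the report. */
    static String dropNonAscii(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (b >= 0) {              // a negative signed value means the high bit is set
                sb.append((char) b);
            }
        }
        return sb.toString();
    }

    /** Reads each byte as unsigned and special-cases the one code this example needs. */
    static String decodeUnsigned(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int code = b & 0xFF;       // -102 becomes 154
            sb.append(code == DOTLESS_I_CODE ? '\u0131' : (char) code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(AUTHOR[1] & 0xFF);
        System.out.println(dropNonAscii(AUTHOR));
        System.out.println(decodeUnsigned(AUTHOR));
    }
}
```

A real fix would presumably need the full PDFDocEncoding table (and UTF-16 detection for strings that start with a byte-order mark) rather than a single special case; the one-entry mapping here is only enough to illustrate this report.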