This is probably not much help, and it is not a solution either; it is more me rambling while looking for one. This is not an area I am familiar with, but I thought I would give it a shot, and right or wrong I have learned a little. It appears to me that the characters you mentioned are actually defined in the CMap file

    org/apache/pdfbox/resources/cmap/UniJIS-UCS2-H
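As far as I can tell this file is loaded from the class path, so one way to see which copy actually gets picked up is to ask the class loader for every location that provides it. This is only a sketch: CMapLocator and locate are names I made up, it uses only the standard ClassLoader.getResources call and no PDFBox API, and I am assuming it is run with the same class path as your application.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
public class CMapLocator {
    // Resource path taken from the message above.
    static final String CMAP = "org/apache/pdfbox/resources/cmap/UniJIS-UCS2-H";

    /** Returns every class path location that provides the named resource. */
    static List<URL> locate(String resourceName) {
        List<URL> found = new ArrayList<>();
        try {
            Enumeration<URL> urls =
                ClassLoader.getSystemClassLoader().getResources(resourceName);
            while (urls.hasMoreElements()) {
                found.add(urls.nextElement());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        List<URL> locations = locate(CMAP);
        if (locations.isEmpty()) {
            System.out.println("Not found on the class path: " + CMAP);
        }
        // More than one line here would suggest that two installations overlap.
        for (URL url : locations) {
            System.out.println(url);
        }
    }
}
```

If it prints more than one location, the first one listed is normally the copy that a plain getResource lookup would return.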
The CMap file is read as a resource. If your class path resolved the CMap file from a different directory, perhaps from an earlier installation that did not define these characters, that could cause the problem. I also wondered whether the character map is getting corrupted somehow, but I have no proof of this.

Let's start with the hex values of the two characters, "✠" (decimal 10016) and "Ⓔ" (decimal 9402):

     9402 = x24BA
    10016 = x2720

Here is a link to the description of CMap (character map) files:
http://www.adobe.com/devnet/font/pdfs/5099.CMapFiles.pdf

Here is a link to the ToUnicode Mapping File Tutorial:
http://www.adobe.com/devnet/acrobat/pdfs/5411.ToUnicode.pdf

Look in this file:

    org/apache/pdfbox/resources/cmap/UniJIS-UCS2-H

It should start like this:

    %!PS-Adobe-3.0 Resource-CMap
    %%DocumentNeededResources: ProcSet (CIDInit)
    %%IncludeResource: ProcSet (CIDInit)
    %%BeginResource: CMap (UniJIS-UCS2-H)

You will find that the character 24BA is not defined as an individual character mapping. It is covered by a character range mapping:

    100 begincidrange
    <24b6> <24cf> 10339

This means the range starts at 24b6, which maps to 10339 (x2863), and counts up from there:

    24b6 is 10339 (x2863)
    24b7 is 10340
    24b8 is 10341
    24b9 is 10342
    24ba is 10343 (x2867)   <- this is your character

This is the character you are looking for:

    %% 9402=24BA E-o 24ba CIRCLED LATIN CAPITAL LETTER E

However, if you look in the other Japanese character mapping file, org/apache/pdfbox/resources/cmap/Adobe-Japan1-UCS2, the character 24BA is explicitly defined as a character mapping. You will find a mapping for the CIRCLED LATIN CAPITAL LETTER E character:

    org\apache\pdfbox\resources\cmap\Adobe-Japan1-UCS2
    1 beginbfchar
    <24BA> <004F030A>
    endbfchar

It is also defined with a different mapping in this file:

    org\apache\pdfbox\resources\cmap\Adobe-CNS1-UCS2
    1 beginbfchar
    <24BA> <75F6>
    endbfchar

> On Wed, Jan 21, 2009 at 10:56 AM, Natraj Kadur wrote:
> > I am using PDFBox for one of the applications. I am extracting the
> > text from the PDF and generating the TOC entries. But I am facing one
> > problem: if the PDF contains these two characters, "✠" (&#10016;) and
> > "Ⓔ" (&#9402;), then processpage(PDPage, COSStream) throws an
> > IOException "Unknown encoding for 'UniJIS-UCS2-H'". Can you let us
> > know whether there is any way to overcome this problem?
>
> Unfortunately not. Unless someone else has a good answer, you'll
> probably need to look at the relevant source code in PDFBox to figure
> out what to do with this. If you do that, we'd be happy to apply any
> fix you may come up with.

I don't have a better answer than Jukka, but perhaps a hint on where to look for the solution. As far as I understand, there are several Unicode mappings defined in Resources/cmap. You have to check whether the two characters you mentioned above are part of the mapping table "UniJIS-UCS2-H". If not, the question will be: is there a problem with the mapping file or with the document-producing software?

HTH
Andreas
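
Coming back to the range entry <24b6> <24cf> 10339 in UniJIS-UCS2-H: the arithmetic behind it can be sketched in a few lines of Java. This is not how PDFBox parses the file; CidRangeSketch and toCid are hypothetical names, and the three numbers are simply the ones from that single range entry.

```java
// Hypothetical illustration of one cidrange entry, not PDFBox code.
public class CidRangeSketch {
    /**
     * Maps a code inside [rangeStart, rangeEnd] to its CID by offsetting
     * from firstCid. Returns -1 when the code lies outside the range.
     */
    static int toCid(int code, int rangeStart, int rangeEnd, int firstCid) {
        if (code < rangeStart || code > rangeEnd) {
            return -1;
        }
        return firstCid + (code - rangeStart);
    }

    public static void main(String[] args) {
        int code = Integer.parseInt("24BA", 16); // 9402, CIRCLED LATIN CAPITAL LETTER E
        int cid = toCid(code, 0x24B6, 0x24CF, 10339);
        // prints: x24BA (9402) -> CID 10343 (x2867)
        System.out.printf("x%04X (%d) -> CID %d (x%04X)%n", code, code, cid, cid);
    }
}
```

Note that x2720 (10016) falls outside this particular range, so toCid returns -1 for it here; whether another entry in the file covers it would have to be checked separately.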
