This is probably not much help, and it is not a solution either; it is more
me thinking out loud while looking for one.

This is not an area that I am familiar with, but I thought that I would give
it a shot; right or wrong, I have learned a little bit. It appears to me that
the characters you mentioned are actually defined in the CMap file
org/apache/pdfbox/resources/cmap/UniJIS-UCS2-H

The CMap file is read as a class path resource. If your class path resolved
the CMap file from a different location, perhaps from an earlier installation
whose copy did not define these characters, that could cause the problem.
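
One way to test the class path idea is to ask the class loader for every copy
of the resource it can see. The following is only a rough sketch: the class
name CMapLocator is made up for illustration, and the resource path is the one
quoted in this message, so adjust it if your PDFBox version lays things out
differently. Run it with the same class path as your application.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
public class CMapLocator {
    // Resource path as quoted in this message (an assumption for other versions).
    static final String CMAP = "org/apache/pdfbox/resources/cmap/UniJIS-UCS2-H";

    /** Returns every location on the class path that provides the named resource. */
    static List<URL> locate(String resource) throws IOException {
        List<URL> found = new ArrayList<>();
        Enumeration<URL> urls = CMapLocator.class.getClassLoader().getResources(resource);
        while (urls.hasMoreElements()) {
            found.add(urls.nextElement());
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        List<URL> found = locate(CMAP);
        if (found.isEmpty()) {
            System.out.println("CMap not found on the class path");
        }
        // More than one line here means several copies compete on the class path.
        for (URL url : found) {
            System.out.println(url);
        }
    }
}
```

If more than one URL is printed, an older copy may be shadowing the one you
expect.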

I was wondering if perhaps the character map is getting corrupted somehow, but 
I have no proof of this.

Let's start with the hex values of the two character references from the
message below, "&#10016;" (✠) and "&#9402;" (Ⓔ):

 9402 = x24BA
10016 = x2720
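
The conversion above can be reproduced with a few lines of Java. This is just
an illustrative sketch; the class and method names are my own and only decimal
references of the form "&#NNNN;" are handled.

```java
// Hypothetical helper for illustration only.
public class EntityToHex {
    /** Converts a decimal character reference such as "&#9402;" to its code point. */
    static int codePointOf(String entity) {
        if (!entity.startsWith("&#") || !entity.endsWith(";")) {
            throw new IllegalArgumentException("not a decimal character reference: " + entity);
        }
        return Integer.parseInt(entity.substring(2, entity.length() - 1));
    }

    /** Formats a code point as at least four upper-case hex digits. */
    static String hex(int codePoint) {
        return String.format("%04X", codePoint);
    }

    public static void main(String[] args) {
        for (String entity : new String[] {"&#9402;", "&#10016;"}) {
            int cp = codePointOf(entity);
            // Character.getName looks up the Unicode name known to the running JDK.
            System.out.println(entity + " = x" + hex(cp) + " " + Character.getName(cp));
        }
    }
}
```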

Below is a link to Adobe's description of CMap (character map) files:

http://www.adobe.com/devnet/font/pdfs/5099.CMapFiles.pdf

Here is a link to the ToUnicode Mapping File Tutorial:

http://www.adobe.com/devnet/acrobat/pdfs/5411.ToUnicode.pdf


Look in this file:  org/apache/pdfbox/resources/cmap/UniJIS-UCS2-H
It should start like this.

%!PS-Adobe-3.0 Resource-CMap
%%DocumentNeededResources: ProcSet (CIDInit)
%%IncludeResource: ProcSet (CIDInit)
%%BeginResource: CMap (UniJIS-UCS2-H)

You will find that the character 24BA is not defined as an individual
character mapping. It is covered by a character range mapping:

100 begincidrange

<24b6> <24cf> 10339

This means that the range starts at 10339 (x2863) for 24b6 and counts up by
one for each following code:

24b6 is 10339 or x2863
24b7 is 10340
24b8 is 10341
24b9 is 10342
24ba is 10343 or x2867   <- This is your character

This is the character you are looking for:

%% 9402=24BA E-o    24ba    CIRCLED LATIN CAPITAL LETTER E
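
The range arithmetic can be written out as a small sketch. The class and
method names are hypothetical; the numbers are the ones from the range entry
<24b6> <24cf> 10339 quoted above.

```java
// Hypothetical helper for illustration only.
public class CidRange {
    /**
     * Value for a code covered by a range entry "<low> <high> first".
     * Returns -1 if the code is outside the range.
     */
    static int cidFor(int code, int low, int high, int first) {
        if (code < low || code > high) {
            return -1;
        }
        // Each step above the low end of the range adds one to the first value.
        return first + (code - low);
    }

    public static void main(String[] args) {
        int cid = cidFor(0x24BA, 0x24B6, 0x24CF, 10339);
        System.out.println(cid + " = x" + Integer.toHexString(cid)); // 10343 = x2867
    }
}
```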

However, if you look in the other Japanese character mapping file, the code
24BA appears in an explicit character mapping. You will find this entry in:

org/apache/pdfbox/resources/cmap/Adobe-Japan1-UCS2
1 beginbfchar
<24BA> <004F030A>
endbfchar

The same code has a different mapping in this file:
org/apache/pdfbox/resources/cmap/Adobe-CNS1-UCS2
1 beginbfchar
<24BA> <75F6>
endbfchar
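
To check other codes against these files without reading them by eye, a rough
line matcher along these lines might help. It is a sketch, not a CMap parser:
the class name is made up, it only compares the first column(s) of each entry
with the code, and it does not interpret whether that column holds a Unicode
value or a CID, which depends on the CMap in question. The sample lines are
the entries quoted in this message.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
public class CMapScan {
    // Range entry: <low> <high> value
    private static final Pattern RANGE =
        Pattern.compile("<([0-9A-Fa-f]+)>\\s*<([0-9A-Fa-f]+)>\\s*(\\S+)");
    // Single entry: <code> value
    private static final Pattern SINGLE =
        Pattern.compile("<([0-9A-Fa-f]+)>\\s*(\\S+)");

    /** Returns the (trimmed) entry lines whose code or code range covers the given code. */
    static List<String> entriesCovering(List<String> lines, long code) {
        List<String> hits = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            Matcher range = RANGE.matcher(line);
            Matcher single = SINGLE.matcher(line);
            if (range.matches()) {
                long low = Long.parseLong(range.group(1), 16);
                long high = Long.parseLong(range.group(2), 16);
                if (low <= code && code <= high) {
                    hits.add(line);
                }
            } else if (single.matches()) {
                if (Long.parseLong(single.group(1), 16) == code) {
                    hits.add(line);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "100 begincidrange",
            "<24b6> <24cf> 10339",
            "1 beginbfchar",
            "<24BA> <004F030A>",
            "endbfchar");
        for (String hit : entriesCovering(sample, 0x24BA)) {
            System.out.println(hit);
        }
    }
}
```

To scan a real file, the lines could be read with
java.nio.file.Files.readAllLines and passed in unchanged.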



> On Wed, Jan 21, 2009 at 10:56 AM, Natraj Kadur wrote:
> > I am using the PDFBox for one of the application. What I am doing is I
> > am extracting the PDF text from the PDF and generating the TOC
> > entries. But I am facing one problem, that is, if the PDF contains
> > these two characters "&#10016;"(✠) and "&#9402;"(Ⓔ) then the
> > processpage(PDPage, COSStream) gives an IOException "Unknown encoding
> > for 'UniJIS-UCS2-H' ". Can you let us know is there any way as to
> > overcome this problem?
>
> Unfortunately not. Unless someone else has a good answer,
> you'll probably need to look at the relevant source code in
> PDFBox to figure out what to do with this. If you do that,
> we'd be happy to apply any fix you may come up with.
I don't have a better answer than Jukka, but perhaps a hint on where to look
for the solution.
As far as I understand, there are several Unicode mappings defined in
Resources/cmap. You have to check whether the two characters you mentioned
above are part of the mapping table "UniJIS-UCS2-H". If not, the question
will be: is the problem in the mapping file or in the software that produced
the document?

HTH
Andreas
