    [ https://issues.apache.org/jira/browse/PDFBOX-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Antoni Mylka updated PDFBOX-920:
--------------------------------

    Component/s: Text extraction

This is a text extraction issue. Added an appropriate "component" marking.

> PDFStreamEngine.processEncodedText fails on UTF-16 text
> -------------------------------------------------------
>
>                 Key: PDFBOX-920
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-920
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.3.1
>            Reporter: Antoni Mylka
>         Attachments: nullcharactername.patch
>
>
> I have a PDF document which yields gibberish text. When I debug it, I get to 
> PDFStreamEngine.processEncodedText. The method receives the following byte 
> array:
> [0, 47, 0, 82, 0, 82, 0, 78, 0, 3, 0, 68, 0, 87, 0, 3, 0, 87, 0, 75, 0, 72, 
> 0, 3, 0, -64, 0, 85, 0, 86, 0, 87, 0, 3, 0, 83, 0, 76, 0, 70, 0, 87, 0, 88, 
> 0, 85, 0, 72, 0, 3, 0, 68, 0, 69, 0, 82, 0, 89, 0, 72, 0, 17, 0, 3]
> This looks to me like some UTF-16 text, but the codes differ from what 
> you'd normally expect. I don't understand the encoding. In 1.2.1 this yielded 
> the correct output though ("Look at the picture above"). In 1.3.1 and the 
> current trunk it is converted to garbage. The culprit is here:
> codeLength = 1;
> String c = font.encode( string, i, codeLength );
> if( c == null && i+1<string.length)
> {
>       //maybe a multibyte encoding
>       codeLength++;
>       c = font.encode( string, i, codeLength );
> }
> So the code first tries to 'encode' a single byte as a character and, if 
> that returns null, tries again with two bytes. It starts with a 00 byte. In 
> 1.2.1 PDFont.encode would return null for it. The program would then try with 
> two bytes, getting a correct character on the second attempt.
> In the current trunk the font.encode method returns a space " " when 00 is 
> passed. This is clearly wrong, because afterwards the entire string is parsed 
> incorrectly. I tried to debug further and it seems to me that the problem is 
> in the Encoding class, in the getName method. It looks like this:
> public String getName( int code ) throws IOException
> {
>       String name = codeToName.get( code );
>       if( name == null )
>       {
>               //lets be forgiving for now
>               name = "space";
>       }
>       return name;
> }
> The crucial bit is the "lets be forgiving for now" part. If a code is unknown 
> in the encoding, the glyph name "space" is returned. In my case this 
> completely breaks the parsing of the file. 
> What was the rationale behind this behavior? Removing it fixed my problem and 
> didn't break anything. All unit tests of pdfbox pass. The regression tests of 
> my applications (based on the pdf extraction code from the Aperture 
> Framework) also pass. The "forgiving" part was added in PDFBOX-626, but 
> the issue description doesn't give any reason for it. If the "forgiveness" 
> is there for a good reason, I'd be grateful for advice on how to deal with 
> the problem. Otherwise please remove it.
> Unfortunately I can't share the problem file.
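
For illustration, the failure mode described above can be reproduced in a
small standalone sketch. This is not PDFBox code: the class
ForgivingLookupSketch, its lookup table, and the "forgiving" flag are
hypothetical, and the mappings 47 -> "L", 82 -> "o", 78 -> "k" are only
inferred from the first bytes of the array and the expected text "Look".
The loop mirrors the quoted snippet: try one byte, and fall back to two
bytes only when the one-byte lookup returns null.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the decoding loop from the report; not PDFBox code. */
final class ForgivingLookupSketch {

    // Hypothetical font table: only two-byte codes are defined.
    static final Map<Integer, String> TWO_BYTE_CODES = new HashMap<>();
    static {
        TWO_BYTE_CODES.put(47, "L");
        TWO_BYTE_CODES.put(82, "o");
        TWO_BYTE_CODES.put(78, "k");
    }

    /**
     * Stand-in for font.encode(). Returns null for an unknown code,
     * or " " instead when the "forgiving" behavior is switched on.
     */
    static String lookup(byte[] data, int offset, int length, boolean forgiving) {
        int code = 0;
        for (int k = 0; k < length; k++) {
            code = (code << 8) | (data[offset + k] & 0xFF);
        }
        // In this toy table no single-byte code is known.
        String text = (length == 2) ? TWO_BYTE_CODES.get(code) : null;
        if (text == null && forgiving) {
            return " ";
        }
        return text;
    }

    /** Same shape as the quoted loop: one byte first, then two bytes. */
    static String decode(byte[] data, boolean forgiving) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < data.length) {
            int codeLength = 1;
            String c = lookup(data, i, codeLength, forgiving);
            if (c == null && i + 1 < data.length) {
                // maybe a multibyte encoding
                codeLength++;
                c = lookup(data, i, codeLength, forgiving);
            }
            if (c != null) {
                out.append(c);
            }
            i += codeLength;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] data = {0, 47, 0, 82, 0, 82, 0, 78};
        System.out.println("strict:    [" + decode(data, false) + "]");
        System.out.println("forgiving: [" + decode(data, true) + "]");
    }
}
```

With the strict lookup the sample decodes to "Look". With the forgiving
lookup the one-byte attempt never returns null, so the two-byte fallback is
never reached and every byte becomes a space.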

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
