[ https://issues.apache.org/jira/browse/PDFBOX-920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Antoni Mylka updated PDFBOX-920: -------------------------------- Component/s: Text extraction This is a text extraction issue. Added an appropriate "component" marking. > PDFStreamEngine.processEncodedText fails on UTF-16 text > ------------------------------------------------------- > > Key: PDFBOX-920 > URL: https://issues.apache.org/jira/browse/PDFBOX-920 > Project: PDFBox > Issue Type: Bug > Components: Text extraction > Affects Versions: 1.3.1 > Reporter: Antoni Mylka > Attachments: nullcharactername.patch > > > I have a PDF document which yields gibberish text. When I debug it, I get to > the PDFStreamEngine.processEncodedText. The method gets a following byte > array: > [0, 47, 0, 82, 0, 82, 0, 78, 0, 3, 0, 68, 0, 87, 0, 3, 0, 87, 0, 75, 0, 72, > 0, 3, 0, -64, 0, 85, 0, 86, 0, 87, 0, 3, 0, 83, 0, 76, 0, 70, 0, 87, 0, 88, > 0, 85, 0, 72, 0, 3, 0, 68, 0, 69, 0, 82, 0, 89, 0, 72, 0, 17, 0, 3] > This looks to me like some UTF16 text, but the codes seem different than what > you'd normally expect. I don't understand the encoding. In 1.2.1 this yielded > the correct output though ("Look at the picture above"). In the 1.3.1 and the > current trunk this is converted to garbage. The culprit is here: > codeLength = 1; > String c = font.encode( string, i, codeLength ); > if( c == null && i+1<string.length) > { > //maybe a multibyte encoding > codeLength++; > c = font.encode( string, i, codeLength ); > } > So the code first tries to 'encode' a single byte as a character, and then > tries two bytes, three bytes etc. First it starts with a 00 byte. In 1.2.1 > the PDFont.encode would return null. The program would then try with two > bytes getting a correct character on the second attempt. > In the current trunk the font.encode method returns a space " " when 00 is > passed. This is clearly wrong, because afterwards the entire string is parsed > incorrectly. I tried to debug further and it seems to me that the problem is > in the Encoding class, in the getName method. It looks like this: > public String getName( int code ) throws IOException > { > String name = codeToName.get( code ); > if( name == null ) > { > //lets be forgiving for now > name = "space"; > } > return name; > } > The crucial bit is the "let's be forgiving for now". If a code is unknown in > the encoding, a space is returned. In my case this completely breaks the > parsing of a file. > What was the rationale behind this behavior? Removing it fixed my problem and > didn't break anything. All unit tests of pdfbox pass. The regression tests of > my applications (based on the pdf extraction code from the Aperture > Framework) also pass. The "forgiving" part has been added in PDFBOX-626, but > the issue description doesn't name any reasons for that. If the "forgiveness" > is there for a good reason, I'd be grateful for advice how to deal with the > problem. Otherwise please remove it. > Unfortunately I can't share the problem file. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.