[jira] [Commented] (PDFBOX-2009) PDFStreamEngine.processEncodedText incorrectly handling UTF-16 text with BOM FEFF

JIRA Thu, 17 Apr 2014 07:39:28 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972987#comment-13972987
 ]


Andreas Lehmkühler commented on PDFBOX-2009:
--------------------------------------------

The given example doesn't contain a problematic string as described. Only the 
title string starts with a BOM, and it is decoded correctly by the PDFDebugger. 
The text of the PDF consists of a single line:

bq. (Dummy line) Tj

So, can you provide us with another sample PDF that shows exactly the described 
issue?

> PDFStreamEngine.processEncodedText incorrectly handling UTF-16 text with BOM 
> FEFF
> ---------------------------------------------------------------------------------
>
>                 Key: PDFBOX-2009
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2009
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Philip Helger
>             Fix For: 2.0.0
>
>         Attachments: test-properties.pdf
>
>
> When a text print operation like
> <FEFF21222193219103B103A003A6> Tj
> is encountered, PDFStreamEngine.processEncodedText does not handle it correctly.
> Am I correct that if a BOM was determined, the codelength should be set to 2 
> (and not be changed)? Or, alternatively, should the BOM simply be skipped?
> It may be related to PDFBOX-920.
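
For illustration, the two ideas in the description (fix the code length at 2 
once a BOM is seen, and skip the BOM itself) can be sketched in plain Java. 
This is not PDFBox code and does not reflect how processEncodedText is 
implemented; the class and method names (BomTextSketch, hexToBytes, 
startsWithBom, decode) are hypothetical. The single-byte fallback to 
ISO-8859-1 is also only an assumption made for the sketch, since in a real 
content stream the font's encoding decides how the bytes are interpreted.

```java
import java.nio.charset.StandardCharsets;

public class BomTextSketch {

    // Converts the body of a PDF hex string (without the angle brackets) to bytes.
    public static byte[] hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    // True if the data starts with the UTF-16 big-endian byte order mark FE FF.
    public static boolean startsWithBom(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xFF) == 0xFE
                && (data[1] & 0xFF) == 0xFF;
    }

    // Assumption for this sketch: with a BOM, skip it and read every following
    // code as two bytes (UTF-16BE); without a BOM, read one byte per code.
    public static String decode(byte[] data) {
        if (startsWithBom(data)) {
            return new String(data, 2, data.length - 2, StandardCharsets.UTF_16BE);
        }
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The hex string from the issue description.
        String text = decode(hexToBytes("FEFF21222193219103B103A003A6"));
        for (int i = 0; i < text.length(); i++) {
            System.out.printf("U+%04X%n", (int) text.charAt(i));
        }
    }
}
```

With the string from the description, the 14 bytes yield six two-byte codes 
after the BOM (2122, 2193, 2191, 03B1, 03A0, 03A6), so the BOM itself never 
ends up in the extracted text.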



--
This message was sent by Atlassian JIRA
(v6.2#6252)
