[
https://issues.apache.org/jira/browse/PDFBOX-2009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972987#comment-13972987
]
Andreas Lehmkühler commented on PDFBOX-2009:
--------------------------------------------
The given example doesn't contain a problematic string as described. Only the
title string starts with a BOM, and it is decoded correctly by the PDFDebugger.
The text of the PDF consists of a single line:
bq. (Dummy line) Tj
So, can you provide us with another sample PDF that shows exactly the described
issue?
> PDFStreamEngine.processEncodedText incorrectly handling UTF-16 text with BOM
> FEFF
> ---------------------------------------------------------------------------------
>
> Key: PDFBOX-2009
> URL: https://issues.apache.org/jira/browse/PDFBOX-2009
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.0
> Reporter: Philip Helger
> Fix For: 2.0.0
>
> Attachments: test-properties.pdf
>
>
> Given a text-showing operation like
> <FEFF21222193219103B103A003A6> Tj
> PDFStreamEngine.processEncodedText does not handle it correctly.
> Am I correct that if a BOM is detected, the code length should be set to 2
> (and not be changed afterwards)? Or, alternatively, should the BOM simply be
> skipped?
> It may be related to PDFBOX-920.
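For illustration only, here is a minimal sketch in plain Java of what the hex string from the description yields if the leading FEFF is treated as a BOM, skipped, and the remaining bytes are read as two-byte UTF-16BE code units. This is not PDFBox code: the class BomHexDemo and its methods are hypothetical names, and the ISO-8859-1 fallback for strings without a BOM is an assumption made just to keep the example complete. Whether processEncodedText should interpret a Tj operand this way at all is the open question in this issue.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of PDFBox.
final class BomHexDemo {

    /** Converts a hex string such as "FEFF2122" into the bytes it denotes. */
    static byte[] hexToBytes(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** True if the bytes start with the big-endian UTF-16 BOM FE FF. */
    static boolean hasUtf16BeBom(byte[] b) {
        return b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF;
    }

    /** Skips a leading BOM, if any, and decodes the rest with a fixed code length of 2. */
    static String decode(String hex) {
        byte[] bytes = hexToBytes(hex);
        if (hasUtf16BeBom(bytes)) {
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption for this sketch only: without a BOM, one byte per character.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = decode("FEFF21222193219103B103A003A6");
        // Print each decoded UTF-16 code unit as a code point label.
        for (int i = 0; i < text.length(); i++) {
            System.out.printf("U+%04X%n", (int) text.charAt(i));
        }
    }
}
```

Under these assumptions the 14 bytes decode to six characters (U+2122, U+2193, U+2191, U+03B1, U+03A0, U+03A6), and the BOM itself does not appear in the output.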