[ 
https://issues.apache.org/jira/browse/PDFBOX-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542558#comment-14542558
 ] 

Michele Balistreri commented on PDFBOX-2798:
--------------------------------------------

I do, but unfortunately I cannot share the file.

Basically there is a stream whose content is UTF-16 encoded text. Reading 
it with the ISO-8859-1 charset creates a string with a null before or after 
every character (depending on endianness), so "hello" would be read as 
"h\0e\0l\0l\0o\0". Printing this string does not reveal the issue, but any 
string manipulation, such as searching for something in the string, will 
not work. 
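
Since I cannot share the file, here is a minimal sketch that reproduces the 
effect with plain JDK classes. The class and method names are made up for 
illustration and do not come from PDFBox; the input is assumed to be 
big-endian UTF-16 without a BOM.

```java
import java.nio.charset.StandardCharsets;

public class Utf16AsLatin1Demo {
    // Encodes the text as UTF-16BE (no BOM) and then decodes those bytes
    // as ISO-8859-1, which is the mismatch described in this comment.
    static String misread(String text) {
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        return new String(utf16, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String s = misread("hello");
        System.out.println(s.length());          // 10, not 5
        System.out.println(s.contains("hello")); // false
        System.out.println(s.indexOf('\0'));     // 0 for big-endian input
    }
}
```

With little-endian input the nulls would follow each character instead of 
preceding it, but the search would fail in the same way.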

This patch simply checks whether the first two bytes are a byte order mark 
(BOM) and, if so, uses the matching UTF-16 encoding.
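
The idea of the BOM check can be sketched roughly as follows. This is not the 
code from the attached diff: the class name BomSniffer and the helper 
decode(byte[]) are hypothetical, and falling back to ISO-8859-1 when no BOM 
is present is an assumption based on the behaviour described above.

```java
import java.nio.charset.StandardCharsets;

public class BomSniffer {
    // Hypothetical helper: picks a charset from the first two bytes.
    static String decode(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                // FE FF marks big-endian UTF-16; skip the two BOM bytes.
                return new String(data, 2, data.length - 2, StandardCharsets.UTF_16BE);
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                // FF FE marks little-endian UTF-16; skip the two BOM bytes.
                return new String(data, 2, data.length - 2, StandardCharsets.UTF_16LE);
            }
        }
        // No BOM found: assumed fallback to the single-byte charset.
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // StandardCharsets.UTF_16 writes a big-endian BOM when encoding.
        byte[] withBom = "hello".getBytes(StandardCharsets.UTF_16);
        System.out.println(decode(withBom).contains("hello")); // true
    }
}
```

UTF-16 text without a BOM would still be misread by a check like this, which 
is why it only covers the "with BOM" case named in the issue title.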

> PDTextStream does not support UTF16 with BOM
> --------------------------------------------
>
>                 Key: PDFBOX-2798
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2798
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.8.10, 2.0.0
>            Reporter: Michele Balistreri
>         Attachments: PDTextStream-UTF16.diff
>
>
> The getAsString() method of PDTextStream is quite useful, but it does not 
> support UTF-16 text. I added a small check on the first two bytes to support 
> UTF-16 content. This is needed because two-byte encodings, unlike UTF-8, do 
> not degrade gracefully even for plain ASCII text, so the resulting string 
> is unusable.


