[ https://issues.apache.org/jira/browse/PDFBOX-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542558#comment-14542558 ]
Michele Balistreri commented on PDFBOX-2798:
--------------------------------------------

I do, but unfortunately I cannot share the file. Basically, there is a stream whose content is UTF-16 encoded text. Reading it with the ISO-8859-1 charset creates a string with a null after or before every character (depending on endianness), so "hello" would be read as "h\0e\0l\0l\0o\0". Printing this string does not reveal the issue, but any string manipulation (such as searching for something in the string) will not work. The patch simply checks whether the first two bytes are a BOM and, if so, uses the correct encoding.

> PDTextStream does not support UTF16 with BOM
> --------------------------------------------
>
>                 Key: PDFBOX-2798
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2798
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.8.10, 2.0.0
>            Reporter: Michele Balistreri
>         Attachments: PDTextStream-UTF16.diff
>
>
> The getAsString() method of PDTextStream is quite useful, but it does not
> support UTF-16 text. I added a small check on the first two bytes to support
> UTF-16 content. This is needed because two-byte encodings do not degrade
> gracefully like UTF-8, even for plain ASCII text, and so the resulting string
> is unusable.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
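For readers without access to the attachment, the BOM check described in the comment could look roughly like the sketch below. This is not the contents of PDTextStream-UTF16.diff, which is not reproduced in this message. The class name `BomAwareDecoder` and the method `decode` are hypothetical names chosen for illustration, and the fallback to ISO-8859-1 is taken from the behaviour described above rather than from the PDFBox source.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: illustrates the BOM check only.
final class BomAwareDecoder {
    private BomAwareDecoder() {
    }

    /**
     * Decodes the bytes as UTF-16 when they start with a byte order mark,
     * and falls back to ISO-8859-1 otherwise. The BOM itself is skipped so
     * that it does not appear in the returned string.
     */
    static String decode(byte[] bytes) {
        if (bytes.length >= 2) {
            int b0 = bytes[0] & 0xFF;
            int b1 = bytes[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                // FE FF marks big-endian UTF-16
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                // FF FE marks little-endian UTF-16
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
            }
        }
        // No BOM found: keep the single-byte decoding described in the comment
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Decoding the same BOM-prefixed bytes directly with ISO-8859-1 yields a string with interleaved null characters, so a call such as `contains("hello")` fails on it, whereas it succeeds on the string returned by `decode`.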