[
https://issues.apache.org/jira/browse/PDFBOX-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542946#comment-14542946
]
John Hewson edited comment on PDFBOX-2798 at 5/14/15 12:03 AM:
---------------------------------------------------------------
Ok, I decided to bite the bullet and replace PDTextStream with String in 2.0.
The current implementation of PDTextStream always gets the encoding wrong and
its misuse has led to bugs such as PDFBOX-2797 and other such buggy code which
I found in FDFJavaScript and AppearanceGenerator, in both cases because name
trees are not type safe. The PDTextStream abstraction isn't worth keeping
because its purpose is to represent "this is a string", a need better served
by, well, a String. Removing PDTextStream has also allowed the removal of a ton
of helper code, which shows just how expensive and confusing this abstraction
was.
This change mostly affects the AcroForm APIs, which have been moving toward
using Strings over COS objects anyway.
For 1.8 Michele Balistreri's approach looks good - obviously we have to be
careful not to introduce regressions into the stable branch.
was (Author: jahewson):
Ok, I decided to bite the bullet and replace PDTextStream with String in 2.0.
The current implementation of PDTextStream always gets the encoding wrong and
its misuse has led to bugs such as PDFBOX-2797 and other such buggy code which
I found in FDFJavaScript and AppearanceGenerator, in both cases because name
trees are not type safe. The PDTextStream abstraction isn't worth keeping
because its purpose is to represent "this is a string", a need better served
by, well, a String. Removing PDTextStream has also allowed the removal of a ton
of support code, which shows just how expensive and confusing this abstraction
was.
This change mostly affects the AcroForm APIs, which have been moving toward
using Strings over COS objects anyway.
For 1.8 Michele Balistreri's approach looks good - obviously we have to be
careful not to introduce regressions into the stable branch.
> PDTextStream does not support UTF16 with BOM
> --------------------------------------------
>
> Key: PDFBOX-2798
> URL: https://issues.apache.org/jira/browse/PDFBOX-2798
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 1.8.10, 2.0.0
> Reporter: Michele Balistreri
> Attachments: PDTextStream-UTF16.diff
>
>
> The getAsString() method of PDTextStream is quite useful, but it does not
> support UTF-16 text. I added a small check on the first two bytes to support
> UTF-16 content. This is needed because two-byte encodings, unlike UTF-8, do
> not degrade gracefully: even plain ASCII text comes out as an unusable
> string.
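
For illustration, the two-byte check described in the report could look roughly
like the sketch below. This is not the attached PDTextStream-UTF16.diff and not
PDFBox code: the class name TextStringSketch and the method decode are
hypothetical, ISO-8859-1 stands in for PDFDocEncoding as the fallback, and
accepting the little-endian BOM (FF FE) is an extra assumption beyond what the
report states.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: decodes the raw bytes of a text
// string by inspecting the first two bytes for a UTF-16 byte order mark.
final class TextStringSketch {

    private TextStringSketch() {
    }

    static String decode(byte[] bytes) {
        if (bytes.length >= 2) {
            int b0 = bytes[0] & 0xFF;
            int b1 = bytes[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                // Big-endian BOM: skip the two BOM bytes, decode the rest.
                return new String(bytes, 2, bytes.length - 2,
                        StandardCharsets.UTF_16BE);
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                // Little-endian BOM: accepted here as an assumption.
                return new String(bytes, 2, bytes.length - 2,
                        StandardCharsets.UTF_16LE);
            }
        }
        // No BOM: fall back to a single-byte charset. ISO-8859-1 is only a
        // stand-in for PDFDocEncoding in this sketch.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Without the BOM branch, the UTF-16 bytes for "Hi" would be decoded one byte per
character, leaving a NUL character in front of every letter, which is the
unusable result the report describes.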