[ 
https://issues.apache.org/jira/browse/PDFBOX-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542946#comment-14542946
 ] 

John Hewson edited comment on PDFBOX-2798 at 5/14/15 12:03 AM:
---------------------------------------------------------------

Ok, I decided to bite the bullet and replace PDTextStream with String in 2.0. 
The current implementation of PDTextStream always gets the encoding wrong, and 
its misuse has led to bugs such as PDFBOX-2797 and to other buggy code which 
I found in FDFJavaScript and AppearanceGenerator, in both cases because name 
trees are not type safe. The PDTextStream abstraction isn't worth keeping 
because its purpose is to represent "this is a string", a need better served 
by, well, a String. Removing PDTextStream has also allowed the removal of a ton 
of helper code, which shows just how expensive and confusing this abstraction 
was.

This change mostly affects the AcroForm APIs, which have been moving toward 
using Strings over COS objects anyway.

For 1.8, Michele Balistreri's approach looks good, though obviously we have to 
be careful not to introduce regressions into the stable branch.


> PDTextStream does not support UTF16 with BOM
> --------------------------------------------
>
>                 Key: PDFBOX-2798
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2798
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.8.10, 2.0.0
>            Reporter: Michele Balistreri
>         Attachments: PDTextStream-UTF16.diff
>
>
> The getAsString() method of PDTextStream is quite useful, but it does not 
> support UTF-16 text. I added a small check on the first two bytes to support 
> UTF-16 content. This is needed because two-byte encodings do not degrade 
> gracefully like UTF-8, even for plain ASCII text, so the resulting string 
> is unusable.
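
The kind of check described in the issue can be sketched roughly as follows. 
This is only an illustration, not the attached patch: TextStringDecoder and 
decode() are hypothetical names that do not exist in PDFBox, and ISO-8859-1 is 
used as a stand-in for PDFDocEncoding, which differs from it for some code 
points. The sketch looks for the UTF-16 big-endian byte order mark (FE FF) in 
the first two bytes and falls back to a single-byte decoding otherwise.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: decodes the raw bytes of a text string.
final class TextStringDecoder {
    private TextStringDecoder() {
    }

    static String decode(byte[] bytes) {
        // A leading FE FF is the UTF-16 big-endian byte order mark.
        if (bytes.length >= 2 && (bytes[0] & 0xFF) == 0xFE && (bytes[1] & 0xFF) == 0xFF) {
            // Skip the two BOM bytes and decode the remainder as UTF-16BE.
            return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
        }
        // Assumption: ISO-8859-1 approximates PDFDocEncoding for this sketch.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Hi" encoded as UTF-16BE with a BOM.
        byte[] utf16 = {(byte) 0xFE, (byte) 0xFF, 0x00, 0x48, 0x00, 0x69};

        // With the BOM check the text comes back intact.
        System.out.println(decode(utf16));

        // Without it, every byte becomes its own character, including the
        // BOM bytes and the interleaved NULs, so the string has 6 characters.
        System.out.println(new String(utf16, StandardCharsets.ISO_8859_1).length());
    }
}
```

The second println shows why plain ASCII text is affected too: decoded as a 
single-byte encoding, each character is preceded by a NUL, so even "Hi" does 
not survive.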



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
