[ https://issues.apache.org/jira/browse/PDFBOX-2798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14542946#comment-14542946 ]
John Hewson edited comment on PDFBOX-2798 at 5/14/15 12:04 AM:
---------------------------------------------------------------

Ok, I decided to bite the bullet and replace PDTextStream with String in 2.0. The current implementation of PDTextStream always gets the encoding wrong, and its misuse has led to bugs such as PDFBOX-2797, as well as other buggy code which I found in FDFJavaScript and AppearanceGenerator, in both cases because name trees are not type safe. The PDTextStream abstraction isn't worth keeping because its purpose is to represent "this is a string", a need better served by, well, a String.

Removing PDTextStream has also allowed the removal of a ton of helper code, which shows just how expensive and confusing this abstraction was, especially because, as I mentioned above, PDTextStream isn't a "text stream" at all, but "either a text string or a text stream". This change mostly affects the AcroForm APIs, which have been moving toward using Strings over COS objects anyway.

For 1.8 Michele Balistreri's approach looks good - obviously we have to be careful not to introduce regressions into the stable branch.

> PDTextStream does not support UTF16 with BOM
> --------------------------------------------
>
>                 Key: PDFBOX-2798
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2798
>             Project: PDFBox
>          Issue Type: Bug
>    Affects Versions: 1.8.10, 2.0.0
>            Reporter: Michele Balistreri
>         Attachments: PDTextStream-UTF16.diff
>
>
> The getAsString() method from PDTextStream is quite useful, but it does not
> support UTF-16 text. I added a small check on the first two bytes to support
> UTF-16 content. This is needed because two-byte encodings do not degrade
> gracefully like UTF-8, even for plain ASCII text, and so the resulting string
> is unusable.
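
For readers following along, the "small check on the first two bytes" can be sketched roughly as below. This is not the attached PDTextStream-UTF16.diff and not PDFBox code: the class TextStringDecoder and its decode method are hypothetical names used only for illustration, ISO-8859-1 is an assumed stand-in for the non-UTF-16 fallback encoding, and accepting a little-endian BOM is a leniency assumed here rather than something the report asks for.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: chooses a charset from the leading bytes.
final class TextStringDecoder {
    private TextStringDecoder() {}

    static String decode(byte[] bytes) {
        if (bytes.length >= 2) {
            int b0 = bytes[0] & 0xFF;
            int b1 = bytes[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                // Big-endian BOM: skip the two marker bytes, decode the rest as UTF-16BE.
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16BE);
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                // Little-endian BOM: accepted here only as an assumed leniency.
                return new String(bytes, 2, bytes.length - 2, StandardCharsets.UTF_16LE);
            }
        }
        // Assumption: ISO-8859-1 stands in for the single-byte fallback encoding.
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }
}
```

Without the BOM branch, the bytes FE FF 00 48 00 69 would be read one byte per character, giving a string padded with stray NUL characters instead of "Hi", which is the "unusable" result the report describes for plain ASCII text.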