[Bug 192398] Re: generate_fdf extracts fields in UTF-16 format
I should also add that acroread 7 (on Linux) exports at least ASCII-only text as plain ASCII (it may be PDFDocEncoding, but I didn't have any special characters in it), so we wouldn't be breaking compatibility by doing that. -- generate_fdf extracts fields in UTF-16 format https://bugs.launchpad.net/bugs/192398 You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
[Bug 192398] Re: generate_fdf extracts fields in UTF-16 format
** Changed in: pdftk (Ubuntu) Status: New => Confirmed
[Bug 192398] Re: generate_fdf extracts fields in UTF-16 format
** Changed in: pdftk (Debian) Status: Unknown => Confirmed
[Bug 192398] Re: generate_fdf extracts fields in UTF-16 format
I commented too soon. The list of encodings supported by Adobe's implementations is very short (p. 1025): in Acrobat 4.0 it consists only of Shift-JIS; in 5.0, only Shift-JIS, UHC, GBK, and BigFive. (The spec doesn't say what later versions accept.) I had assumed that PDFDocEncoding was something like UTF-8, but it's a single-byte superset of Latin-1, so converting to PDFDocEncoding by default will mangle any text that uses characters outside that set. There's also a note (p. 132) explaining that Unicode strings must be encoded as UTF-16BE and begin with a BOM, so that they can be unambiguously distinguished from PDFDocEncoding strings. Converting to UTF-8 would make the exported form data incompatible with at least some implementations.

The best solution I can think of here is to check whether the string can be re-encoded in PDFDocEncoding without losing any characters and, if it can't, to leave it in UTF-16. This would maintain backwards compatibility while making the output way, way more hand-editable.
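As a rough illustration of that proposal (not pdftk code), here is a minimal Python sketch. The function name simplify_fdf_string is hypothetical, and it takes the already-unescaped bytes of one string value. Python has no built-in PDFDocEncoding codec, so the sketch assumes a conservative subset: printable ASCII plus 0xA1-0xFF except 0xAD, which as far as I can tell is the range where PDFDocEncoding and Latin-1 agree. Anything outside that subset stays in UTF-16.

```python
# Assumption: code points where PDFDocEncoding and Latin-1 coincide.
# Printable ASCII plus 0xA1-0xFF, minus 0xAD. This is deliberately
# conservative; it is not the full PDFDocEncoding table.
_SAFE = set(range(0x20, 0x7F)) | (set(range(0xA1, 0x100)) - {0xAD})

BOM_UTF16_BE = b"\xfe\xff"


def simplify_fdf_string(raw: bytes) -> bytes:
    """Return a single-byte form of a UTF-16BE string if lossless, else raw."""
    if not raw.startswith(BOM_UTF16_BE):
        return raw  # no BOM: already treated as a single-byte string
    try:
        text = raw[len(BOM_UTF16_BE):].decode("utf-16-be")
    except UnicodeDecodeError:
        return raw  # malformed UTF-16: leave it alone
    if not all(ord(ch) in _SAFE for ch in text):
        return raw  # some character has no safe single-byte form
    encoded = text.encode("latin-1")
    if encoded.startswith(BOM_UTF16_BE):
        return raw  # the result would itself be mistaken for a BOM
    return encoded
```

For example, simplify_fdf_string(b"\xfe\xff\x00N\x00a\x00m\x00e") gives b"Name", while a string containing CJK characters comes back unchanged. A real implementation would still have to re-escape parentheses and backslashes when writing the value back into the FDF file.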
[Bug 192398] Re: generate_fdf extracts fields in UTF-16 format
Consulting the PDF Reference 1.6 ( http://partners.adobe.com/public/developer/en/pdf/PDFReference16.pdf ), there's an optional "Encoding" field (p. 674) in the FDF dictionary, which defines the handling of strings that don't begin with the BOM. It defaults to PDFDocEncoding, which seems sensible. To generate human-readable strings, the obvious approach would be to convert the strings to PDFDocEncoding when they're extracted.
[Bug 192398] Re: generate_fdf extracts fields in UTF-16 format
The following workaround will turn the fields in the generated FDF files into plain ASCII, assuming that they're convertible, by filtering out the BOMs and the embedded NULs. (ASCII text converted to UTF-16 looks exactly like the result of sticking a NUL before or after each character, depending on byte order.) I doubt it will work if the field names contain anything other than ASCII.

$ cat Project2.fdf | sed -e 's/\x00//g' | sed -e 's/\xFE\xFF//g' | less
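For anyone who would rather not depend on sed's handling of \x escapes, the same byte filter can be sketched in Python. The function name strip_utf16_padding is made up for illustration, and the same caveat applies: this is only safe when every field name and value is pure ASCII.

```python
def strip_utf16_padding(data: bytes) -> bytes:
    """Drop NUL bytes and UTF-16BE BOMs, mirroring the two sed filters.

    Only meaningful for ASCII-only strings; anything else gets mangled.
    """
    without_nuls = data.replace(b"\x00", b"")      # sed 's/\x00//g'
    return without_nuls.replace(b"\xfe\xff", b"")  # sed 's/\xFE\xFF//g'
```

It would be used on the raw file contents, e.g. strip_utf16_padding(open("Project2.fdf", "rb").read()).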