[Bug 192398] Re: generate_fdf extracts fields in UTF-16 format

2008-03-07 Thread Adam Buchbinder
I should also add that acroread 7 (on Linux) exports at least ASCII-only
text as plain ASCII (it may be PDFDocEncoding, but I didn't have any
special characters in it), so we wouldn't be breaking compatibility by
doing that.

-- 
generate_fdf extracts fields in UTF-16 format
https://bugs.launchpad.net/bugs/192398
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs


[Bug 192398] Re: generate_fdf extracts fields in UTF-16 format

2008-02-17 Thread freemed
** Changed in: pdftk (Ubuntu)
   Status: New => Confirmed


[Bug 192398] Re: generate_fdf extracts fields in UTF-16 format

2008-02-17 Thread Bug Watch Updater
** Changed in: pdftk (Debian)
   Status: Unknown => Confirmed


[Bug 192398] Re: generate_fdf extracts fields in UTF-16 format

2008-02-16 Thread Adam Buchbinder
I commented too soon. The list of encodings supported by Adobe's
implementations is very short (p. 1025): in Acrobat 4.0, it consists
only of Shift-JIS; in 5.0, only Shift-JIS, UHC, GBK, and BigFive. (The
spec doesn't say what later versions accept.) I had assumed that
PDFDocEncoding was something like UTF-8, but it's a superset of Latin-1,
so converting to PDFDocEncoding by default will mangle any text that
uses characters outside that set. There's also a note (p. 132)
explaining that Unicode strings must be encoded as UTF-16BE and begin
with a BOM so that they can be unambiguously distinguished from
PDFDocEncoding strings. Converting to UTF-8 will make the exported form
data incompatible with at least some implementations.
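
To illustrate the BOM rule, here is a minimal Python sketch of how a
reader might tell the two kinds of string apart. The function name
decode_pdf_string is hypothetical, and Latin-1 is used only as a
stand-in for PDFDocEncoding (Python has no built-in codec for it), so
bytes where the two encodings differ would decode incorrectly.

```python
from codecs import BOM_UTF16_BE  # b"\xfe\xff"


def decode_pdf_string(raw: bytes) -> str:
    """Decode a PDF/FDF text string (hypothetical helper)."""
    if raw.startswith(BOM_UTF16_BE):
        # A leading FE FF marks the string as UTF-16BE.
        return raw[len(BOM_UTF16_BE):].decode("utf-16-be")
    # Assumption: Latin-1 approximates PDFDocEncoding for this sketch.
    return raw.decode("latin-1")
```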

The best solution I can think of here is to check whether the string
can be reencoded in PDFDocEncoding without losing any characters and,
if it can't, to leave it in UTF-16. This would maintain backwards
compatibility while making the output way, way more hand-editable.
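
A rough Python sketch of that fallback, not pdftk's actual code. The
names encode_field and _is_safe are hypothetical. Since Python has no
PDFDocEncoding codec, the sketch only accepts a conservative subset of
characters that I believe PDFDocEncoding and Latin-1 encode
identically (tab, newline, carriage return, printable ASCII, and
0xA1-0xFF except 0xAD); everything else stays in UTF-16BE with a BOM.

```python
from codecs import BOM_UTF16_BE  # b"\xfe\xff"


def _is_safe(ch: str) -> bool:
    """True if ch is in the assumed common subset of Latin-1 and PDFDocEncoding."""
    cp = ord(ch)
    return (
        ch in "\t\n\r"
        or 0x20 <= cp <= 0x7E
        or (0xA1 <= cp <= 0xFF and cp != 0xAD)
    )


def encode_field(text: str) -> bytes:
    """Prefer a single-byte, hand-editable form; otherwise keep UTF-16BE."""
    if all(_is_safe(ch) for ch in text):
        return text.encode("latin-1")
    return BOM_UTF16_BE + text.encode("utf-16-be")
```

With this sketch, encode_field("Name") gives plain bytes, while a
string containing CJK text keeps its BOM and UTF-16BE encoding.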


[Bug 192398] Re: generate_fdf extracts fields in UTF-16 format

2008-02-16 Thread Adam Buchbinder
Consulting the PDF Reference 1.6 (
http://partners.adobe.com/public/developer/en/pdf/PDFReference16.pdf ),
there's an optional "Encoding" field (p. 674) in the FDF dictionary,
which defines handling for strings which don't begin with the BOM. It
defaults to PDFDocEncoding, which seems sensible. To generate
human-readable strings, the obvious approach would be to convert the
strings to PDFDocEncoding when they're extracted.


[Bug 192398] Re: generate_fdf extracts fields in UTF-16 format

2008-02-16 Thread Adam Buchbinder
The following workaround will turn the fields in the generated FDF files
into plain ASCII, assuming that they're convertible, by filtering out
the BOMs and the embedded NUL bytes. (ASCII text converted to UTF-16
looks exactly like the result of sticking a NUL before or after
(depending on byte order) each character.)

I doubt it will work if the field names contain anything other than
ASCII.

$ LC_ALL=C sed -e 's/\xFE\xFF//g' -e 's/\x00//g' Project2.fdf | less
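
For anyone who would rather not rely on sed's handling of raw bytes,
the same filtering can be sketched in Python. The function name
strip_utf16_ascii is hypothetical, and the same caveat applies: it is
only safe when every UTF-16 string in the file is pure ASCII.

```python
from codecs import BOM_UTF16_BE  # b"\xfe\xff"


def strip_utf16_ascii(data: bytes) -> bytes:
    """Drop UTF-16BE BOMs and NUL bytes, leaving plain ASCII behind."""
    return data.replace(BOM_UTF16_BE, b"").replace(b"\x00", b"")


# ASCII text in UTF-16BE is just each character preceded by a NUL byte.
sample = BOM_UTF16_BE + "Name".encode("utf-16-be")
print(sample)                     # b'\xfe\xff\x00N\x00a\x00m\x00e'
print(strip_utf16_ascii(sample))  # b'Name'
```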
