Update of bug #58206 (project groff):

                  Status:               Need Info => In Progress

_______________________________________________________

Follow-up Comment #14:

I'm mostly unblocked.

The problem with the original problematic file (the "angular 1280x800"
thing) appears to be that its Title property is encoded in UTF-16BE.

$ xxd angular-1280-800.pdf | sed -n '/459f0/,/45a30/p'
000459f0: 3c3c 0a2f 5469 746c 6520 3c30 3036 3130  <<./Title <00610
00045a00: 3036 4530 3036 3730 3037 3530 3036 4330  06E00670075006C0
00045a10: 3036 3130 3037 3230 3032 4430 3033 3130  0610072002D00310
00045a20: 3033 3230 3033 3830 3033 3030 3032 4430  03200380030002D0
00045a30: 3033 3830 3033 3030 3033 3030 3030 303e  038003000300000>

You don't see that a lot these days, with the success of the global
campaign to exterminate big-endian desktop (and mobile) computing.

So this is what pdfinfo ends up doing with that.

$ pdfinfo angular-1280-800.pdf | xxd
00000000: 5469 746c 653a 2020 2020 2020 2020 2020  Title:          
00000010: 0061 006e 0067 0075 006c 0061 0072 002d  .a.n.g.u.l.a.r.-
00000020: 0031 0032 0038 0030 002d 0038 0030 0030  .1.2.8.0.-.8.0.0
00000030: 0000 0a50 726f 6475 6365 723a 2020 2020  ...Producer:    
00000040: 2020 2068 7474 7073 3a2f 2f69 6d61 6765     https://image
00000050: 6d61 6769 636b 2e6f 7267 0a43 7265 6174  magick.org.Creat
00000060: 696f 6e44 6174 653a 2020 204d 6f6e 2041  ionDate:   Mon A
00000070: 7072 2032 3020 3034 3a33 333a 3434 2032  pr 20 04:33:44 2
00000080: 3032 3020 4145 5354 0a4d 6f64 4461 7465  020 AEST.ModDate
00000090: 3a20 2020 2020 2020 204d 6f6e 2041 7072  :        Mon Apr
000000a0: 2032 3020 3034 3a33 333a 3434 2032 3032   20 04:33:44 202
000000b0: 3020 4145 5354 0a54 6167 6765 643a 2020  0 AEST.Tagged:  
000000c0: 2020 2020 2020 206e 6f0a 5573 6572 5072         no.UserPr
000000d0: 6f70 6572 7469 6573 3a20 6e6f 0a53 7573  operties: no.Sus
000000e0: 7065 6374 733a 2020 2020 2020 206e 6f0a  pects:       no.
000000f0: 466f 726d 3a20 2020 2020 2020 2020 2020  Form:           
00000100: 6e6f 6e65 0a4a 6176 6153 6372 6970 743a  none.JavaScript:
00000110: 2020 2020 206e 6f0a 5061 6765 733a 2020       no.Pages:  
00000120: 2020 2020 2020 2020 310a 456e 6372 7970          1.Encryp
00000130: 7465 643a 2020 2020 2020 6e6f 0a50 6167  ted:      no.Pag
00000140: 6520 7369 7a65 3a20 2020 2020 2031 3238  e size:      128
00000150: 3020 7820 3830 3020 7074 730a 5061 6765  0 x 800 pts.Page
00000160: 2072 6f74 3a20 2020 2020 2020 300a 4669   rot:       0.Fi
00000170: 6c65 2073 697a 653a 2020 2020 2020 3238  le size:      28
00000180: 3539 3337 2062 7974 6573 0a4f 7074 696d  5937 bytes.Optim
00000190: 697a 6564 3a20 2020 2020 206e 6f0a 5044  ized:      no.PD
000001a0: 4620 7665 7273 696f 6e3a 2020 2020 312e  F version:    1.
000001b0: 330a                                     3.

In other words, it simply blasts the encoded bytes to its own output in
utter indifference to the character encoding used by the output device.

For an information-extraction tool whose entire purpose is
human-readable output, that seems a dubious decision to me.

But we're stuck with it for the time being (unless a PDFPIC user wants
to migrate to Deri's lower-level alternative in comment #7, which
leverages the output driver).

I'll see if I can force a UTF-16 Title property onto gnu.eps so that I
can craft a proper regression test.

_______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?58206>

_______________________________________________
  Message sent via Savannah
  https://savannah.gnu.org/
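
For anyone following along, here is a minimal sketch (in Python, not
part of groff or PDFPIC) of how the Title bytes from the first dump read
once they are treated as UTF-16BE. The names TITLE_HEX and decode_title
are made up for illustration; the hex digits are copied from the xxd
output of the PDF. Stripping the trailing NUL is an assumption about
what a consumer would want, not something pdfinfo does.

```python
# Hex digits of the /Title hex string, copied from the xxd dump of the PDF.
TITLE_HEX = (
    "0061006E00670075006C00610072002D"
    "0031003200380030002D0038003000300000"
)


def decode_title(hex_digits: str) -> str:
    """Decode a PDF hex string as UTF-16BE and drop trailing NULs.

    Hypothetical helper for illustration only.
    """
    raw = bytes.fromhex(hex_digits)
    # Skip a big-endian byte order mark if one is present; this
    # particular Title has none.
    if raw.startswith(b"\xfe\xff"):
        raw = raw[2:]
    return raw.decode("utf-16-be").rstrip("\x00")


if __name__ == "__main__":
    print(decode_title(TITLE_HEX))
```

The NUL bytes interleaved with the ASCII letters in the pdfinfo output
are the high-order bytes of these same 16-bit code units, passed through
undecoded.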