On Thu, Jan 17, 2008 at 10:21:11AM +0100, Tomas Pospisek <[EMAIL PROTECTED]> was heard to say:
> On Thu, 17 Jan 2008, Christian Perrier wrote:
>
>> Quoting Tomas Pospisek ([EMAIL PROTECTED]):
>>
>>>> The file *is* UTF-8 from what I see.
>>>
>>> I'm looking at it through konsole, which runs bash. Konsole's encoding
>>> is set to "Default". If I set it to UTF8, it still doesn't render.
>>
>> It does, in the exact same conditions on my system.
>
> I did this:
>
>   $ apt-cache show ttf-ecolier-lignes-court > /tmp/k
>   $ vim /tmp/k
>
> and clearly, the problem here is *not* the displaying/decoding/the fonts.
> The problem is apt-cache, since if I look at the produced output in
> /tmp/k it's still cut off at:
>
>   "Description: cursive roman font (with r"

I can confirm that pkgRecords::Parser::ShortDesc() returns a truncated
string for this Description if I run it with LC_ALL=C. It looks like
apt's description extraction routine attempts to transcode it from UTF-8
to the current locale's codeset without checking for errors. As a
result, the string gets truncated at the first character that can't be
translated.

This is a bit odd, since iconv(3) and the glibc docs say that iconv
stops when the output buffer is full or when an invalid or incomplete
character is encountered in the input buffer. Untranslatable characters
aren't mentioned. If I instrument UTF8ToCodeset to save the return value
of iconv and the value of errno afterwards (neither of which gdb lets me
do, grr), I can see that iconv returns -1 and sets errno to EILSEQ,
"invalid byte sequence". This seems wrong, since the source codeset is
hardcoded to UTF-8, but maybe iconv is just reporting that as the
closest analogue to "I couldn't represent this code point in the output
encoding".

Anyway, the result of all this is that you get a truncated string. I'd
suggest taking a look at what aptitude does to handle errors (basically
inserting "?" characters at locations that can't be converted).
Actually, I'd even go so far as to suggest that apt should just return
all strings in UTF-8 rather than trying to be clever and guess what the
client code wants, but it's probably way too late for that :-/.

  Daniel
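
P.S. To make the "?" suggestion concrete, here is a minimal sketch of
that kind of error handling. It is not apt's or aptitude's actual code:
the function name utf8_to_codeset_lossy is hypothetical, and the sketch
assumes the behaviour described above, where iconv fails with EILSEQ on
a character it cannot represent and leaves the input pointer at that
character.

```cpp
#include <iconv.h>
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>

// Hypothetical helper (not apt's UTF8ToCodeset): convert UTF-8 text to
// `codeset`, writing '?' for every character iconv cannot convert.
std::string utf8_to_codeset_lossy(const std::string &in, const char *codeset)
{
    iconv_t cd = iconv_open(codeset, "UTF-8");
    if (cd == (iconv_t)-1)
        return in;  // unknown codeset: fall back to the original bytes

    std::string out;
    char buf[256];
    char *src = const_cast<char *>(in.data());  // iconv never writes to the input
    size_t src_left = in.size();

    while (src_left > 0) {
        char *dst = buf;
        size_t dst_left = sizeof(buf);
        size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
        out.append(buf, sizeof(buf) - dst_left);  // keep whatever was converted

        if (rc != (size_t)-1)
            break;      // all input consumed
        if (errno == E2BIG)
            continue;   // output buffer full: go round again with a fresh one

        // EILSEQ or EINVAL: src points at the offending byte sequence.
        // Emit a placeholder instead of giving up on the rest of the string.
        out += '?';
        // Skip the lead byte, then any UTF-8 continuation bytes (10xxxxxx).
        ++src;
        --src_left;
        while (src_left > 0 &&
               (static_cast<unsigned char>(*src) & 0xC0) == 0x80) {
            ++src;
            --src_left;
        }
    }

    iconv_close(cd);
    return out;
}
```

With this, converting "r\xc3\xa8gle" (that is, "règle" in UTF-8) to
"ASCII" should give "r?gle" rather than stopping at "r". A stateful
target encoding would also need a final iconv call with a NULL input to
flush the shift state, which the sketch leaves out.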

