Hi,

First, thanks for replying. I appreciate it.

At 03:54 AM 11/30/2006, [EMAIL PROTECTED] wrote:
>On Wed, Nov 29, 2006 at 05:25:25PM -0800, Daniel Yek wrote:
> > Hi,
> >
> > I am attempting to handle raw filenames (which may be encoded differently
> > than the character set used by the filesystem) gracefully.
> >
>[...]
> > with a raw character outside of UTF-8 character set):
> >
> > Character:  P  r  e  s  e  n  t  a  c  i  ó  n  ó     .  s  x  i
> > Hex code:   50 72 65 73 65 6e 74 61 63 69 f3 6e c3 b3 2e 73 78 69
> >
> > To be converted to this:
> > Character:  P  r  e  s  e  n  t  a  c  i  %  f  3  n  ó     .  s  x  i
> > Hex code:   50 72 65 73 65 6e 74 61 63 69 25 66 33 6e c3 b3 2e 73 78 69
>
>And how is the converter supposed to guess that this "raw character"
>(here 0xf3 and perhaps lots of following bytes) has to be interpreted as
>an iso-8859-1 (or iso-8859-2) encoded thing (what you seem to imply
>here)?

No, I don't think I implied that. I stated that I want to handle "raw"
filenames gracefully, whatever their encoding, which I can't determine and
don't care about.

I understand your point: if the original character set is not specified,
there is no way to detect it from the bytes alone, because they are
ambiguous. So, just call it "raw".

Often it is adequate to interpret the byte sequence on a best-effort basis.
g_filename_display_name() does that (so this answers your "question"),
except that I don't like how the illegal character (now U+FFFD) is
rendered -- it looks seriously "broken" and somewhat annoying. It would be
better to show illegal bytes in an easier-to-understand manner, such as an
octal or hex escape sequence, or even a question mark.

Well, with g_utf8_validate(), it is trivial to implement a function that
escapes non-UTF-8 bytes as hex. However, I then found out that TreeView, or
more likely Pango, would unescape the %xx sequence (undoing my attempt to
help it) and choke!
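
For reference, here is a minimal sketch of the kind of escaping function I
mean. It is only an illustration: the names utf8_seq_len() and
escape_invalid_utf8() are made up for this mail, and the UTF-8 check is
hand-written so the snippet compiles without GLib. With GLib one would
presumably loop on g_utf8_validate() instead and use its "end" out-parameter
to find each invalid byte.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the length (1-4) of the valid UTF-8 sequence
 * starting at s, or 0 if the bytes at s do not form one. n is the number
 * of bytes remaining in the buffer. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    size_t len, i;
    unsigned long cp, min;

    if (n == 0)
        return 0;
    if (s[0] < 0x80)
        return 1;                                  /* plain ASCII */
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; cp = s[0] & 0x07; min = 0x10000; }
    else
        return 0;              /* stray continuation or invalid lead byte */

    if (len > n)
        return 0;                                  /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                              /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values past U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;
    return len;
}

/* Hypothetical helper: copy raw, replacing every byte that is not part of
 * a valid UTF-8 sequence with "%xx". The caller frees the result. */
static char *escape_invalid_utf8(const char *raw)
{
    const unsigned char *s = (const unsigned char *)raw;
    size_t n = strlen(raw), i = 0, o = 0;
    char *out = malloc(n * 3 + 1);   /* worst case: every byte becomes %xx */

    if (out == NULL)
        return NULL;
    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len == 0) {
            sprintf(out + o, "%%%02x", (unsigned int)s[i]);
            o += 3;
            i += 1;
        } else {
            memcpy(out + o, s + i, len);
            o += len;
            i += len;
        }
    }
    out[o] = '\0';
    return out;
}
```

Called on the filename bytes from my first mail, this turns the lone 0xf3
into "%f3" and leaves the valid c3 b3 pair alone. Note that it does not
escape a literal '%', so the output cannot be reliably unescaped; that
seemed fine for display purposes.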

I'm now quite sure that it is not worth the effort to handle a case like
this, even though I think it should be doable and pain-free (if not for
quite so many things in GLib getting in the way).

More random thoughts:
Is there a way to ask Pango to render illegal UTF-8 bytes as the more
pleasant rectangle with the hex number in it (as it does when a font is
not installed), rather than printing cryptic messages on the terminal?

Thanks much.


-- 
Daniel Yek




>This could be as well an "ђ" or an "ς" (to cite some unibyte
>encodings. Going multibyte might be even more fun).
>
>That means you'll have to handle those decisions yourself. Maybe the
>libc routines iconv_open()/iconv()/iconv_close() help you with that
>(they try to convert up to an illegal sequence, stop there and tell you).
>
>HTH
>- -- tomás

_______________________________________________
gtk-app-devel-list mailing list
gtk-app-devel-list@gnome.org
http://mail.gnome.org/mailman/listinfo/gtk-app-devel-list
