2006/1/27, Isak Savo <[EMAIL PROTECTED]>:
> 2006/1/27, Axel Liljencrantz <[EMAIL PROTECTED]>:
> > 2006/1/26, Isak Savo <[EMAIL PROTECTED]>:
> > > Something like this:
> > > $ ls<TAB>
> > > ls (List contents of directory)
> > > lsfoo (<<No description found>>)
> > > lspci (List all PCI devices)
> > > ...
> > >
> > > Perhaps the "No description found" part could be error colored or
> > > something. It could also be completely omitted, leaving just an empty
> > > parenthesis.
> >
> > In this particular case, that can be done. I was thinking about the
> > general case. For example, on my machine the 'locale -a' command seems
> > to output locale names in the native charset of the locale, which
> > will sometimes result in invalid strings. What should a command like:
> >
> > for i in (locales -a)
> > ...
> > end
> >
> > do?
> >
> > Should it skip the broken strings? Try to guess what they are?
> > Skip the broken characters? Maybe the whole command should fail?
>
> Why would you want to skip them? Imagine the following:
>
> for i in (locales -a)
> process_string($i);
> end
>
> process_string() might handle, or even depend on, $i being in weird
> charsets. I'm not familiar with the fish internals, but the logical
> thing would be to not care unless the string is being printed to the
> user.
>
> Isak
>
> PS. I have no idea how other shells handle this. I'm basically arguing
> theoretical points here :-)
After occasionally thinking about this problem for something like two years, I today reached enlightenment.

I have come up with a way to support arbitrary byte sequences, completely independent of the specified character set, in an application that internally uses wide character strings. My approach uses a Unicode private use area to give each illegal byte value a unique wide-string representation, and makes sure that conversions to and from wide strings respect this mapping as well. This means that the 'encoding muck' can be handled exclusively by the character set conversion functions. It also means that all regular wide character functions, including length calculations, keep working.

Lingering problems:

* All unknown/illegal characters are assumed to be one byte long. This means that wildcard matches using '?' may give the wrong result when the illegal characters used a multibyte encoding. This is pretty hard to avoid, and it should be extremely rare.

* The terminal will still have to somehow display broken characters. I'm thinking that maybe the completion code should use a special backslash escape for broken characters. Perhaps \Xxx, where xx is the hexadecimal value of the illegal byte. (Note the uppercase X, which specifies a raw byte, in contrast to \xxx, i.e. a lowercase x, which specifies a character that will be encoded in the locale's character set, possibly making it more than one byte long.)

There is a patch in the Darcs repo implementing the above behaviour. Everything seems to work pretty nicely. As near as I can tell, this solution removes all drawbacks associated with wide characters except the increased memory usage.

In a UTF-8 locale, one can try the new functionality by typing something like:

mkdir foo
touch foo/\Xaa
touch foo/\Xbb
cat foo/*
echo foo/<TAB>

Everything should work as expected, even though the two files do not have filenames that are valid UTF-8.

--
Axel
