Re: Possible Unicode Problems in Busybox - Collect and Discussion

Denys Vlasenko Wed, 13 Aug 2014 09:08:35 -0700

On Wed, Aug 13, 2014 at 3:42 PM, Harald Becker <ra...@gmx.de> wrote:
>
>> I've seen several implementations which use mbtowc functions to test some
>> special chars; this is not correct for UTF-8 in my opinion.
>
>
> To count the number of UTF-8 characters is really simple: just count all
> bytes except those with a value in the range 0x80 to 0xBF. The two exceptions
> are 0xFE and 0xFF, which never occur in valid UTF-8, but I think it's not
> wrong to count them as characters anyway.
>
>
> Counting can be done with one logical and one compare instruction:
>
> if ((c ^ 0x40) < 0xC0) n++;
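
For illustration, the quoted test can be wrapped into a complete counting
loop. This is only a minimal sketch: the name utf8_count_chars() is
hypothetical and is not a BusyBox function, and the code assumes the input
is a NUL-terminated string and does no validation of the UTF-8 encoding.
The byte has to be treated as unsigned for the comparison to work.

```c
#include <stddef.h>

/* Hypothetical helper, not part of BusyBox: count UTF-8 characters by
 * counting every byte that is not a continuation byte (0x80..0xBF).
 * The input is not validated. */
static size_t utf8_count_chars(const char *s)
{
        size_t n = 0;

        for (; *s != '\0'; s++) {
                /* plain char may be signed, so convert before testing */
                unsigned char c = (unsigned char)*s;

                /* XOR with 0x40 maps 0x80..0xBF to 0xC0..0xFF and all
                 * other byte values below 0xC0 */
                if ((c ^ 0x40) < 0xC0)
                        n++;
        }
        return n;
}
```

With this sketch, the two-byte sequence 0xC3 0xA9 counts as one character,
while a plain ASCII string gives the same result as strlen().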


include/{libbb,unicode}.h already have a bunch of helpers
to do unicode_strlen(), and a few other typical functions:

typedef struct uni_stat_t {
        unsigned byte_count;
        unsigned unicode_count;
        unsigned unicode_width;
} uni_stat_t;
/* Returns a string with unprintable chars replaced by '?' or
 * SUBST_WCHAR. This function is unicode-aware. */
const char* FAST_FUNC printable_string(uni_stat_t *stats, const char *str);

/* Number of unicode chars. Falls back to strlen() on invalid unicode */
size_t FAST_FUNC unicode_strlen(const char *string);
/* Width on terminal */
size_t FAST_FUNC unicode_strwidth(const char *string);
enum {
        UNI_FLAG_PAD = (1 << 0),
};
char* FAST_FUNC unicode_conv_to_printable(uni_stat_t *stats, const char *src);
char* FAST_FUNC unicode_conv_to_printable_fixedwidth(/*uni_stat_t *stats,*/
                const char *src, unsigned width);
_______________________________________________
busybox mailing list
busybox@busybox.net
http://lists.busybox.net/mailman/listinfo/busybox
