[Chicken-users] problems string-trimming on UTF8

Kristian Lein-Mathisen Fri, 27 Jan 2017 05:37:21 -0800

Dear CHICKEN mailing list,

I encountered a strange issue with string-trim-right on a UTF-8 string:


$ csi -R srfi-13 -p '(string-trim "Zazà")'
Zazà


So far so good!

$ csi -R srfi-13 -p '(string-trim-right "Zazà")'
Zaz�


Oh no, what happened?

$ csi -R utf8 -R srfi-13 -p '(string-trim-right "Zazà")'
Zaz�


Loading the utf8 egg doesn't seem to fix it! But with utf8 loaded, at least
string-length comes out right:

$ csi -R srfi-13 -p '(string-length "Zazà")'
5
$ csi -R utf8 -R srfi-13 -p '(string-length "Zazà")'
4


It took me a while to figure out what was going on. These are the bytes of
Zazà:

$ printf 'Zazà' | xxd
00000000: 5a61 7ac3 a0                             Zaz..
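
For comparison, here is a minimal Python sketch (Python is used only for
illustration and is not part of the CHICKEN setup above). It reproduces the
same five bytes and shows that the two string-length results, 5 and 4, match
counting bytes versus counting code points:

```python
# "Zazà" with à written as the precomposed code point U+00E0.
text = "Zaz\u00e0"

# UTF-8 encodes U+00E0 as the two bytes c3 a0.
encoded = text.encode("utf-8")

byte_count = len(encoded)        # number of bytes
code_point_count = len(text)     # number of code points

print(encoded.hex())                 # 5a617ac3a0
print(byte_count, code_point_count)  # 5 4
```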


So it seems that string-trim-right just looks at the last byte, \xa0, which
on its own is a non-breaking space <https://en.wikipedia.org/wiki/Non-breaking_space>,
treats it as whitespace, and drops it. That leaves a dangling \xc3 byte, hence
the broken character in the output. It should be looking at the last UTF-8
code point instead.
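
The suspected mechanism can be sketched in Python. This is only an
illustration, assuming the trim works byte by byte and counts \xa0 as
whitespace. It is not the actual srfi-13 code, and the helper names
trim_right_bytes and trim_right_code_points are hypothetical:

```python
# Hypothetical helpers for illustration; not the srfi-13 implementation.

# Assumption: the byte-wise trim treats 0xa0 (NBSP in Latin-1) as whitespace.
BYTE_WHITESPACE = b" \t\n\r\x0b\x0c\xa0"


def trim_right_bytes(data: bytes) -> bytes:
    """Strip trailing whitespace one byte at a time."""
    return data.rstrip(BYTE_WHITESPACE)


def trim_right_code_points(data: bytes) -> bytes:
    """Decode first, strip trailing whitespace per code point, re-encode."""
    return data.decode("utf-8").rstrip().encode("utf-8")


raw = "Zaz\u00e0".encode("utf-8")      # bytes 5a 61 7a c3 a0

broken = trim_right_bytes(raw)         # drops a0, leaves a dangling c3
intact = trim_right_code_points(raw)   # à is not whitespace, nothing dropped

print(broken)                                    # b'Zaz\xc3'
print(broken.decode("utf-8", errors="replace"))  # Zaz�
print(intact.decode("utf-8"))                    # Zazà
```

The byte-wise version reproduces the truncated output seen above, while
decoding to code points before trimming leaves the string untouched.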

I don't know if this is a known bug or if I've come across something
undiscovered. I suppose the fix belongs in the utf8 egg.

Thanks!
K.

_______________________________________________
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users
