Hello, I am getting a strange result from unit irregex having to do with matching character sets.
I recently upgraded to 4.13.0 to get the bug fix having to do with an extra empty list in the SRE: https://github.com/ashinn/irregex/pull/18. I was happy to find that "[]" bracketed character sets without "^" are working beautifully! I am, however, observing strange things with the "^" exclusion character.

The ⾀ character has three bytes and, when displayed in byte form, looks like `\342\276\200`.

INPUT:

    (use irregex) ; Not doing (use utf8) because I want start-index and end-index to function correctly
    (irregex-match-substring (irregex-search (irregex "[^⾀]" 'utf8) "⾀⾀⾀"))

EXPECTED OUTPUT:

    Considering a UTF-8 character as a single character anywhere it appears: `#f`
    Considering a UTF-8 character as a single character sometimes and a byte string sometimes: `<the first byte of ⾀>` (displayed as `\342`), or #f
    Considering a UTF-8 character as a byte string always: #f

OUTPUT:

    `<the first byte of ⾀><the second byte of ⾀>` (looks like `\342\276`)

EVEN WORSE:

    (irregex-match-substring (irregex-search (irregex "[^Ç]" 'utf8) "Ç"))
    ---> "Ç" ; A two-byte character

Am I doing something wrong? Is "^" not designed to be used with multibyte characters? Why would it return two bytes and not 0, 1, or 3?

Thank you!