Hey Matt,
On 27 March 2015 21:19 CET, Matt Gushee wrote:
> That's a fair question. I was working on a toy XML parser as a learning
> exercise, and I thought hmm ... this should support UTF-8. So I attempted
> to use utf8-srfi-14 in place of regular srfi-14; then certain parsing
> functions didn't work as expected. I also looked at the comparse source,
> and saw that it imports [non-UTF8] srfi-13 and -14.
ah, that's what you are referring to, I see! It's like that because I
didn't want to force a utf8 dependency on the user. The affected parser
combinators are easily constructed from more primitive ones, though. For
example, you could define a utf8 char-set aware `in' combinator like
this:
(use comparse)
(use utf8-srfi-14)
(define (utf8-in cs)
  (satisfies (lambda (c) (char-set-contains? cs c))))
> #;3> (parse (is nichi) sake)
> #f
> #<parser-input #\� #\� #\� #\� #\� #\� #\� #\� #\�>
> ; 2 values
> Not useful for arbitrary-language text ...
This doesn't work because `sake' is a string and Comparse operates on
the byte level by default (in accordance with CHICKEN core string
procedures). You need to decode the input as UTF-8 to make it work as
expected, e.g. using the utf8 egg's `string->list' procedure:
#;162> (parse (is nichi) (string->list sake))
#\x65e5
#<parser-input lazy-seq #\x672c #\x9152>
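Combining this with a char-set aware combinator, a minimal sketch could look like the following. It assumes the utf8 and utf8-srfi-14 eggs are installed and that `sake' is bound to the string "日本酒", which is what the output above suggests; `kanji' is a name made up for this example.

```scheme
(use comparse)
(use utf8)          ; UTF-8 aware string procedures, including string->list
(use utf8-srfi-14)  ; char-sets that can hold non-ASCII characters

;; Match a single item that is a member of the utf8 char-set CS.
(define (utf8-in cs)
  (satisfies (lambda (c) (char-set-contains? cs c))))

;; Example data (assumed, not taken from the original session).
(define kanji (char-set #\x65e5 #\x672c))
(define sake "日本酒")

;; Decode the string into a list of characters first, so that the
;; parser sees whole characters instead of single bytes.
(parse (utf8-in kanji) (string->list sake))
```

As with the `is' example, this should return the first decoded character together with the remaining input.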
> #;4> (parse item sake)
> #\
> #<parser-input #\� #\� #\� #\� #\� #\� #\� #\�>
> ; 2 values
Same here.
> Not so good. BTW, I also tried wrapping the text with (->parser-input ...).
> Didn't seem to make any difference.
Yeah, `parse' does that implicitly for you.
> #;7> (parse ident sake)
> (#\� ())
> #<parser-input #\� #\� #\� #\� #\� #\� #\� #\�>
> ; 2 values
> ???
Same as above applies.
> #;9> (parse ident h1)
> #f
> #<parser-input #\h #\1>
> ; 2 values
> Wrong. Or at least, quite unexpected.
This is because you are passing a utf8 char-set to `in' now which does
not satisfy srfi-14's `char-set?' predicate so it will treat it as a
possible input value to match against with `eq?' (or `memq', to be
precise). Maybe we should have two different `in's instead of a single
polymorphic one to make it a bit more strict. It seemed like a nice API
at the time ;-)
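To illustrate what two stricter variants might look like, here is a rough sketch. The names `in-char-set' and `in-items' are made up for this example and are not part of Comparse, and it assumes that utf8-srfi-14 exports `char-set?'.

```scheme
(use comparse)
(use utf8-srfi-14)

;; Hypothetical strict variant: accepts only a (utf8) char-set and
;; signals an error for anything else instead of silently falling
;; back to item comparison.
(define (in-char-set cs)
  (if (char-set? cs)
      (satisfies (lambda (c) (char-set-contains? cs c)))
      (error "in-char-set: not a char-set" cs)))

;; Hypothetical strict variant: accepts only a list of candidate
;; items, which are compared with memq.
(define (in-items items)
  (if (list? items)
      (satisfies (lambda (c) (and (memq c items) #t)))
      (error "in-items: not a list" items)))
```

With this split, passing the wrong kind of char-set fails loudly at construction time rather than producing a parser that never matches.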
> #;10> (parse ident sake)
> #f
> #<parser-input #\� #\� #\� #\� #\� #\� #\� #\� #\�>
> ; 2 values
> Also not what we want.
Same cause here, of course.
> So, while I can see that it is possible to use certain combinators with
> non-ASCII text, this does not seem like proper UTF-8 support to me. Or is
> there some way to set up the environment or prepare the input that would
> prevent these issues?
We could create a comparse-utf8 egg to facilitate this. It's not
currently on my agenda but I will put it in my Comparse notes for future
reference. If you feel inclined to create one, I'm happy to provide you
with code review and feedback!
Hope that helps
Moritz
___
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users