Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want
Hello!

> As I said, I'm a neophyte. My character classes were based around [a-zA-Z] etc. So you can readily see why the pattern would have quickly become unreasonably complex.

If you don't need any exotic characters, just ASCII (and, probably, a small superset of Unicode), character classes would be extremely simple:

    (use irregex utf8)

    ; Cyrillic letters range:
    (define cyrl '(/ #\u0400 #\u05012))

    (define (split-into-classes s)
      (irregex-extract
        `(or (+ (or alpha ,cyrl))
             (+ num)
             (+ punct)
             (+ white)
             (+ (~ alpha num punct white ,cyrl)))
        s))

Note that I'm also a kind of a neophyte, so there may be a better way to do this. :)

Then you can use this procedure like this:

    ; In Linux/Cygwin you can input "Hello world! Да." directly, but not in Windows console
    (split-into-classes "Hello world! \u0414\u0430.")
    => ("Hello" " " "world" "!" " " "Да" ".")

But extending this procedure to cover the whole of Unicode would be tricky.

> I was planning on using Chicken to learn scheme, since R7RS is supposed to be based more on R5RS than on R6RS, but maybe it's better to learn using Racket.

It doesn't matter what tools you use as long as you have a desire to learn. I was personally put off by Racket's extremely slow loading time. Also note that I believe Racket doesn't have a built-in solution to split a string into character classes either.

> (I *do* need to use utf-8 in lots of places, and an incomplete implementation while I was learning would be ... unpleasant. Particularly if the user documentation presumed that it *was* complete.)

What made you think it's incomplete? :o The Windows console's UTF-8 support is incomplete, but on Chicken's side everything is OK.

--
Yours sincerely,
Dmitry Kushnariov

___
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users
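The same kind of split can also be sketched without irregex, by walking the string and starting a new piece whenever the character class changes. This is only an illustration, not anything taken from irregex or the utf8 egg: char-class and split-at-class-boundaries are made-up names, the punctuation list covers ASCII only, and the classification of non-ASCII characters is left to whatever the implementation's char-alphabetic?, char-numeric? and char-whitespace? do.

```scheme
;; Sketch: split a string at character-class boundaries, keeping every piece.
;; char-class and split-at-class-boundaries are hypothetical helper names.

;; Classify one character. Only ASCII punctuation is listed explicitly.
(define (char-class c)
  (cond ((char-alphabetic? c) 'alpha)
        ((char-numeric? c)    'num)
        ((char-whitespace? c) 'white)
        ((memv c (string->list "!\"#$%&'()*+,-./:;<=>?@[\\]^_`{|}~")) 'punct)
        (else 'other)))

(define (split-at-class-boundaries s)
  (let loop ((chars (string->list s))
             (run '())   ; characters of the current run, in reverse order
             (cls #f)    ; class of the current run
             (acc '()))  ; finished runs, in reverse order
    (cond ((null? chars)
           ;; End of input: flush the last run, if there is one.
           (reverse (if (null? run)
                        acc
                        (cons (list->string (reverse run)) acc))))
          ((or (null? run) (eq? (char-class (car chars)) cls))
           ;; Same class (or very first character): extend the current run.
           (loop (cdr chars)
                 (cons (car chars) run)
                 (char-class (car chars))
                 acc))
          (else
           ;; Class changed: close the current run and start a new one.
           (loop (cdr chars)
                 (list (car chars))
                 (char-class (car chars))
                 (cons (list->string (reverse run)) acc))))))
```

With this, (split-at-class-boundaries "Hello world! 42") evaluates to ("Hello" " " "world" "!" " " "42"), so nothing is discarded at the split points.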
Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want
Errata. ^^

> and, probably, a small superset of Unicode

subset

> (define cyrl '(/ #\u0400 #\u05012))

Of course this should be #\u0512

--
Yours sincerely,
Dmitry Kushnariov
Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want
Hi Charles,

Charles Hixson charleshi...@earthlink.net writes:

> What I want is something that will split strings at character class boundaries (alpha, numeric, punctuation, white, other), and NOT discard the places where it splits. Is there a better choice than irregex? The pattern for doing that on a utf-8 string gets quite messy.

Irregex should do the trick. How does it get messy for you? Can you show how you tried it?

Moritz
Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want
On Fri, Jul 20, 2012 at 03:05:39PM +0400, Дмитрий wrote:

> Hello. Does IrRegex support Unicode character classes?

Generally, it does and there are at least a few tests for these. However, I've never worked with these kinds of characters myself, so I don't know how well they're supported. The docs also explicitly have a warning that case insensitive matches do not work for non-ASCII characters, so YMMV.

> E.g. Will IrRegex consider accented letters (á) or Cyrillic letters (я) as alpha? Will IrRegex consider Chinese wide space ( ) as space? Will IrRegex consider Chinese brackets (「」【】) as punct?

No, almost all of the named character classes are ASCII only.

> If it doesn't, the regexp is going to be EXTREMELY messy [in fact, I believe it may be better to build such a regexp automatically then].

There are a few (undocumented?!) helper character classes like utf8-tail-char, utf8-2-char, utf8-3-char and utf8-4-char. See the source for details. I don't know what Alex's plan is for UTF8 support, but if you're willing to put in the effort to define character classes for the ranges you mentioned, possibly you could contribute them to the (upstream) irregex project. If the definitions of these sets are big, maybe we could turn them into an optional add-in library.

> I'm on Windows, so I can't check it (when I use UTF-8 console via chcp 65001, for some reason Chicken seems to fail on every operation on a non-ASCII string, even on a simple (display "Привет")).

This could be due to terminal and locale settings.

Cheers,
Peter
--
http://sjamaan.ath.cx
--
The process of preparing programs for a digital computer is especially attractive, not only because it can be economically and scientifically rewarding, but also because it can be an aesthetic experience much like composing poetry or music.
  -- Donald Knuth
Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want
Hello again. :)

> I don't know what Alex's plan is for UTF8 support, but if you're willing to put in the effort to define character classes for the ranges you mentioned, possibly you could contribute them to the (upstream) irregex project. If the definitions of these sets are big, maybe we could turn them into an optional add-in library.

Well, the problem is that Unicode is not really logical (like ASCII is), so there will be lots of very small subranges, and matching these will probably be inefficient.

As for the character classes, they can be generated quite easily from the UnicodeData.txt[1] file. We can get a general category[2] from each line of this file with something like (string->symbol (caddr (string-split line ";"))); then we just need to map the categories onto the appropriate character classes (e.g. Lu belongs to upper, alpha, alphanum, graph, etc.) and merge characters of the same category if they have adjacent codes. It's quite easy to do. If I'm not lazy I'll do this this weekend.

> This could be due to terminal and locale settings.

Well, UTF-8 in the Windows console is known to be seriously broken. If I were to need a UTF-8 console, I would install the Cygwin terminal; but right now I'm mostly happy with cp866.

[1] http://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt
[2] http://www.unicode.org/reports/tr44/#General_Category_Values

--
Yours sincerely,
Dmitry Kushnariov
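The two mechanical steps described above (pull the code point and general category out of each semicolon-separated line, then merge adjacent code points of the same category) might look roughly like the sketch below. The names split-fields, parse-line and merge-ranges are hypothetical, and the field splitter is written by hand so the sketch does not depend on any particular string-split. It assumes the entries arrive sorted by code point, it does not handle the "<..., First>" / "<..., Last>" line pairs that UnicodeData.txt uses for large ranges, and it leaves out the mapping from categories to classes such as alpha or upper.

```scheme
;; Sketch: turn UnicodeData.txt-style lines into merged code-point ranges
;; per general category. All procedure names are hypothetical helpers.

;; Split STR on the character DELIM, keeping empty fields.
(define (split-fields str delim)
  (let loop ((chars (string->list str)) (field '()) (fields '()))
    (cond ((null? chars)
           (reverse (cons (list->string (reverse field)) fields)))
          ((char=? (car chars) delim)
           (loop (cdr chars) '() (cons (list->string (reverse field)) fields)))
          (else
           (loop (cdr chars) (cons (car chars) field) fields)))))

;; "0041;LATIN CAPITAL LETTER A;Lu;..." -> (65 . Lu)
;; Field 1 is the hex code point, field 3 the general category.
(define (parse-line line)
  (let ((fields (split-fields line #\;)))
    (cons (string->number (car fields) 16)
          (string->symbol (caddr fields)))))

;; Merge sorted (code . category) pairs into (category first last) ranges.
(define (merge-ranges entries)
  (let loop ((entries entries) (ranges '()))
    (if (null? entries)
        (reverse ranges)
        (let ((code (car (car entries)))
              (cat  (cdr (car entries))))
          (if (and (pair? ranges)
                   (eq? cat (car (car ranges)))
                   (= code (+ 1 (caddr (car ranges)))))
              ;; Same category and adjacent code: extend the latest range.
              (loop (cdr entries)
                    (cons (list cat (cadr (car ranges)) code)
                          (cdr ranges)))
              ;; Otherwise start a new single-character range.
              (loop (cdr entries)
                    (cons (list cat code code) ranges)))))))
```

Fed the parsed lines for U+0041, U+0042 and U+0061, merge-ranges returns ((Lu 65 66) (Ll 97 97)): the two adjacent uppercase letters collapse into one range and the lowercase letter starts its own.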
Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want
On Fri, Jul 20, 2012 at 8:56 PM, Дмитрий dm...@yandex.ru wrote:

> As for the character classes, they can be generated quite easily from the UnicodeData.txt[1] file. We can get a general category[2] from each line of this file with something like (string->symbol (caddr (string-split line ";"))); then we just need to map the categories onto the appropriate character classes (e.g. Lu belongs to upper, alpha, alphanum, graph, etc.) and merge characters of the same category if they have adjacent codes. It's quite easy to do. If I'm not lazy I'll do this this weekend.

Full Unicode character classes and case handling are already in the utf8 egg. These are not yet integrated with irregex because irregex is written to be portable across any Scheme, and so it uses its own char-set implementation. When R7RS is released I'll re-package irregex accordingly.

Unfortunately, while the utf8 char-sets are very compact, the DFA conversion of large, sparse Unicode char-sets is quite large. I'd like eventually to make a non-backtracking NFA regex matcher which only compiles to DFA when you really need the speed.

In the meantime, a fast lookup table for the script of a character would be nice, and this could be used to tokenize a string of mixed-language text. I thought I had this and can't seem to find it anywhere...

--
Alex
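One simple shape such a lookup table could take is a sorted vector of ranges searched by bisection. The sketch below is only that, a sketch: script-table and script-of are made-up names, and the handful of entries are Unicode block boundaries used as a rough stand-in, not real per-character script data (which would have far more, and far smaller, ranges).

```scheme
;; Sketch: look up a rough "script" for a code point by binary search.
;; script-table and script-of are hypothetical names.

;; Sorted, non-overlapping (first last name) entries. These are a few block
;; boundaries standing in for real script assignments.
(define script-table
  (vector '(#x0041 #x005A latin)      ; A-Z
          '(#x0061 #x007A latin)      ; a-z
          '(#x0370 #x03FF greek)      ; Greek and Coptic block
          '(#x0400 #x04FF cyrillic))) ; Cyrillic block

;; Return the name of the range containing CODE, or #f if none does.
(define (script-of code)
  (let loop ((lo 0) (hi (- (vector-length script-table) 1)))
    (if (> lo hi)
        #f
        (let* ((mid (quotient (+ lo hi) 2))
               (entry (vector-ref script-table mid)))
          (cond ((< code (car entry))  (loop lo (- mid 1)))
                ((> code (cadr entry)) (loop (+ mid 1) hi))
                (else (caddr entry)))))))
```

Here (script-of #x0414) gives cyrillic, (script-of #x48) gives latin, and (script-of #x20) gives #f. A tokenizer for mixed-language text could then start a new token whenever the result changes from one character to the next.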
Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want
On 07/20/2012 04:05 AM, Дмитрий wrote:

> Hello. Does IrRegex support Unicode character classes? E.g. Will IrRegex consider accented letters (á) or Cyrillic letters (я) as alpha? Will IrRegex consider Chinese wide space ( ) as space? Will IrRegex consider Chinese brackets (「」【】) as punct?
>
> If it doesn't, the regexp is going to be EXTREMELY messy [in fact, I believe it may be better to build such a regexp automatically then].
>
> I'm on Windows, so I can't check it (when I use UTF-8 console via chcp 65001, for some reason Chicken seems to fail on every operation on a non-ASCII string, even on a simple (display "Привет")).
>
> --
> Yours sincerely,
> Dmitry Kushnariov

As I said, I'm a neophyte. My character classes were based around [a-zA-Z] etc. So you can readily see why the pattern would have quickly become unreasonably complex. I didn't find any definition of other character classes (well, not one that meant anything) and given the discussion, I think that they wouldn't have worked if I'd gotten to the point of testing them.

I was planning on using Chicken to learn scheme, since R7RS is supposed to be based more on R5RS than on R6RS, but maybe it's better to learn using Racket. I *trust* the conversion won't be too difficult.

(I *do* need to use utf-8 in lots of places, and an incomplete implementation while I was learning would be ... unpleasant. Particularly if the user documentation presumed that it *was* complete.)

--
Charles Hixson
Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want
* Charles Hixson charleshi...@earthlink.net [120720 20:24]:

> As I said, I'm a neophyte. My character classes were based around [a-zA-Z] etc. So you can readily see why the pattern would have quickly become unreasonably complex. I didn't find any definition of other character classes (well, not one that meant anything) and given the discussion, I think that they wouldn't have worked if I'd gotten to the point of testing them.

The character classes can be found here: http://api.call-cc.org/doc/srfi-14 for Latin-1; the utf8 egg contains, well, utf-8 ones...

HTH,
Christian

--
9 out of 10 voices in my head say, that I am crazy, one is humming.
Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want
On Sat, Jul 21, 2012 at 3:19 AM, Charles Hixson charleshi...@earthlink.net wrote:

> On 07/20/2012 04:05 AM, Дмитрий wrote:
>
>> Hello. Does IrRegex support Unicode character classes? E.g. Will IrRegex consider accented letters (á) or Cyrillic letters (я) as alpha? Will IrRegex consider Chinese wide space ( ) as space? Will IrRegex consider Chinese brackets (「」【】) as punct?
>>
>> If it doesn't, the regexp is going to be EXTREMELY messy [in fact, I believe it may be better to build such a regexp automatically then].
>>
>> I'm on Windows, so I can't check it (when I use UTF-8 console via chcp 65001, for some reason Chicken seems to fail on every operation on a non-ASCII string, even on a simple (display "Привет")).
>>
>> --
>> Yours sincerely,
>> Dmitry Kushnariov
>
> As I said, I'm a neophyte. My character classes were based around [a-zA-Z] etc. So you can readily see why the pattern would have quickly become unreasonably complex. I didn't find any definition of other character classes (well, not one that meant anything) and given the discussion, I think that they wouldn't have worked if I'd gotten to the point of testing them.
>
> I was planning on using Chicken to learn scheme, since R7RS is supposed to be based more on R5RS than on R6RS, but maybe it's better to learn using Racket. I *trust* the conversion won't be too difficult.
>
> (I *do* need to use utf-8 in lots of places, and an incomplete implementation while I was learning would be ... unpleasant. Particularly if the user documentation presumed that it *was* complete.)

The utf8 implementation is not incomplete. It's just not the default.

--
Alex