Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want

2012-07-21 Thread Дмитрий
Hello!

> As I said, I'm a neophyte. My character classes were based around
> [a-zA-Z] etc. So you can readily see why the pattern would have
> quickly become unreasonably complex.

If you don't need any exotic characters, just ASCII (and, probably, a small
subset of Unicode), character classes would be extremely simple:
(use irregex utf8)

; Cyrillic letters range:
(define cyrl '(/ #\u0400 #\u0512))

(define (split-into-classes s)
  (irregex-extract `(or (+ (or alpha ,cyrl)) (+ num)
                        (+ punct) (+ white)
                        (+ (~ alpha num punct white ,cyrl))) s))

Note that I'm also a kind of a neophyte, so there may be a better way to do 
this. :)

Then you can use this procedure like this:
; On Linux/Cygwin you can type "Hello world! Да." directly, but not in the
; Windows console
(split-into-classes "Hello world! \u0414\u0430.")
=> ("Hello" " " "world" "!" " " "Да" ".")

But extending this procedure to cover the whole Unicode would be tricky.

> I was planning on using Chicken to learn scheme, since R7RS is supposed
> to be based more on R5RS than on R6RS, but maybe it's better to learn
> using Racket.

It doesn't matter what tools you use as long as you have a desire to learn. I 
was personally put off by Racket's extremely slow loading time.

Also note that I believe Racket doesn't have a built-in solution to split a 
string into character classes either.

> (I *do* need to use utf-8 in lots of places, and an incomplete implementation
> while I was learning would be ... unpleasant. Particularly if the user
> documentation presumed that it *was* complete.)

What made you think it's incomplete? :o

The Windows console's UTF-8 support is incomplete, but on Chicken's side
everything is OK.

 -- 
Yours sincerely,
Dmitry Kushnariov

___
Chicken-users mailing list
Chicken-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/chicken-users


Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want

2012-07-21 Thread Дмитрий
Errata. ^^

> and, probably, a small superset of Unicode

subset

> (define cyrl '(/ #\u0400 #\u05012))

Of course this should be #\u0512

 -- 
Yours sincerely,
Dmitry Kushnariov



Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want

2012-07-20 Thread Moritz Heidkamp
Hi Charles,

Charles Hixson charleshi...@earthlink.net writes:
> What I want is something that will split strings at character class
> boundaries (alpha, numeric, punctuation, white, other), and NOT
> discard the places where it splits.  Is there a better choice than
> irregex?  The pattern for doing that on a utf-8 string gets quite
> messy.

Irregex should do the trick. How does it get messy for you? Can you
show how you tried it?

Moritz



Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want

2012-07-20 Thread Peter Bex
On Fri, Jul 20, 2012 at 03:05:39PM +0400, Дмитрий wrote:
> Hello.
>
> Does IrRegex support Unicode character classes?

Generally, it does and there are at least a few tests for these.
However, I've never worked with these kinds of characters myself,
so I don't know how well they're supported. The docs also explicitly
have a warning that case insensitive matches do not work for non-ASCII
characters, so YMMV.

> E.g. Will IrRegex consider accented letters (á) or Cyrillic letters (я) as
> alpha? Will IrRegex consider Chinese wide space ( ) as space? Will IrRegex
> consider Chinese brackets (「」【】) as punct?

No, almost all of the named character classes are ASCII only.

> If it doesn't, the regexp is going to be EXTREMELY messy [in fact, I believe
> it may be better to build such a regexp automatically then].

There are a few (undocumented?!) helper character classes like
utf8-tail-char, utf8-2-char, utf8-3-char and utf8-4-char.  See the source
for details.

I don't know what Alex's plan is for UTF8 support, but if you're willing
to put in the effort to define character classes for the ranges you
mentioned, possibly you could contribute them to the (upstream) irregex
project.  If the definitions of these sets are big, maybe we could turn them
into an optional add-in library.

> I’m on Windows, so I can’t check it (when I use a UTF-8 console via chcp 65001,
> for some reason Chicken seems to fail on every operation with a
> non-ASCII string, even on a simple (display "Привет")).

This could be due to terminal and locale settings.

Cheers,
Peter
-- 
http://sjamaan.ath.cx
--
The process of preparing programs for a digital computer
 is especially attractive, not only because it can be economically
 and scientifically rewarding, but also because it can be an aesthetic
 experience much like composing poetry or music.
-- Donald Knuth



Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want

2012-07-20 Thread Дмитрий
Hello again. :)

> I don't know what Alex's plan is for UTF8 support, but if you're willing
> to put in the effort to define character classes for the ranges you
> mentioned, possibly you could contribute them to the (upstream) irregex
> project. If the definitions of these sets are big, maybe we could turn them
> into an optional add-in library.

Well, the problem is that Unicode is not really logical (like ASCII is),
so there will be lots of very small subranges, and matching these will
probably be inefficient.



As for the character classes, they can be generated quite easily from the
UnicodeData.txt[1] file. We can get the general category[2] of each character
from this file with something like
(string->symbol (caddr (string-split line ";"))); then we just need to map the
categories to the appropriate character classes (e.g. Lu belongs to upper,
alpha, alphanum, graph, etc.) and merge characters of the same category into
ranges when they have adjacent codes.
It's quite easy to do. If I'm not lazy I'll do this this weekend.
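
A minimal sketch of the approach described above, assuming CHICKEN 4 (the
(use ...) form seen elsewhere in this thread). parse-line and merge-ranges are
hypothetical helper names, not part of irregex or the utf8 egg.
UnicodeData.txt is semicolon-separated, and it encodes a few large blocks
(such as the CJK ideographs) as First/Last marker pairs, which this sketch
does not expand.

```scheme
(use data-structures)   ; string-split (CHICKEN 4)

;; Parse one UnicodeData.txt line into (code-point . category-symbol).
;; Field 0 is the hexadecimal code point, field 2 the general category.
(define (parse-line line)
  (let ((fields (string-split line ";")))
    (cons (string->number (car fields) 16)
          (string->symbol (caddr fields)))))

;; Merge a code-point-sorted list of (code . category) pairs into
;; (first last category) ranges, joining adjacent codes of one category.
(define (merge-ranges entries)
  (let loop ((es entries) (out '()))
    (cond ((null? es) (reverse out))
          ((and (pair? out)
                (eq? (caddr (car out)) (cdar es))
                (= (+ 1 (cadr (car out))) (caar es)))
           ;; Extend the most recent range by one code point.
           (loop (cdr es)
                 (cons (list (caar out) (caar es) (cdar es)) (cdr out))))
          (else
           ;; Start a new single-character range.
           (loop (cdr es)
                 (cons (list (caar es) (caar es) (cdar es)) out))))))

(merge-ranges
 (map parse-line
      '("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
        "0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;"
        "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041")))
;; => ((65 66 Lu) (97 97 Ll))
```

Mapping each category symbol to the named classes (upper, alpha and so on)
would be a separate lookup step on top of these ranges.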

> This could be due to terminal and locale settings.

Well, UTF-8 in the Windows console is known to be seriously broken. If I
needed a UTF-8 console, I would install the Cygwin terminal; but right now
I'm mostly happy with cp866.

[1] http://www.unicode.org/Public/5.2.0/ucd/UnicodeData.txt
[2] http://www.unicode.org/reports/tr44/#General_Category_Values

 -- 
Yours sincerely,
Dmitry Kushnariov



Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want

2012-07-20 Thread Alex Shinn
On Fri, Jul 20, 2012 at 8:56 PM, Дмитрий dm...@yandex.ru wrote:

> As for the character classes, they can be generated quite easily from the
> UnicodeData.txt[1] file. We can get the general category[2] of each character
> from this file with something like
> (string->symbol (caddr (string-split line ";"))); then we just need to map the
> categories to the appropriate character classes (e.g. Lu belongs to upper,
> alpha, alphanum, graph, etc.) and merge characters of the same category into
> ranges when they have adjacent codes.
> It's quite easy to do. If I'm not lazy I'll do this this weekend.

Full unicode character classes and case handling
are already in the utf8 egg.
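
A hedged sketch of how those character classes might be used to classify a
single character. It assumes the utf8 egg is installed and that its
utf8-srfi-14 module exports Unicode-aware versions of the standard SRFI 14
sets (char-set:letter and friends); check the egg documentation before relying
on that. char-class is a hypothetical helper name.

```scheme
;; Assumption: utf8-srfi-14 (from the utf8 egg) provides the SRFI 14 names
;; used below with Unicode coverage.
(use utf8 utf8-srfi-14)

;; Map a character to a coarse class tag.
(define (char-class c)
  (cond ((char-set-contains? char-set:letter c) 'alpha)
        ((char-set-contains? char-set:digit c) 'num)
        ((char-set-contains? char-set:punctuation c) 'punct)
        ((char-set-contains? char-set:whitespace c) 'white)
        (else 'other)))

(char-class #\a)       ; an ASCII letter, so the tag is alpha
(char-class #\u044f)   ; Cyrillic "я"; alpha only if the set covers Cyrillic
```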

These are not yet integrated with irregex because
irregex is written to be portable across any Scheme,
and so it uses its own char-set implementation.  When
R7RS is released I'll re-package irregex accordingly.

Unfortunately, while the utf8 char-sets are very
compact, the DFA conversion of large, sparse Unicode
char-sets is quite large.  I'd like eventually to make
a non-backtracking NFA regex matcher which only
compiles to DFA when you really need the speed.

In the meantime, a fast lookup table for the
script of a character would be nice, and this could
be used to tokenize a string of mixed-language text.
I thought I had this and can't seem to find it anywhere...
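
To illustrate how a per-character lookup could drive tokenizing, here is a
small sketch in plain Scheme. split-runs and ascii-class are hypothetical
names; the classifier is a parameter, so a script table or a Unicode-aware
class lookup could be passed in instead. The demo classifier uses only the
standard ASCII-level predicates, and the sketch assumes string->list yields
whole characters (in CHICKEN that means loading the utf8 egg for non-ASCII
text).

```scheme
;; Group consecutive characters that share a class tag.
;; classify maps a character to a symbol; runs are returned in order
;; and nothing is discarded at the boundaries.
(define (split-runs classify str)
  (let loop ((chars (string->list str))
             (run '())    ; current run, reversed
             (tag #f)     ; class tag of the current run
             (out '()))   ; finished runs, reversed
    (define (flush)
      (if (null? run) out (cons (list->string (reverse run)) out)))
    (cond ((null? chars) (reverse (flush)))
          ((eq? (classify (car chars)) tag)
           (loop (cdr chars) (cons (car chars) run) tag out))
          (else
           (loop (cdr chars)
                 (list (car chars))
                 (classify (car chars))
                 (flush))))))

;; Demo classifier built from standard predicates only.
(define (ascii-class c)
  (cond ((char-alphabetic? c) 'alpha)
        ((char-numeric? c) 'num)
        ((char-whitespace? c) 'white)
        (else 'other)))

(split-runs ascii-class "abc 123, ok!")
;; => ("abc" " " "123" "," " " "ok" "!")
```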

-- 
Alex



Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want

2012-07-20 Thread Charles Hixson

On 07/20/2012 04:05 AM, Дмитрий wrote:

> Hello.
>
> Does IrRegex support Unicode character classes? E.g. Will IrRegex consider
> accented letters (á) or Cyrillic letters (я) as alpha? Will IrRegex consider
> Chinese wide space ( ) as space? Will IrRegex consider Chinese brackets
> (「」【】) as punct? If it doesn't, the regexp is going to be EXTREMELY messy
> [in fact, I believe it may be better to build such a regexp automatically
> then].
>
> I’m on Windows, so I can’t check it (when I use a UTF-8 console via chcp
> 65001, for some reason Chicken seems to fail on every operation with a
> non-ASCII string, even on a simple (display "Привет")).
>
> --
> Yours sincerely,
> Dmitry Kushnariov

   
As I said, I'm a neophyte.  My character classes were based around
[a-zA-Z]  etc.  So you can readily see why the pattern would have
quickly become unreasonably complex.  I didn't find any definition of
other character classes (well, not one that meant anything) and given
the discussion, I think that they wouldn't have worked if I'd gotten to
the point of testing them.


I was planning on using Chicken to learn scheme, since R7RS is supposed
to be based more on R5RS than on R6RS, but maybe it's better to learn
using Racket.  I *trust* the conversion won't be too difficult.  (I *do* 
need to use utf-8 in lots of places, and an incomplete implementation 
while I was learning would be ... unpleasant.  Particularly if the user 
documentation presumed that it *was* complete.)


--
Charles Hixson




Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want

2012-07-20 Thread Christian Kellermann
* Charles Hixson charleshi...@earthlink.net [120720 20:24]:
> As I said, I'm a neophyte.  My character classes were based around
> [a-zA-Z]  etc.  So you can readily see why the pattern would have
> quickly become unreasonably complex.  I didn't find any definition
> of other character classes (well, not one that meant anything) and
> given the discussion, I think that they wouldn't have worked if I'd
> gotten to the point of testing them.


The character classes for Latin-1 are documented at
http://api.call-cc.org/doc/srfi-14 and the utf8 egg contains, well, the
UTF-8 ones...

HTH,

Christian

--
9 out of 10 voices in my head say, that I am crazy,
one is humming.



Re: [Chicken-users] Neophyte in scheme: string-split not quite what I want

2012-07-20 Thread Alex Shinn
On Sat, Jul 21, 2012 at 3:19 AM, Charles Hixson
charleshi...@earthlink.net wrote:
> On 07/20/2012 04:05 AM, Дмитрий wrote:
>
>> Hello.
>>
>> Does IrRegex support Unicode character classes? E.g. Will IrRegex consider
>> accented letters (á) or Cyrillic letters (я) as alpha? Will IrRegex
>> consider Chinese wide space ( ) as space? Will IrRegex consider Chinese
>> brackets (「」【】) as punct? If it doesn't, the regexp is going to be
>> EXTREMELY messy [in fact, I believe it may be better to build such a regexp
>> automatically then].
>>
>> I’m on Windows, so I can’t check it (when I use a UTF-8 console via chcp
>> 65001, for some reason Chicken seems to fail on every operation with a
>> non-ASCII string, even on a simple (display "Привет")).
>>
>> --
>> Yours sincerely,
>> Dmitry Kushnariov
>
> As I said, I'm a neophyte.  My character classes were based around
> [a-zA-Z]  etc.  So you can readily see why the pattern would have quickly
> become unreasonably complex.  I didn't find any definition of other
> character classes (well, not one that meant anything) and given the
> discussion, I think that they wouldn't have worked if I'd gotten to the
> point of testing them.
>
> I was planning on using Chicken to learn scheme, since R7RS is supposed to
> be based more on R5RS than on R6RS, but maybe it's better to learn using
> Racket.  I *trust* the conversion won't be too difficult.  (I *do* need to
> use utf-8 in lots of places, and an incomplete implementation while I was
> learning would be ... unpleasant.  Particularly if the user documentation
> presumed that it *was* complete.)

The utf8 implementation is not incomplete.  It's just
not the default.

-- 
Alex
