Useful task -- Character properties

2005-05-04 Thread Patrick R. Michaud
On Tue, May 03, 2005 at 09:22:11PM +0100, Nicholas Clark wrote:
 
> Whilst I confess that it's unlikely to be me here, if anyone has the time
> to contribute some help, do you have a list of useful self-contained tasks
> that people might be able to take on?

Actually, overnight I realized there's a relatively good-sized
project that needs figuring out -- identifying character properties
such as isalpha, islower, isprint, etc.  Here I'll briefly sketch
how I'd like it to work, and maybe someone enterprising can take 
things from there for us.

Currently Parrot offers quite a few ops for character properties --
namely is_whitespace, is_wordchar, is_digit, etc. and their
find_XXX counterparts.  While these are useful, the set is also
incomplete -- at the moment I haven't found anything that lets
us find alphabetic, uppercase, lowercase, etc. properties.  (If I've
just overlooked something, please point it out!)

I suppose Parrot could add a bunch of new is_alpha, is_upper, 
is_lower, etc.  ops, but having separate opcodes for every 
property actually complicates the design of PGE a fair bit
and also produces a lot of very function-specific opcodes.  
What would *really* be useful would be to have three basic opcodes:

    is_cclass(out INT, in INT, in STR, in INT)
        Set $1 to 1 if the codepoint of $3 at position $4 is in
        the character class(es) given by $2.

    find_cclass(out INT, in INT, in STR, in INT, in INT)
        Set $1 to the offset of the first codepoint matching
        the character class(es) given by $2 in string $3, starting
        at offset $4 for up to $5 codepoints.  If no matching
        character is found, set $1 to -1.

    find_not_cclass(out INT, in INT, in STR, in INT, in INT)
        Set $1 to the offset of the first codepoint not matching
        the character class(es) given by $2 in string $3, starting
        at offset $4 for up to $5 codepoints.  If the substring
        consists entirely of matching characters, set $1 to -1.

The character classes in $2 above are given by an integer bitmask,
defined according to the following table (or something like it --
I took this table from ctype.h on my system, then added a newline 
class):

 0x0001 - uppercase char
 0x0002 - lowercase char
 0x0004 - alphabetic char
 0x0008 - numeric character
 0x0010 - hexadecimal digit
 0x0020 - whitespace
 0x0040 - printing
 0x0080 - graphical
 0x0100 - blank (i.e., SPC and TAB)
 0x0200 - control character
 0x0400 - punctuation character
 0x0800 - alphanumeric character
 0x1000 - newline character

We have 32 bits available, so we could extend this table as needed.
And EVENTUALLY we'll probably need a more general interface 
to handle Unicode properties as well as character class compositions, 
but I speculate that we can do those either in a library, or
(if speed is needed) we can build a character class PMC type 
optimized for charsets and have:

is_cclass(out INT, in PMC, in STR, in INT)
find_cclass(out INT, in PMC, in STR, in INT, in INT)
find_not_cclass(out INT, in PMC, in STR, in INT, in INT)

But for now the integer representation of character classes
ought to be sufficient.

Anyway, that's another very useful self-contained task that 
I'd be glad to have a volunteer for.

Pm


Re: Useful task -- Character properties

2005-05-04 Thread Dan Sugalski
At 10:21 AM -0500 5/4/05, Patrick R. Michaud wrote:
> On Tue, May 03, 2005 at 09:22:11PM +0100, Nicholas Clark wrote:
> > Whilst I confess that it's unlikely to be me here, if anyone has the time
> > to contribute some help, do you have a list of useful self-contained tasks
> > that people might be able to take on?
>
> Actually, overnight I realized there's a relatively good-sized
> project that needs figuring out -- identifying character properties
> such as isalpha, islower, isprint, etc.  Here I'll briefly sketch
> how I'd like it to work, and maybe someone enterprising can take
> things from

I'd planned on everything else going into constructed character 
classes. I'd figured the named classes would correspond to the major 
regex classes (things represented by \X sequences) while the 
constructed classes would handle everything else and more or less 
correspond to [] style sequences.
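
One way to picture a constructed class for [] style sequences is a
256-bit set with one bit per byte value.  The sketch below is only an
illustration under stated assumptions: the cset type and its functions
are hypothetical names, the spec syntax covers literals and "a-z"
ranges only (no negation, no escapes), and characters are single bytes.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical constructed class: one bit per byte value. */
typedef struct { unsigned char bits[32]; } cset;

static void cset_add(cset *cs, unsigned char c)
{
    cs->bits[c >> 3] |= (unsigned char)(1u << (c & 7));
}

/* 1 if c is a member of the class, else 0. */
int cset_has(const cset *cs, unsigned char c)
{
    return (cs->bits[c >> 3] >> (c & 7)) & 1;
}

/* Build a class from a spec such as "a-fA-F0-9_". */
void cset_from_spec(cset *cs, const char *spec)
{
    size_t i = 0, n = strlen(spec);
    memset(cs, 0, sizeof *cs);
    while (i < n) {
        unsigned char lo = (unsigned char)spec[i];
        if (i + 2 < n && spec[i + 1] == '-') {
            /* "lo-hi" range; an inverted range adds nothing */
            unsigned char hi = (unsigned char)spec[i + 2];
            unsigned int c;
            for (c = lo; c <= hi; c++)
                cset_add(cs, (unsigned char)c);
            i += 3;
        } else {
            /* single literal character, including a trailing '-' */
            cset_add(cs, lo);
            i += 1;
        }
    }
}
```

Membership is then a shift and a mask, which is why a set like this
would be cheap to consult from a find-style scanning loop.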

I thought I'd put in some docs to that effect, but apparently not. :(
--
Dan
--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Useful task -- Character properties

2005-05-04 Thread Patrick R. Michaud
On Wed, May 04, 2005 at 12:30:48PM -0400, Dan Sugalski wrote:
> At 10:21 AM -0500 5/4/05, Patrick R. Michaud wrote:
> > Actually, overnight I realized there's a relatively good-sized
> > project that needs figuring out -- identifying character properties
> > such as isalpha, islower, isprint, etc.  Here I'll briefly sketch
> > how I'd like it to work, and maybe someone enterprising can take
> > things from
>
> I'd planned on everything else going into constructed character 
> classes. I'd figured the named classes would correspond to the major 
> regex classes (things represented by \X sequences) while the 
> constructed classes would handle everything else and more or less 
> correspond to [] style sequences.

Makes sense.  But somehow the named class versions of the ops
don't give me quite as much coverage as I'd like -- for example,
I can use find_digit to measure off a sequence of non-digit
characters (e.g., rx { \D* } ), but there's not a corresponding
find_non_digit opcode to let me measure off a set of digits
(e.g., rx { \d* } ).  
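
To illustrate the gap: measuring a run of digits needs a scan that
stops at the first non-digit.  The C sketch below assumes a
hypothetical find_not_digit helper (standing in for the op that does
not exist) over single-byte strings, and shows how its -1 result turns
into a run length for rx { \d* }.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for the missing op: offset of the first
 * non-digit at or after start, or -1 if only digits remain. */
long find_not_digit(const char *s, size_t len, size_t start)
{
    size_t i;
    for (i = start; i < len; i++)
        if (!isdigit((unsigned char)s[i]))
            return (long)i;
    return -1;
}

/* Number of characters that \d* would consume at pos. */
size_t digits_run(const char *s, size_t len, size_t pos)
{
    long stop;
    if (pos >= len) return 0;
    stop = find_not_digit(s, len, pos);
    /* -1 means the digits run to the end of the string */
    return stop < 0 ? len - pos : (size_t)stop - pos;
}
```

The -1 case is the one to watch: it means "everything matched", so the
run length is the rest of the string, not zero.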

We'll still need a way to make constructed character classes
for upper, lower, and the like.  But I (or someone else) can 
probably build that component in PIR for now, just hardcoding the ASCII or
Latin-1 tables for the time being until we come up with something
else later.
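
A hardcoded Latin-1 table of the kind mentioned above might look like
the following C sketch (the eventual component would be PIR; C is used
here only for illustration).  The names are hypothetical, only the
upper, lower and alphabetic bits are filled in, and a few stray
letters (0xAA, 0xB5, 0xBA) are left unclassified for brevity.

```c
/* Hypothetical class bits, matching the first three table entries. */
enum { CC_UPPER = 0x0001, CC_LOWER = 0x0002, CC_ALPHA = 0x0004 };

static unsigned short latin1_table[256];
static int latin1_ready = 0;

/* Set the given bits for every byte value in [lo, hi]. */
static void mark(unsigned int lo, unsigned int hi, unsigned short bits)
{
    unsigned int c;
    for (c = lo; c <= hi; c++)
        latin1_table[c] |= bits;
}

static void latin1_init(void)
{
    mark('A', 'Z', CC_UPPER | CC_ALPHA);
    mark('a', 'z', CC_LOWER | CC_ALPHA);
    mark(0xC0, 0xD6, CC_UPPER | CC_ALPHA);  /* skips 0xD7, the multiplication sign */
    mark(0xD8, 0xDE, CC_UPPER | CC_ALPHA);
    mark(0xDF, 0xF6, CC_LOWER | CC_ALPHA);  /* skips 0xF7, the division sign */
    mark(0xF8, 0xFF, CC_LOWER | CC_ALPHA);
    latin1_ready = 1;
}

/* 1 if byte c is in any class named by flags, else 0. */
int latin1_is(int flags, unsigned char c)
{
    if (!latin1_ready)
        latin1_init();
    return (latin1_table[c] & flags) != 0;
}
```

A lookup is a single table index plus a mask, so the same table can
serve every class without a separate routine per property.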

Pm


Re: Useful task -- Character properties

2005-05-04 Thread Leopold Toetsch
Patrick R. Michaud wrote:
[ see below for some more ]
> Actually, overnight I realized there's a relatively good-sized
> project that needs figuring out -- identifying character properties
> such as isalpha, islower, isprint, etc.  Here I'll briefly sketch
> how I'd like it to work, and maybe someone enterprising can take 
> things from there for us.
>
> Currently Parrot offers quite a few ops for character properties --
> namely is_whitespace, is_wordchar, is_digit, etc. and their
> find_XXX counterparts.  While these are useful, the set is also
> incomplete -- at the moment I haven't found anything that lets
> us find alphabetic, uppercase, lowercase, etc. properties.  (If I've
> just overlooked something, please point it out!)
>
> I suppose Parrot could add a bunch of new is_alpha, is_upper, 
> is_lower, etc.  ops, but having separate opcodes for every 
> property actually complicates the design of PGE a fair bit
> and also produces a lot of very function-specific opcodes.  
> What would *really* be useful would be to have three basic opcodes:
>
>     is_cclass(out INT, in INT, in STR, in INT)
>         Set $1 to 1 if the codepoint of $3 at position $4 is in
>         the character class(es) given by $2.
>
>     find_cclass(out INT, in INT, in STR, in INT, in INT)
>         Set $1 to the offset of the first codepoint matching
>         the character class(es) given by $2 in string $3, starting
>         at offset $4 for up to $5 codepoints.  If no matching
>         character is found, set $1 to -1.
>
>     find_not_cclass(out INT, in INT, in STR, in INT, in INT)
>         Set $1 to the offset of the first codepoint not matching
>         the character class(es) given by $2 in string $3, starting
>         at offset $4 for up to $5 codepoints.  If the substring
>         consists entirely of matching characters, set $1 to -1.
>
> The character classes in $2 above are given by an integer bitmask,
> defined according to the following table (or something like it --
> I took this table from ctype.h on my system, then added a newline 
> class):
>
>  0x0001 - uppercase char
>  0x0002 - lowercase char
>  0x0004 - alphabetic char
>  0x0008 - numeric character
>  0x0010 - hexadecimal digit
>  0x0020 - whitespace
>  0x0040 - printing
>  0x0080 - graphical
>  0x0100 - blank (i.e., SPC and TAB)
>  0x0200 - control character
>  0x0400 - punctuation character
>  0x0800 - alphanumeric character
>  0x1000 - newline character
>
> We have 32 bits available, so we could extend this table as needed.
> And EVENTUALLY we'll probably need a more general interface 
> to handle Unicode properties as well as character class compositions, 
> but I speculate that we can do those either in a library, or
> (if speed is needed) we can build a character class PMC type 
> optimized for charsets and have:
>
>     is_cclass(out INT, in PMC, in STR, in INT)
>     find_cclass(out INT, in PMC, in STR, in INT, in INT)
>     find_not_cclass(out INT, in PMC, in STR, in INT, in INT)
>
> But for now the integer representation of character classes
> ought to be sufficient.

For hysterical raisins we actually already have two char class 
interfaces (partially) implemented, e.g.

src/string.c:
  Parrot_string_is_digit(Interp *interpreter, STRING *s, INTVAL offset)
src/string_primitives.c:
  Parrot_char_is_digit(Interp *interpreter, UINTVAL character)

The former is covered by an opcode in ops/string.ops and is the more 
useful form, taking a string and an offset. The latter, OTOH, can call the 
ICU function if ICU is present.

To clean up that mess, we stick to Patrick's plan, which implies, in no 
specific order:

- implement the new opcodes, first in experimental.ops
- create an enum of the char classes in charset.h
- create the general API in that header too
- convert existing charset classifying tables to the new bits
- move the ICU functions to charset/unicode.c
- deprecate existing opcodes and APIs
- clean up string_primitives.*
- convert existing tests
- write new tests
- write more new tests
- all I've forgotten to list

See also: src/  string.c string_primitives.c
  include/parrot/  charset.h string_primitives.h string_funcs.h
  charset/  *.c *.h   [1]
  ops/  string.ops
  t  op/string_cs.t

[1] especially char typetable[] and usage of it
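
As a rough picture of what a general API in a charset header could
look like, here is a minimal C sketch in which each charset supplies
its own classifier and the string-level call dispatches on the
string's charset.  All names (charset, str, string_is_cclass, the CC_*
bits) are hypothetical and are not taken from Parrot's actual headers;
only two classes are wired up.

```c
#include <stddef.h>

/* Hypothetical class bits, using values from the proposed table. */
enum { CC_NUMERIC = 0x0008, CC_WHITESPACE = 0x0020 };

/* Each charset provides its own classifier for a single codepoint. */
typedef struct charset {
    const char *name;
    int (*is_cclass)(int flags, unsigned long codepoint);
} charset;

/* A string carries a pointer to the charset it is encoded in. */
typedef struct {
    const charset *cs;
    const unsigned char *data;
    size_t len;
} str;

static int ascii_is_cclass(int flags, unsigned long cp)
{
    int m = 0;
    if (cp >= '0' && cp <= '9')
        m |= CC_NUMERIC;
    if (cp == ' ' || (cp >= '\t' && cp <= '\r'))  /* TAB, LF, VT, FF, CR */
        m |= CC_WHITESPACE;
    return (m & flags) != 0;
}

const charset ascii_charset = { "ascii", ascii_is_cclass };

/* String-plus-offset form: dispatch to the string's own charset. */
int string_is_cclass(int flags, const str *s, size_t offset)
{
    if (offset >= s->len)
        return 0;
    return s->cs->is_cclass(flags, s->data[offset]);
}
```

Under this shape, a Unicode charset could plug in an ICU-backed
classifier behind the same function pointer while callers keep using
the string-and-offset form.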
> Anyway, that's another very useful self-contained task that 
> I'd be glad to have a volunteer for.

Yep.

> Pm

leo


Re: Character Properties

2002-10-22 Thread Erik Steven Harrison
 
On Mon, 21 Oct 2002 16:49:57, Dan Sugalski wrote:
> Almost. At least perl 5's macros look like C. Emacs' macro horrors 
> make C look like Lisp...

This is because C is _clearly_ a dialect of Lisp . . . 

-Erik




Re: Character Properties

2002-10-22 Thread Larry Wall
On Tue, 22 Oct 2002, Erik Steven Harrison wrote:
: On Mon, 21 Oct 2002 16:49:57  
:  Dan Sugalski wrote:
: 
: Almost. At least perl 5's macros look like C. Emacs' macro horrors 
: make C look like Lisp...
: 
: This is because C is _clearly_ a dialect of Lisp . . . 

Yeah, look at all the extra parentheses around things like
conditionals and argument lists...

Larry




Re: Character Properties

2002-10-21 Thread Luke Palmer
On Mon, 21 Oct 2002 11:37:51 -0400, Dan Sugalski wrote:
> At 11:09 PM -0600 10/20/02, Luke Palmer wrote:
> > What's the plan on having properties, or attributes (depending on how
> > far we're taking it), on individual characters in a string?  I think
> > it's an essential feature, as Lisp has shown us.  If there's an
> > argument otherwise, I'm all ears.
>
> While they're certainly useful, I think essential's an awfully strong 
> word there. You'll note that, just off the top of my head, C, BASIC, 
> Fortran, Perl, Python, Java, Ruby, Pascal, Oberon, Modula (2 and 3), 
> Forth, Eiffel, Haskell, BLISS, C++, C#, COBOL, PL/I, APL, B, and BCPL 
> all don't do character properties/attributes.
> -- 
>  Dan

Fair enough.  Then tell me how you solve this problem: You have a text
file in a string, that the user has marked several places in.  He's
referring to words for which he wants to keep bookmarks in.  Now, he
deletes text (using substr), and we want to keep the marks relative to
the words, not their positions.  This seems easy, yet there's not
necessarily an easy way to do it.  Uh oh, violating perl philosophy :)

Ok, how about this:  Is there a reason *not* to?  Or should I not go
there?
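
For concreteness, the bookkeeping behind the bookmark problem can be
sketched in C as below.  This is only an illustration under
assumptions: marked_str and its functions are hypothetical names, the
buffer and mark count are fixed-size, and a mark that falls inside
deleted text is assumed to collapse to the start of the deletion.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical string that carries bookmarks as byte offsets. */
typedef struct {
    char   text[256];
    size_t len;
    size_t marks[8];
    size_t nmarks;
} marked_str;

void ms_init(marked_str *ms, const char *s)
{
    size_t n = strlen(s);
    if (n >= sizeof ms->text)
        n = sizeof ms->text - 1;       /* truncate to fit the buffer */
    memcpy(ms->text, s, n);
    ms->text[n] = '\0';
    ms->len = n;
    ms->nmarks = 0;
}

/* Add a bookmark at pos; returns its index, or -1 if no room. */
int ms_mark(marked_str *ms, size_t pos)
{
    if (ms->nmarks >= 8 || pos > ms->len)
        return -1;
    ms->marks[ms->nmarks] = pos;
    return (int)ms->nmarks++;
}

/* Delete n bytes at start and shift the bookmarks along with the text. */
void ms_delete(marked_str *ms, size_t start, size_t n)
{
    size_t i;
    if (start >= ms->len)
        return;
    if (n > ms->len - start)
        n = ms->len - start;
    /* the +1 moves the terminating NUL as well */
    memmove(ms->text + start, ms->text + start + n,
            ms->len - start - n + 1);
    ms->len -= n;
    for (i = 0; i < ms->nmarks; i++) {
        if (ms->marks[i] >= start + n)
            ms->marks[i] -= n;         /* after the cut: slide left */
        else if (ms->marks[i] > start)
            ms->marks[i] = start;      /* inside the cut: collapse */
    }
}
```

Every deletion walks the whole mark list, which gives some idea of the
cost if every string had to carry this machinery.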

Luke



Re: Character Properties

2002-10-21 Thread Rafael Garcia-Suarez
Dan Sugalski wrote:
> And, FWIW, emacs is written in C. Granted a much macro-mutated 
> version of C, but C nonetheless.

Just like Perl 5 ;-)



RE: Character Properties

2002-10-21 Thread David Whipp
Jonathan Scott Duff wrote:
> > Ok, how about this:  Is there a reason *not* to?  Or 
> > should I not go there?
>
> Off hand, it sounds expensive. I don't see a way to only let 
> the people who use it incur the penalty, but my vision isn't
> the best in the world.

It should be possible to define the bookmark methods on the basic string
class to rebless the object onto a more powerful subclass. This way, there
is no overhead until the extra information is actually attached. (bless, not
copy, because there may be other references to the string).

Dave.



Re: Character Properties

2002-10-21 Thread Simon Cozens
[EMAIL PROTECTED] (David Whipp) writes:
> It should be possible to define the bookmark methods on the basic string
> class to rebless the object onto a more powerful subclass. 

That makes it a doubly good candidate for modulehood.

-- 
It's 106 miles from Birmingham, we've got an eighth of a tank of gas,
half a pack of Dorritos, it's dusk, and we're wearing contacts.
- Malcolm Ray



Re: Character Properties

2002-10-21 Thread Dan Sugalski
At 10:53 AM -0700 10/21/02, Austin Hastings wrote:

> Yeah, but emacs isn't written in any of those languages.


What, you're using emacs as an argument *for* something? :-P

And, FWIW, emacs is written in C. Granted a much macro-mutated 
version of C, but C nonetheless.




Re: Character Properties

2002-10-21 Thread Dan Sugalski
At 2:20 PM -0600 10/21/02, Luke Palmer wrote:

> On Mon, 21 Oct 2002 11:37:51 -0400, Dan Sugalski wrote:
> > At 11:09 PM -0600 10/20/02, Luke Palmer wrote:
> > > What's the plan on having properties, or attributes (depending on how
> > > far we're taking it), on individual characters in a string?  I think
> > > it's an essential feature, as Lisp has shown us.  If there's an
> > > argument otherwise, I'm all ears.
> >
> > While they're certainly useful, I think essential's an awfully strong
> > word there. You'll note that, just off the top of my head, C, BASIC,
> > Fortran, Perl, Python, Java, Ruby, Pascal, Oberon, Modula (2 and 3),
> > Forth, Eiffel, Haskell, BLISS, C++, C#, COBOL, PL/I, APL, B, and BCPL
> > all don't do character properties/attributes.
>
> Fair enough.  Then tell me how you solve this problem: You have a text
> file in a string, that the user has marked several places in.  He's
> referring to words for which he wants to keep bookmarks in.  Now, he
> deletes text (using substr), and we want to keep the marks relative to
> the words, not their positions.  This seems easy, yet there's not
> necessarily an easy way to do it.  Uh oh, violating perl philosophy :)


I didn't call the problem unreasonable, I was objecting to its 
characterization as an essential feature. It isn't. A useful thing, 
definitely, but there are a lot of those. It's hardly essential any 
more than, say, a hash that automagically maps to the current 
directory's files (iteratively, of course, catching all the 
subdirectories) is essential.

While perl is a language that makes it easy to do useful things, it 
doesn't mean that all useful things should be easy to do in perl. 
Given how large the set of Useful Things is, that's not unreasonable.
--
Dan



Re: Character Properties

2002-10-21 Thread Dan Sugalski
At 7:22 PM + 10/21/02, Rafael Garcia-Suarez wrote:

> Dan Sugalski wrote:
> > And, FWIW, emacs is written in C. Granted a much macro-mutated
> > version of C, but C nonetheless.
>
> Just like Perl 5 ;-)

Almost. At least perl 5's macros look like C. Emacs' macro horrors 
make C look like Lisp...
--
Dan



Re: Character Properties

2002-10-21 Thread Jonathan Scott Duff
On Mon, Oct 21, 2002 at 02:20:56PM -0600, Luke Palmer wrote:
> Fair enough.  Then tell me how you solve this problem: You have a text
> file in a string, that the user has marked several places in.  He's
> referring to words for which he wants to keep bookmarks in.  Now, he
> deletes text (using substr), and we want to keep the marks relative to
> the words, not their positions.  This seems easy, yet there's not
> necessarily an easy way to do it.  Uh oh, violating perl philosophy :)

Sounds like a good candidate for modulehood.

> Ok, how about this:  Is there a reason *not* to?  Or should I not go
> there?

Off hand, it sounds expensive. I don't see a way to only let the people
who use it incur the penalty, but my vision isn't the best in the world.

-Scott
-- 
Jonathan Scott Duff
[EMAIL PROTECTED]



Re: Character Properties

2002-10-21 Thread Luke Palmer
> I didn't call the problem unreasonable, I was objecting to its 
> characterization as an essential feature. It isn't. A useful thing, 
> definitely, but there are a lot of those. It's hardly essential any 
> more than, say, a hash that automagically maps to the current 
> directory's files (iteratively, of course, catching all the 
> subdirectories) is essential

I see what you mean now.  I had A Momentary Lapse of Reason, in which
I forgot modules could do such things.  It's very suited to a
module---not very common, but very important to certain problems.

Luke



Character Properties

2002-10-20 Thread Luke Palmer
What's the plan on having properties, or attributes (depending on how
far we're taking it), on individual characters in a string?  I think
it's an essential feature, as Lisp has shown us.  If there's an
argument otherwise, I'm all ears.

Luke