Re: [perl #28393] [PATCH] Tcl pmcs

2004-04-13 Thread Leopold Toetsch
Will Coleda <[EMAIL PROTECTED]> wrote:
> Did the makefile change make it in?

Not the hacks to get it running. As I said, the way to go is to concatenate
the sources and chain the loadlib _load() functions. This has two
advantages: no platform dependent linker troubles to resolve symbols of
multiple shared libs, and less resource usage for library loading.

leo


Re: [perl #28502] [PATCH] dynclasses/README

2004-04-13 Thread Leopold Toetsch
Will Coleda <[EMAIL PROTECTED]> wrote:

> Here's an updated version of dynclasses/README that sums up recent
> notes, and PODifies the doc.

Thanks, applied (2nd version).
leo



Re: Plans for string processing

2004-04-13 Thread Jarkko Hietaniemi
Matt Fowles wrote:

> Dan~
> 
> I know that you are not technically required to defend your position, 
> but I would like an explanation of one part of this plan.
> 
> Dan Sugalski wrote:
> 
>>4) We will *not* use ICU for core functions. (string to number or number 
>>to string conversions, for example)
> 
> 
> Why not?  It seems like we would just be reinventing a rather large 
> wheel here.

Without having looked at what ICU supplies in this department I would
guess it's simply because of the overhead.  atoi() is probably quite a
bit faster than pulling in the full support for TIBETAN HALF THREE.

(Though to be honest I think Parrot shouldn't rely on atoi() or any
of those guys: Perl 5 has taught us not to put too much trust in them.
Perl 5 these days parses all the integer formats itself.)



Re: [perl #28494] [PATCH] unescape strings

2004-04-13 Thread Jeff Clites
On Apr 12, 2004, at 9:54 AM, Leopold Toetsch (via RT) wrote:

> # New Ticket Created by  Leopold Toetsch
> # Please include the string:  [perl #28494]
> # in the subject line of all future correspondence about this issue.
> # <http://rt.perl.org:80/rt3/Ticket/Display.html?id=28494>
>
> Attached patch:
> * adds a new test file for Unicode-related string tests
> * reimplements string_unescape_cstring, which now uses ICU for the work
> * fixes a bug in string_compare with equal-length strings
>
> It's also by far more efficient than the old code.
>
> TODO: move it out of string.c, docs.
>
> Jeff, please have a look at it.
It looks very similar to what I had come up with. The only important 
differences are:

1) My version handles the case of code points > 0x as well. (The 
string_append_chr function encapsulates the logic of dealing with the 
"anything above 0xFF" case, but needs to be rewritten to improve 
efficiency.)

2) When I was implementing the previous version of 
string_unescape_cstring, I'm pretty sure I had a reason for doing that 
string_constant_copy at the end, rather than creating a constant string 
at the beginning. I'm not recalling 100% why, but I believe that there 
were problems in the case where the string has to expand its storage 
because there are characters > 0xFF, if had been created as a constant.

Just a tiny note:

instead of this:

    result->bufused = d * (had_int16 ? 2 : 1);

you can do this:

    result->bufused = string_max_bytes(interpreter, result, result->strlen);

to update the bufused to match strlen.

I'm attaching a patch which contains the version I had written, and 
also includes my changes from [perl #28473], which I didn't see make it 
to the list. Take a look, and you can probably take the best parts of 
both--I'm sure there are a few places where your version is more 
efficient. (Also, I have the couple of bits which call directly into 
the ICU API factored out into string_primitives.c)

BTW, I have some benchmarks that I will clean up and send in to go with 
your tests.

JEff



unescaping-and-icu-config.patch
Description: Binary data



compile, invoke and then something else

2004-04-13 Thread Bernhard Schmalhofer
Hi,

I am trying to implement the 'eval' macro in Parrot m4. The Parrot m4
interpreter is implemented in PIR. The 'eval' macro is a simple interpreter for
integer arithmetic and thus forms a micro language within a mini language.

For implementing the 'eval' macro I took the following approach:

i. Implement an 'm4_eval_compiler' in C, based on 'examples/compilers/japh.c'
ii. When encountering something like

eval(`1 + 1')

I extract the string "1 + 1" and compile it.
iii. The compiler returns a .Sub which can be invoked.

In a test script this looks like:
 
.sub _main
.local pmc m4_eval_compiler_lib
m4_eval_compiler_lib = loadlib "m4_eval_compiler"
compreg P1, "m4_eval_compiler"
.local string code
code = '1 + 1'
.local pmc compiled_code
compiled_code = compile P1, code
invoke compiled_code
AFTER_INVOCATION: print "compiled sub has been invoked\n"
end
.end

This works almost as expected. The problem is that I never get to the label
AFTER_INVOCATION. So here are my questions:

Is there a way to tell 'compiled_code' to continue with 'AFTER_INVOCATION' when
it is invoked?
This might be analogous to newsub(out PMC, in INT, labelconst INT).
Can I use 'invokecc' for that?
Am I completely on the wrong track?
How can I retrieve return values from 'compiled_code'?

CU, Bernhard





[perl #28426] Failed test on Fedora linux

2004-04-13 Thread via RT
# New Ticket Created by  Walter G 
# Please include the string:  [perl #28426]
# in the subject line of all future correspondence about this issue. 
# <http://rt.perl.org:80/rt3/Ticket/Display.html?id=28426>


Hi:

I downloaded the latest Parrot distribution via CVS
and installed it on my system. I'm using Fedora Linux
on a x86 platform. However, I'm getting 1 failed test;
I haven't seen this failed test anywhere else on the
mail list.

Here's the failed test output:

# Looks like you failed 1 tests of 17.
t/pmc/object-meths..dubious
Test returned status 1 (wstat 256, 0x100)
Scalar found where operator expected at (eval 157)
line 1, near "'int'  $__val"
(Missing operator before   $__val?)
DIED. FAILED test 17
Failed 1/17 tests, 94.12% okay

Here's the contents of the myconfig file in my parrot
directory:
Summary of my parrot 0.1.0 configuration:
  configdate='Wed Apr  7 22:52:36 2004'
  Platform:
osname=linux, archname=i386-linux-thread-multi
jitcapable=1, jitarchname=i386-linux,
jitosname=LINUX, jitcpuarch=i386
execcapable=1
perl=/usr/bin/perl
  Compiler:
cc='gcc', ccflags='-D_REENTRANT -D_GNU_SOURCE
-DTHREADS_HAVE_PIDS -DDEBUGGING  -I/usr/local/include
-D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64
-I/usr/include/gdbm',
  Linker and Libraries:
ld='gcc', ldflags=' -L/usr/local/lib',
cc_ldflags='',
libs='-lnsl -ldl -lm -lcrypt -lutil -lpthread'
  Dynamic Linking:
so='.so', ld_shared='-shared -L/usr/local/lib',
ld_shared_flags=''
  Types:
iv=long, intvalsize=4, intsize=4, opcode_t=long,
opcode_t_size=4,
ptrsize=4, ptr_alignment=4 byteorder=1234,
nv=double, numvalsize=8, doublesize=8

Is this failed test fatal? Has anyone else seen this
failed test?

Thanks,
Walt




Re: Unicode step by step

2004-04-13 Thread Marcus Thiesen
On Saturday 10 April 2004 15:13, Leopold Toetsch wrote:
> There is of course still the question: Should we really have ICU in the
> tree. This needs tracking updates and patching (again) to make it build
> and so on.

For the sake of platform independence I'd say keep it there. It's far easier
to quickly test on different platforms if you have only the usual build
dependencies plus the one special thing inside the tree.
What I want to say is that you'll find a sane build environment and a Perl on
most machines, but even I don't have ICU installed.
BTW, it doesn't compile on any platform at the moment; after a realclean,
the first "make" complains about
../data/locales/ja.txt:15: parse error. Stopped parsing with 
U_INVALID_FORMAT_ERROR
couldn't parse the file ja.txt. Error:U_INVALID_FORMAT_ERROR
make[1]: *** [../data/out/build/icudt26l_ja.res] Error 3

If you do a make at this point again, it skips these steps and tries to link
parrot, failing on many undefined symbols, I believe from the non-existent
ICU.

> Thanks,
> leo

Have fun,
Marcus

-- 
 :: Marcus Thiesen :: www.thiesen.org :: ICQ#108989768 :: 0x754675F2 :: 

I can resist anything but temptation
  Oscar Wilde


pgp0.pgp
Description: signature


[perl #28473] [PATCH] ICU data directory configuration

2004-04-13 Thread via RT
# New Ticket Created by  Jeff Clites 
# Please include the string:  [perl #28473]
# in the subject line of all future correspondence about this issue. 
# <http://rt.perl.org:80/rt3/Ticket/Display.html?id=28473>


Here's a patch to make the location of ICU's data files configurable, 
and also to cause parrot to throw an exception at string_init time if 
the data files are not found.



icu-configuration.patch
Description: Binary data


Two more ICU build issues

2004-04-13 Thread Marcus Thiesen
Hi,

I noted two more things connected to ICU building on different platforms. 
One thing is that the ICU build process is quite keen on using "gmake" for
building, even hard-coding its location and saying something like
"You must use /usr/local/bin/gmake to build ICU".
On OpenBSD, where I just used "make" as on any other platform, this broke
because the makefiles didn't work correctly.
Another thing, which is not really bad but which I ran into at the moment, is
that on the system where I run my Cygwin tests the homedir is actually named
"/home/Gerd & Jutta", after my father and his girlfriend, who own the
machine; it is a perfectly valid Windows username. Everything worked fine
till now, but the ICU scripts don't seem to cope with whitespace and "&"
very well: mkinstalldirs chomps off everything after the first whitespace,
leading to a failing installation.
Have fun,
Marcus

-- 
 :: Marcus Thiesen :: www.thiesen.org :: ICQ#108989768 :: 0x754675F2 :: 

More than any other time in history, mankind faces a crossroads. One path 
leads to despair and utter hopelessness. The other, to total extinction. Let 
us pray we have the wisdom to choose correctly
  Woody Allen




Re: Unicode step by step

2004-04-13 Thread Jeff Clites
> BTW, it doesn't compile on any platform at the moment; after a realclean,
> the first "make" complains about
> ../data/locales/ja.txt:15: parse error. Stopped parsing with
> U_INVALID_FORMAT_ERROR
> couldn't parse the file ja.txt. Error:U_INVALID_FORMAT_ERROR
> make[1]: *** [../data/out/build/icudt26l_ja.res] Error 3

Try a "make realclean" first--Dan checked in a fix for this, and it seems
a realclean is needed to force everything to start fresh.

> If you do a make at this point again, it skips these steps and tries to
> link parrot, failing on many undefined symbols, I believe from the
> non-existent ICU.

At this point I'd expect it to link, but maybe not run well--that
failure comes when packaging up the data files, and at that point the
libraries themselves should already be built and in the right
place. But you are detecting some "loose" behavior in the Makefile,
which was done in part so that ICU wouldn't rebuild unless you "make
clean".

JEff



Re: compile, invoke and then something else

2004-04-13 Thread Leopold Toetsch
Bernhard Schmalhofer <[EMAIL PROTECTED]> wrote:
> Hi,

> I am trying to implement the 'eval' macro im Parrot m4. The Parrot m4
> interpreter is implemented in PIR. The 'eval' is a simple interpreter for
> integer artithmetic and forms thus a micro language within a mini language.

> Can I use 'invokecc' for that?

Sure. It works with japh16.

> How can I retrieve return values from 'compiled_code'?

Should work like with any other subroutine that follows PCC (pdd03).
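
For illustration, a sketch of what the calling side could look like with
invokecc (the register and return-value details here are assumptions based
on the PCC conventions of pdd03 at the time, not tested against this exact
Parrot version):

```pir
.sub _main
    .local pmc compiled_code
    # ... obtain compiled_code via compreg/compile as in the original example ...

    # invokecc creates and saves a return continuation automatically,
    # so execution continues at the next op when the callee returns
    invokecc compiled_code

    # assumption: an integer return value arrives in I5 per the old PCC
    print I5
    print "\n"
    end
.end
```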

> CU, Bernhard

leo


Re: [perl #28494] [PATCH] unescape strings

2004-04-13 Thread Leopold Toetsch
Jeff Clites <[EMAIL PROTECTED]> wrote:

> On Apr 12, 2004, at 9:54 AM, Leopold Toetsch (via RT) wrote:

> It looks very similar to what I had come up with. The only important
> differences are:

> 1) My version handles the case of code points > 0x as well. (The
> string_append_chr function encapsulates the logic of dealing with the
> "anything above 0xFF" case, but needs to be rewritten to improve
> efficiency.)

Yep. Using the string_append_chr() function for setting chars/UChars in
an existing buffer is overkill. It allocates a new string for each char.
We have a maximum length for the 1/2/4 byte encodings. Unescaping
doesn't create longer strings, so we can always safely fill an existing
buffer (given that it's upscaled beforehand if needed).

Anyway. We'll need 2 versions of unescape: one with ICU/Unicode and one
without. The latter will only deal with chars <= 0xff.

BTW we'll need a "not a STRING" encoding too. We need some means for
transparently handling e.g. frozen bytecode. We must ensure that such a
frozen image goes in and out unaltered.

> 2) When I was implementing the previous version of
> string_unescape_cstring, I'm pretty sure I had a reason for doing that
> string_constant_copy at the end, rather than creating a constant string
> at the beginning. I'm not recalling 100% why, but I believe that there
> were problems in the case where the string has to expand its storage
> because there are characters > 0xFF, if had been created as a constant.

No problem with growing here. "constant" here just means, that the
string is allocated in the constant string header pool. The only
difference is that this pool isn't scanned for dead strings during the
collect phase of DOD.

The reason might be that currently the only usage of string_unescape is
from inside imcc/pbc, where constant strings are generated for the
constant table. This usage of the function is a bit special. So we might
pass in 2 more parameters to string_unescape:

  flags ... PObj_constant_FLAG yes/no
  "uconv"  ... e.g. "iso-8859-15" or what not

I currently have a modified version of string_unescape that can deal (or
should finally, once all bugs are gone ;) with input strings like:

   "¤"  # currency sign but when seen as latin9 character
# then it's euro sign

PASM/PIR syntax could be something like:

  :iso-8859-15:"a string ¤"

> Just a tiny note:

> instead of this:
>   result->bufused = d * (had_int16 ? 2 : 1);

> you can do this:
>   result->bufused = string_max_bytes(interpreter, result, result->strlen);

Yep. Thanks.

> I'm attaching a patch which contains the version I had written, and
> also includes my changes from [perl #28473], which I didn't see make it
> to the list. Take a look, and you can probably take the best parts of
> both--I'm sure there are a few places where your version is more
> efficient. (Also, I have the couple of bits which call directly into
> the ICU API factored out into string_primitives.c)

I'll merge the relevant bits.

> BTW, I have some benchmarks that I will clean up and send in to go with
> your tests.

Good. Thanks.

> JEff

leo


Re: Two more ICU build issues

2004-04-13 Thread Jeff Clites
On Apr 12, 2004, at 4:40 AM, Marcus Thiesen wrote:

> Another thing, which is not really bad but which I ran into at the moment,
> is that on the system where I run my Cygwin tests the homedir is actually
> named "/home/Gerd & Jutta", after my father and his girlfriend, who own
> the machine; it is a perfectly valid Windows username. Everything worked
> fine till now, but the ICU scripts don't seem to cope with whitespace and
> "&" very well: mkinstalldirs chomps off everything after the first
> whitespace, leading to a failing installation.
I had tried to guard against that case by making sure that the install 
path was quoted properly for passing to ICU's configure, but probably 
after that it doesn't end up quoted in the ICU Makefiles, and you get 
that behavior. It's an annoying side effect of not being able to give 
the ICU configure a relative path to install to (if you do that, things 
end up in the wrong place). I suppose we could have ICU not install, 
and move the libraries into place manually, though that may lead to 
other problems.

But problems like that are dangerous--if something tries to delete a 
directory, you can end up removing much more than you intended. 
(Ironically, that's exactly the sort of problem that Perl itself never 
has, but shell scripts do.) But fortunately, I don't think anything 
here should be trying to do any deletes via that full path.

JEff



Re: Unicode step by step

2004-04-13 Thread luka frelih
just a confirmation...
my i386 debian linux gives the same error repeatedly after make 
realclean,
if i make again, it compiles a broken parrot which fails (too) many 
tests...

also it seems (to me) that parrot's configured choice of compiler, 
linker, ... is not used in building icu?

does icu have some non-ubiquitous dependencies?

LF

> ../data/locales/ja.txt:15: parse error. Stopped parsing with
> U_INVALID_FORMAT_ERROR
> couldn't parse the file ja.txt. Error:U_INVALID_FORMAT_ERROR
> make[1]: *** [../data/out/build/icudt26l_ja.res] Error 3
>
> Try a "make realclean" first--Dan checked in a fix for this, and it seems
> a realclean is needed to force everything to start fresh.
>
> If you do a make at this point again, it skips these steps and tries to
> link parrot, failing on many undefined symbols, I believe from the
> non-existent ICU.
>
> At this point I'd expect it to link, but maybe not run well--that
> failure comes when packaging up the data files, and at that point the
> libraries themselves should already be built and in the right place. But
> you are detecting some "loose" behavior in the Makefile, which was done
> in part so that ICU wouldn't rebuild unless you "make clean".




Re: semantic and implementation of pairs

2004-04-13 Thread Stéphane Payrard
I confused assignment and initialization in my previous
mail. Because they are two different operations, there is no
problem with them having different semantics. A6 described both
operations: it described pairs as arguments used to initialize
parameters, and pairs in assignment.

--
  stef


Re: Plans for string processing

2004-04-13 Thread Dan Sugalski
At 10:42 AM +0300 4/13/04, Jarkko Hietaniemi wrote:
>Matt Fowles wrote:
>> Dan~
>>
>> I know that you are not technically required to defend your position,
>> but I would like an explanation of one part of this plan.
>>
>> Dan Sugalski wrote:
>>> 4) We will *not* use ICU for core functions. (string to number or number
>>> to string conversions, for example)
>>
>> Why not?  It seems like we would just be reinventing a rather large
>> wheel here.
>
>Without having looked at what ICU supplies in this department I would
>guess it's simply because of the overhead.  atoi() is probably quite a
>bit faster than pulling in the full support for TIBETAN HALF THREE.
>
>(Though to be honest I think Parrot shouldn't rely on atoi() or any
>of those guys: Perl 5 has taught us not to put too much trust in them.
>Perl 5 these days parses all the integer formats itself.)
That's part of it, yep--if we want it done the way we want it, we'll 
need to do it ourselves, and it'll likely be significantly faster.

Also, there's the issue of not requiring ICU, which makes it 
difficult to do string conversion if it isn't there... :)
--
Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Another simple perl task

2004-04-13 Thread Dan Sugalski
At 2:54 PM +0200 4/8/04, Stefan Lidman wrote:
>>Here's something for someone who wants to dig in a bit and needs a
>>place to start.
>>
>>Many, but by no means all, of the ops are JITted right now. There's
>>code to mess about with the JITting in jit2h.pl. What would be nice
>>is if there was a way to get a list of the ops that are *not* JITted,
>>so it'd be easy to poke around and add new ops as we go. (It'd also
>>be nice to verify an MD5, or other checksum, of the actual op source
>>with a data file in the distribution so we can see when an op changes
>>and invalidates the JITted version)
>
>Hi
>
>I am not sure I have done the right thing here but I think the
>following program does what you want. It must be run in the
>parrot root dir. Should it be added to an existing file? Which?
This looks quite nice and works as a standalone utility, so I'll put 
it in the repository as build_tools/list_unjitted.pl.
--
Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Unicode step by step

2004-04-13 Thread Marcus Thiesen
On Tuesday 13 April 2004 13:28, luka frelih wrote:
> just a confirmation...
> my i386 debian linux gives the same error repeatedly after make
> realclean,
> if i make again, it compiles a broken parrot which fails (too) many
> tests...
>
> also it seems (to me) that parrot's configured choice of compiler,
> linker, ... is not used in building icu?
>
> does icu have some non-ubiquitous dependencies?

As I said yesterday, it worked on a machine of mine which I hadn't touched for
quite a while. On my notebook, where I do daily builds, I ran into the same
problem, even after having done a realclean.
So I did a "make clean" in the icu subdir directly, deleted all files which
are listed in .cvsignore, and ran the "realclean configure build test" all
over, and now it works. Seems as if something doesn't get cleaned up in icu
with a parrot realclean.

Have fun,
Marcus


-- 
 :: Marcus Thiesen :: www.thiesen.org :: ICQ#108989768 :: 0x754675F2 :: 

Do something every day that you don't want to do; this is the golden rule for 
acquiring the habit of doing your duty without pain
   Mark Twain




Re: Another simple perl task

2004-04-13 Thread Leopold Toetsch
Dan Sugalski <[EMAIL PROTECTED]> wrote:

> This looks quite nice and works as a standalone utility, so I'll put
> it in the repository as build_tools/list_unjitted.pl.

or tools/dev as it isn't quite required for building parrot?

leo


Re: [perl #28494] [PATCH] unescape strings

2004-04-13 Thread Leopold Toetsch
Jeff Clites <[EMAIL PROTECTED]> wrote:

> 1) My version handles the case of code points > 0x as well.

It's simple now to change that code to include this. I've not done it
yet to keep this patch smaller (it includes #28473 too). My version is
just smaller, cleaner, and faster ;)

So config stuff applied + some bits from #28494.

leo


Re: [perl #28494] [PATCH] unescape strings

2004-04-13 Thread Jeff Clites
On Apr 13, 2004, at 7:18 AM, Leopold Toetsch wrote:

> Jeff Clites <[EMAIL PROTECTED]> wrote:
>> 1) My version handles the case of code points > 0x as well.
>
> It's simple now to change that code to include this. I've not done it
> yet to keep this patch smaller (it includes #28473 too). My version is
> just smaller, cleaner, and faster ;)

Ha, ha, good!

One other thing occurred to me, to save a few bytes: When upscaling, 
rather than passing clength, we can pass (result->strlen + number of 
bytes left in cstring). You made the very good point that we know the 
max size possible, and since at that point we've just parsed over an 
escape sequence, we know the max final size is a little less than 
clength.

> So config stuff applied + some bits from #28494.

Good!

JEff



Re: Another simple perl task

2004-04-13 Thread Dan Sugalski
At 4:26 PM +0200 4/13/04, Leopold Toetsch wrote:
>Dan Sugalski <[EMAIL PROTECTED]> wrote:
>> This looks quite nice and works as a standalone utility, so I'll put
>> it in the repository as build_tools/list_unjitted.pl.
>
>or tools/dev as it isn't quite required for building parrot?

That works too--I don't really care as long as it's in. :) If you
want to move it go ahead, that's fine. (I think I may take a shot at
some of the missing i386 ops, just for fun)
--
Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: [perl #28494] [PATCH] unescape strings

2004-04-13 Thread Leopold Toetsch
Jeff Clites <[EMAIL PROTECTED]> wrote:

> One other thing occurred to me, to save a few bytes: When upscaling,
> rather than passing clength, we can pass (result->strlen + number of
> bytes left in cstring).

If I read that correctly, s->strlen (or clength) is the desired length:
- on creation, a onebyte STRING is created with clength (-1)
- on upscaling, this is still the length, which then gets doubled

We could use (d + 1 + strlen_left), but this isn't worth the effort.

> JEff

leo


Re: Another simple perl task

2004-04-13 Thread Leopold Toetsch
Dan Sugalski <[EMAIL PROTECTED]> wrote:
> At 4:26 PM +0200 4/13/04, Leopold Toetsch wrote:
>>
>>or tools/dev as it isn't quite required for building parrot?

> That works too--I don't really care as long as it's in. :) If you
> want to move it go ahead, that's fine. (I think I may take a shot at
> some of the missing i386 ops, just for fun)

If you wanna have fun please look at t/pmc/object-meths_17 labeled:
"constructor - diamond parents" :)

leo


Re: Another simple perl task

2004-04-13 Thread Dan Sugalski
At 5:52 PM +0200 4/13/04, Leopold Toetsch wrote:
>Dan Sugalski <[EMAIL PROTECTED]> wrote:
>> At 4:26 PM +0200 4/13/04, Leopold Toetsch wrote:
>>> or tools/dev as it isn't quite required for building parrot?
>>
>> That works too--I don't really care as long as it's in. :) If you
>> want to move it go ahead, that's fine. (I think I may take a shot at
>> some of the missing i386 ops, just for fun)
>
>If you wanna have fun please look at t/pmc/object-meths_17 labeled:
>"constructor - diamond parents" :)

Heh. I ought to do that in the next day or so.
--
Dan
--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: [perl #28494] [PATCH] unescape strings

2004-04-13 Thread Jeff Clites
On Apr 13, 2004, at 8:35 AM, Leopold Toetsch wrote:

> Jeff Clites <[EMAIL PROTECTED]> wrote:
>> One other thing occurred to me, to save a few bytes: When upscaling,
>> rather than passing clength, we can pass (result->strlen + number of
>> bytes left in cstring).
>
> If I read that correctly, s->strlen (or clength) is the desired length:
> - on creation, a onebyte STRING is created with clength (-1)
> - on upscaling, this is still the length, which then gets doubled
When you start off, clength is the right thing, but once you hit an 
escape sequence, you find out that some of the input bytes were part of 
a single escape sequence. That is, consider this string which needs 
unescaping:

ab\x{212b}de  //clength is 12
--^
When you get to the section of code that is about to trigger the
upscale, you'll have 2 characters ("a" and "b") already in your
accumulated string, you're about to add the Angstrom character, and
you know you only have 2 more bytes to parse. So at that point, you
know the max characters you could end up with is 5 (2 + 1 + 2), so when
you call upscale, you could pass in 5 rather than 12. That's not a huge
savings, but the nice thing in this case is that you will have
originally allocated 12 bytes for the result string, and while
upscaling you're saying you need room for 5 characters == 10 bytes for
rep-2, so the actual allocated storage doesn't have to be expanded. (If
you passed in 12, it would make room for 24 bytes in the upscaled
string, even though it didn't need them.)

Not an enormous savings, but worth the tiny bit of math, probably, 
since we'd know for sure that we'd be allocating more storage than we 
need.

[Note: _string_upscale is currently simple, but not optimized. We
should enhance it for the case where we can upscale in place, because we
know that we have enough storage already allocated to accommodate
max(passed-in length, current length). That's what would let the above
be a savings.]

JEff



Re: Unicode step by step

2004-04-13 Thread Leopold Toetsch
Marcus Thiesen wrote:
> Seems as if something doesn't get cleaned up in icu
> with a parrot realclean.

Yep. I've removed cleaning icu from clean/realclean[1].

$ make help | grep clean
...
icu.clean:   ...

And there is always "make cvsclean".

leo

[1] If anyone puts that in again he might also send a lot faster PC to 
me (and possibly other developers ;)



Re: Unicode step by step

2004-04-13 Thread Dan Sugalski
At 6:22 PM +0200 4/13/04, Leopold Toetsch wrote:
>Marcus Thiesen wrote:
>> Seems as if something doesn't get cleaned up in icu with a parrot
>> realclean.
>
>Yep. I've removed cleaning icu from clean/realclean[1].

I think we need to put that back for a bit, but with this:

>[1] If anyone puts that in again he might also send a lot faster PC
>to me (and possibly other developers ;)
We're also likely going to be well-off if we get configure to detect 
a system ICU install and use that instead. It shouldn't be that 
tough, but I've not had a chance to poke around in the icu part of 
our config system to find out what we need to do.
--
Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: [perl #28494] [PATCH] unescape strings

2004-04-13 Thread Leopold Toetsch
Jeff Clites <[EMAIL PROTECTED]> wrote:

> ab\x{212b}de  //clength is 12
> --^

> ... end up with is 5 (2 + 1 + 2)

Ok, ok. You are right. As the string goes into constants and isn't
changed, we'll do it right, which is, as mentioned,

  d + 1 + length_todo

> [Note: _string_upscale is currently simple, but not optimized. We
> should enhance it for the case where we can upscale in place because we
> know that we have enough storage already allocated accommodate
> max(passed in length, current length). That's what would let the above
> be a savings.]

It should probably only allocate the string memory - no header, as in
unmake_COW() (see "STRING for_alloc;"). Then copy over the memory bits.
You can't reallocate the string, though. When it's possible in
place--better.

And finally *all* string allocation functions need the flags to decide
if to use constants pools or not (e.g. string_make_empty).

> JEff

leo


Re: Plans for string processing

2004-04-13 Thread Aaron Sherman
Ok, I'm still lost on the language thing. I'm not arguing, I just don't
get it, and I feel that if I'm going to do some of the things that I
want to for Perl 6, I'm going to have to get it.

On Mon, 2004-04-12 at 11:43, Dan Sugalski wrote:

> Language
> 
> *) Provides language-sensitive manipulation of characters (case mangling)
> *) Provides language-sensitive comparisons

Those two things do not seem to me to need language-specific strings at
all. They certainly need to understand the language in which they are
operating (avoiding the use of the word locale here, as per Larry's
concerns), but why does the language of origin of the string matter?

For example, in Perl5/Ponie:

@names=;
print "Phone Book: ", sort(@names), "\n";

In this example, I don't see why I would care that NAMES might be a
pseudo-handle that iterates over several databases, and returns strings
in the 7 different languages that those databases happen to contain. I
want my Phone Book sorted in a way that is appropriate to the language
of my phone book, with whatever special-case rules MY language has for
sorting funky foreign letters (and that might mean that even though a
comparison of two strings is POSSIBLE, in the current language it might
yield an exception, e.g. because Chinese and Japanese share a great many
characters that can be roughly converted, but neither have meaning in my
American English comparison).

More generally, an operation performed on a string (be it read
(comparison) or write (upcase, etc)) should be done in the way that the
*caller* expects, regardless of what legacy source the string came from
(I daren't even guess where that string that I got over a Parrot-enabled
CORBA might have been fetched from or if the language is still used
since it was stored in a cache somewhere 200 years ago, and it damn well
better not affect my sorting, no?)

Ok, so that's my take... what am I missing?

> *) Provides language-sensitive character overrides ('ll' treated as a 
> single character, for example, in Spanish if that's still desired)
> *) Provides language-sensitive grouping overrides.

Ah, and here we come to my biggest point of confusion.

You describe logic that surrounds a given language, but you'll never
need "cmp" to know how to compare Spanish "ll" to English "ll", for
example. In fact, that doesn't even make sense to me. What you will need
is for cmp to know the Spanish comparison rules so that when it gets two
strings to compare, and it is asked to do so in Spanish, the proper
thing will happen.
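A sketch of that idea in Python (purely illustrative, not Parrot internals): the comparator carries the language, so plain strings can still be sorted under traditional Spanish rules, where the digraph "ll" collates as a single letter after "l". The `spanish_key` helper and its sentinel character are assumptions for the sketch, not a real collation implementation.

```python
def spanish_key(word):
    """Collation key for traditional Spanish: treat the digraph 'll'
    as one letter sorting after plain 'l'.  '~' (0x7E) sorts after
    'z' (0x7A), which pushes 'll'-words past every plain l-word."""
    return word.lower().replace("ll", "l~")

names = ["llama", "luz", "lava"]
# Default code-point order interleaves "llama" among the l-words...
assert sorted(names) == ["lava", "llama", "luz"]
# ...while the Spanish comparator puts it after "luz".
assert sorted(names, key=spanish_key) == ["lava", "luz", "llama"]
```

The point being: nothing here required the strings themselves to know they were Spanish.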

I guess this boils down to two choices:

a) All strings will have the user's language by default

or

b) Strings will have different languages and behave according to their
"sources" regardless of the native rules of the user.

"b" seems to me to yield very surprising results, and not at all justify
the baggage placed inside a string. If I can be forgiven for saying so,
it's even close to Perl 4's $[, which allowed you to change the
semantics of arrays, only here, you're doing it as a property on a
string so that I can't trust that any string will behave the way I
expect unless I "untaint" it.

Again, I'm asking for corrections here.

> IW: Mush together (either concatenate or substr replacement) two 
> strings of different languages but same charset

According to whose rules? Does it make sense to merge an American
English string with a Japanese string unless you have a target language?

This means that someone's rules must become dominant, and as a
programmer, I'm expecting that to be neither string a nor string b, but
the user's. If the user happens to be Portuguese, then I would expect
that some kind of exception is going to emerge, but if the user is
Japanese, then it makes sense, and American English can be treated as
romaji, and an exception thrown if non-romaji ascii characters are used.
Again, this is not something that the STRING can really have much of a
clue about. It's all context.

What is the reason for every string value carrying around such context?
Certainly numbers don't carry around their base as context, and yet
that's critical when converting to a string!

-- 
Aaron Sherman <[EMAIL PROTECTED]>
Senior Systems Engineer and Toolsmith
"It's the sound of a satellite saying, 'get me down!'" -Shriekback




Re: Plans for string processing

2004-04-13 Thread Dan Sugalski
At 1:55 PM -0400 4/13/04, Aaron Sherman wrote:
Ok, I'm still lost on the language thing. I'm not arguing, I just don't
get it, and I feel that if I'm going to do some of the things that I
want to for Perl 6, I'm going to have to get it.
On Mon, 2004-04-12 at 11:43, Dan Sugalski wrote:

 Language
 
 *) Provides language-sensitive manipulation of characters (case mangling)
 *) Provides language-sensitive comparisons
Those two things do not seem to me to need language-specific strings at
all. They certainly need to understand the language in which they are
operating (avoiding the use of the word locale here, as per Larry's
concerns), but why does the language of origin of the string matter?
Because the way a string is upcased/downcased/titlecased depends on 
the language the string came from. The treatment of accents and a 
number of specific character sequences depends on the language the 
string came from. Ignore it and, well, you're going to find that 
you're messing up the display of someone's name. That strikes me as 
rather rude.

You also don't always have the means of determining what's right. 
It's particularly true of library code.

For example, in Perl5/Ponie:

@names = <NAMES>;
print "Phone Book: ", sort(@names), "\n";
In this example, I don't see why I would care that NAMES might be a
pseudo-handle that iterates over several databases, and returns strings
in the 7 different languages that those databases happen to contain.
Then *you* don't. That's fine. Why, though, do you assume that 
*nobody* will? That's the point.

You may decide that all strings shall be treated as if they were in 
character set X, and language Y, whatever that is. Fine. You may 
decide that the language you're designing will treat all strings as 
if they're in character set X and language Y. That's fine too. Parrot 
must support the capability of forcing the decision, and we will.

What I don't want to do is *force* uniformity. Some of us do care. If 
we do it the way I want, then we can ultimately both do what we want. 
If we do it the way you want, though, we can't--I'm screwed since the 
data is just not there and can't *be* there.

We've tried the whole monoculture thing before. That didn't work with 
ASCII, EBCDIC, any of the Latin-x, ISO-whatever, and it's not working 
for a lot of folks with Unicode. (Granted, only a couple of billion, 
so it's not *that* big a deal...) We've also tried the whole global 
setting thing, and if you think that worked I dare you to walk up to 
Jarkko and whisper "Locale" in his ear.

If you want to force a simplified view of things as either an app 
programmer or language designer, well, great. I am OK with that. More 
than OK, really, and I do understand the desire. What I'm not OK with 
is mandating that simplified view on everyone.
--
Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Plans for string processing

2004-04-13 Thread Brent 'Dax' Royal-Gordon
Dan Sugalski wrote:
> 1) Parrot will *not* require Unicode. Period. Ever.
My old 8MB Visor Prism thanks you.

> *) Transform stream of bytes to and from a set of 32-bit integers
> *) Manages byte buffer (so buffer positioning and manipulation by code
> point offset is handled here)
What's wrong with, *as an internal optimization only*, storing the
string in the more efficient-to-access format of the patch?  I mean,
yeah, you don't want it to be externally visible, but if you're going to
treat a string as a series of ints, why not store it that way?
I really see no reason to store strings as UTF-{8,16,32} and waste CPU
cycles on decoding it when we can do a lossless conversion to a format
that's both more compact (in the most common cases) and faster.
--
Brent "Dax" Royal-Gordon <[EMAIL PROTECTED]>
Perl and Parrot hacker
Oceania has always been at war with Eastasia.


Re: Plans for string processing

2004-04-13 Thread Dan Sugalski
At 12:44 PM -0700 4/13/04, Brent 'Dax' Royal-Gordon wrote:
Dan Sugalski wrote:
1) Parrot will *not* require Unicode. Period. Ever.
My old 8MB Visor Prism thanks you.
:) As does my gameboy.

*) Transform stream of bytes to and from a set of 32-bit integers
*) Manages byte buffer (so buffer positioning and manipulation by 
code point offset is handled here)
What's wrong with, *as an internal optimization only*, storing the 
string in the more efficient-to-access format of the patch?  I mean, 
yeah, you don't want it to be externally visible, but if you're 
going to treat a string as a series of ints, why not store it that 
way?

I really see no reason to store strings as UTF-{8,16,32} and waste 
CPU cycles on decoding it when we can do a lossless conversion to a 
format that's both more compact (in the most common cases) and 
faster.
Erm... UTF-32 is a fixed-width encoding. (That Unicode is inherently 
a variable-width character set is a separate issue, though given the 
scope of the project a correct decision) I'm fine with leaving ICU to 
store unicode data internally any damn way it wants, though--partly 
because the IBM folks are Darned Clever and I trust their judgement, 
and partly because it means we don't have to write all the code to 
properly handle Unicode.

Other variable-width encodings will likely be stored internally as 
fixed-width buffers, at least once the data gets manipulated some. 
Assuming I'm not convinced that Unicode is the true way to go... :)
--
Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


[perl #28531] [PATCH] C++ comment fix

2004-04-13 Thread via RT
# New Ticket Created by  Adam Thomason 
# Please include the string:  [perl #28531]
# in the subject line of all future correspondence about this issue. 
# <URL: http://rt.perl.org:80/rt3/Ticket/Display.html?id=28531 >


A handful of // comments are lingering around the tree.  Patch fixes them to use /* 
... */.

Adam


comment.patch
Description: Binary data


Re: Plans for string processing

2004-04-13 Thread Michael Scott
On 12 Apr 2004, at 17:43, Dan Sugalski wrote:

IW: Mush together (either concatenate or substr replacement) two 
strings of different languages but same charset
TP: Checks to see if that's allowed. If not, an exception is thrown. 
If so, we do the operation. If one string is manipulated the language 
stays whatever that string was. If a new string is created either the 
left side wins or the default language is used, depending on the 
interpreter setting.

Does that mean that a Parrot string will always have a specific 
language associated with it?

Mike



Re: Plans for string processing

2004-04-13 Thread Dan Sugalski
At 10:44 PM +0200 4/13/04, Michael Scott wrote:
On 12 Apr 2004, at 17:43, Dan Sugalski wrote:

IW: Mush together (either concatenate or substr replacement) two 
strings of different languages but same charset
TP: Checks to see if that's allowed. If not, an exception is 
thrown. If so, we do the operation. If one string is manipulated 
the language stays whatever that string was. If a new string is 
created either the left side wins or the default language is used, 
depending on the interpreter setting.

Does that mean that a Parrot string will always have a specific 
language associated with it?
Yes.

Note that the language might be "Dunno". :) There'll be a default 
that's assigned to input data and suchlike things, and the language 
markers in the strings can be overridden by code.
--
Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Plans for string processing

2004-04-13 Thread Michael Scott
On 13 Apr 2004, at 22:48, Dan Sugalski wrote:

Note that the language might be "Dunno". :) There'll be a default 
that's assigned to input data and suchlike things, and the language 
markers in the strings can be overridden by code.

Would this be right?

English + English = English
English + Chinese = Dunno
English + Dunno = Dunno
+ being symmetric.

How does a Dunno string know how to change case?

Mike



Re: Plans for string processing

2004-04-13 Thread Dan Sugalski
At 11:28 PM +0200 4/13/04, Michael Scott wrote:
On 13 Apr 2004, at 22:48, Dan Sugalski wrote:

Note that the language might be "Dunno". :) There'll be a default 
that's assigned to input data and suchlike things, and the language 
markers in the strings can be overridden by code.

Would this be right?

English + English = English
English + Chinese = Dunno
English + Dunno = Dunno
+ being symmetric.
I've been assuming it's a left-side wins, as you're tacking onto an 
existing string, so you'd get English in all cases. Alternately you 
could get an exception. The end result of a mixed-language operation 
could certainly be the Dunno language or the current default--both'd 
be reasonable.
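The options Dan lists can be sketched as a tiny policy function (Python, illustrative; the names `result_language` and `DUNNO` are made up for the sketch, not Parrot API):

```python
DUNNO = "Dunno"

def result_language(left, right, default=DUNNO, policy="left"):
    """Language of a new string built from two others: either the
    left side wins, or mixed-language results fall back to a
    default such as 'Dunno'."""
    if left == right:
        return left
    return left if policy == "left" else default

assert result_language("English", "English") == "English"
assert result_language("English", "Chinese") == "English"        # left wins
assert result_language("English", "Chinese", policy="fallback") == DUNNO
```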

How does a Dunno string know how to change case?
It uses the defaults provided by the character set.
--
Dan
--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk


Re: Plans for string processing

2004-04-13 Thread Leopold Toetsch
Brent 'Dax' Royal-Gordon <[EMAIL PROTECTED]> wrote:

> I really see no reason to store strings as UTF-{8,16,32} and waste CPU
> cycles on decoding it when we can do a lossless conversion to a format
> that's both more compact (in the most common cases) and faster.

The default format now isn't UTF-8. It's a series of fixed-size entries
of either uint_8, uint_16, or uint_32. These reflect the most common
encodings, which are: char*, UCS-2, and UCS-4/UTF-32 (or possibly other
32-bit encodings). This should cover "common" cases.

No cycles are wasted for storing "straight" encodings.
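Picking the cell width is a one-pass scan over the code points; a Python sketch of the selection rule (the function name is invented for illustration):

```python
def narrowest_width(code_points):
    """Smallest fixed-size cell that holds every code point:
    1 byte covers char*/Latin-1 data, 2 bytes covers UCS-2,
    4 bytes covers UCS-4/UTF-32."""
    top = max(code_points, default=0)
    if top < 0x100:
        return 1
    if top < 0x10000:
        return 2
    return 4

assert narrowest_width([ord(c) for c in "abc"]) == 1
assert narrowest_width([ord(c) for c in "a\u00f1o"]) == 1  # ñ fits a byte
assert narrowest_width([0x20AC]) == 2    # EURO SIGN needs 16 bits
assert narrowest_width([0x1D11E]) == 4   # supplementary plane needs 32
```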

leo


Re: Unicode step by step

2004-04-13 Thread Leopold Toetsch
Dan Sugalski <[EMAIL PROTECTED]> wrote:
> At 6:22 PM +0200 4/13/04, Leopold Toetsch wrote:
>>Marcus Thiesen wrote:
>>>. Seems as if something doesn't get cleaned up in icu with a parrot
>>>realclean.
>>
>>Yep. I've removed cleaning icu from clean/realclean[1].

> I think we need to put that back for a bit,

I did list two alternatives. The "normal" way of changes doesn't include
changes to ICU source (and honestly shouldn't). Currently building is
still a bit in flux, which does mandate a "make icu.clean".

And there is of course already a new ICU version on *their* website, but
we still try to get/keep 2.6 running.

I'm still not sure that this lib should be part of *our* tree ...

> ... but with this:

>>[1] If anyone puts that in again he might also send a lot faster PC
>>to me (and possibly other developers ;)

> We're also likely going to be well-off if we get configure to detect
> a system ICU install and use that instead.

There are severals issues: First one is MANIFEST and CVS and patches.
Config steps should be simple. But - of course - I'd appreciate this
alternative as already laid out.

leo


Re: Plans for string processing

2004-04-13 Thread Leopold Toetsch
Aaron Sherman <[EMAIL PROTECTED]> wrote:
> For example, in Perl5/Ponie:

> @names = <NAMES>;
> print "Phone Book: ", sort(@names), "\n";

> In this example, I don't see why I would care that NAMES might be a
> pseudo-handle that iterates over several databases, and returns strings
> in the 7 different languages

I already showed an example where uc("i") isn't "I". Collating is still
more complex than a »simple« uc().
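That uc("i") exception is the Turkish case mapping, where "i" upcases to "İ" (U+0130). A Python sketch of layering language-specific overrides on the default, language-independent Unicode mapping (the override table and `uc` helper are invented for illustration):

```python
# Language-specific overrides on top of the default case mapping;
# Turkish/Azeri map dotted i -> İ and dotless ı -> I.
CASE_OVERRIDES = {
    "tr": {"i": "\u0130", "\u0131": "I"},
}

def uc(s, lang=None):
    table = CASE_OVERRIDES.get(lang, {})
    return "".join(table.get(ch, ch.upper()) for ch in s)

assert uc("i") == "I"                  # default (language-independent) rules
assert uc("i", lang="tr") == "\u0130"  # Turkish: İ, not I
```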

> More generally, an operation performed on a string (be it read
> (comparison) or write (upcase, etc)) should be done in the way that the
> *caller* expects,

Well, we don't know what the caller expects. The caller has to decide.
There are basically at least two ways: Treat all strings language
independent (from their origin) or append more information to each
string.

>> *) Provides language-sensitive character overrides ('ll' treated as a
>> single character, for example, in Spanish if that's still desired)
>> *) Provides language-sensitive grouping overrides.

> Ah, and here we come to my biggest point of confusion.

Another example:

 "my dog Fiffi" eq "my dog Fi\x{fb03}"

When my program is doing typographical computations, above equation is
true. And useful. The characters "f", "f", "i" are going to be printed.
But the ligature "ffi" takes less space when printed as such.
This is the same character string, though, when I'm a reader of this dog
news paper.

When I do an analysis of counting "f"s in dog names, I don't care if
it's written in one of these forms, it should be the same - or when I
search for "ffi" in the text.

It just depends who's using these features in which context.
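Unicode already models exactly this distinction: the ligature is a compatibility equivalent of the three letters, so NFKC normalization makes the two spellings compare equal for searching and counting, while the raw code points stay distinct for typography. A Python illustration of Leo's example:

```python
import unicodedata

plain = "my dog Fiffi"
liga  = "my dog Fi\ufb03"   # U+FB03 LATIN SMALL LIGATURE FFI

# As raw code points the strings differ (the typographic view)...
assert plain != liga
# ...but compatibility normalization folds the ligature, so a search
# for "ffi" or a count of "f"s treats them as the same text.
assert unicodedata.normalize("NFKC", liga) == plain
assert "ffi" in unicodedata.normalize("NFKC", liga)
```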

> I guess this boils down to two choices:

> a) All strings will have the user's language by default

> or

> b) Strings will have different languages and behave according to their
> "sources" regardless of the native rules of the user.

and/or either the strings or the users default come in depending on the
desired action.

>> IW: Mush together (either concatenate or substr replacement) two
>> strings of different languages but same charset

> According to whose rules?

User level - what do you want to achieve. At codepoint level the
operation is fine. It doesn't make sense above that, though.

> This means that someone's rules must become dominant,

It doesn't make much sense to do

   bors S0, S1   # stringwise bitwise or

to anything that isn't single-byte encoded. It depends.
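Why it only makes sense on single-byte data is easy to see in a Python sketch (a stand-in for the Parrot op, not its implementation): ORing bytes of a variable-width encoding like UTF-8 can yield byte sequences that are no longer valid text at all.

```python
def bors(a: bytes, b: bytes) -> bytes:
    """Bytewise OR of two buffers; only meaningful when one byte is
    one character, i.e. for single-byte encodings."""
    return bytes(x | y for x, y in zip(a, b))

# Fine on single-byte data: 0x40 | 0x61 == 0x61, etc.
assert bors(b"@@@", b"abc") == b"abc"
```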

The rules - how and when they apply - still have to be laid out.

leo


Re: Plans for string processing

2004-04-13 Thread Aaron Sherman

Thanks for your response. I'm not sure that you and I are speaking about
exactly the same things, since you state that the logical extensions, if
not outright goals, of an alternate approach would be an exclusionary
monoculture. I'm not sure that's quite right.

On Tue, 2004-04-13 at 15:06, Dan Sugalski wrote:

> >>  *) Provides language-sensitive manipulation of characters (case mangling)
> >>  *) Provides language-sensitive comparisons
> >
> >Those two things do not seem to me to need language-specific strings at
> >all. They certainly need to understand the language in which they are
> >operating (avoiding the use of the word locale here, as per Larry's
> >concerns), but why does the language of origin of the string matter?
> 
> Because the way a string is upcased/downcased/titlecased depends on 
> the language the string came from. The treatment of accents and a 
> number of specific character sequences depends on the language the 
> string came from.

> Ignore it and, well, you're going to find that 
> you're messing up the display of someone's name. That strikes me as 
> rather rude.

For proper names, you may have a point (though the ordering of names in
a phone book, for example, is often according to the language of the
book, not the origin of the names), and in some forms of string
processing, that kind of deference to the origin of a word may turn out
to be useful. I do "get" that much.

What I'm not getting is

  * Why do we assume that the language property of a string will be
the language from which the word correctly originates rather
than the locale of the database / web site / file server /
whatever that we received it from? That could actually result in
dealing with native words according to the rules of foreign
languages, and boy-howdy is that going to be fun to debug.
  * Why is it so valuable as to attach a value to every string ever
created for it rather than creating an abstraction at a higher
level (e.g. a class)
  * Why wouldn't you do the same thing for MIME type, as strings may
also (and perhaps more often) contain data which is more
appropriately tagged that way? The SpamAssassin guys would love
you for this!

> What I don't want to do is *force* uniformity. Some of us do care.

Hey, that's a bit of a low blow. I care quite a bit, or I would not ask.
I'm not saying that the guy who wants to sort names according to their
source language is wrong, I'm saying that he doesn't need core support
in Parrot to do it, so I'm curious why it's in there.

> We've tried the whole monoculture thing before.

I just don't think that moving language up a layer or two of abstraction
enforces a monoculture... again, I'm willing to see the light if someone
can explain it.

A lot of your response is about "enforcing", and I'm not sure how I gave
the impression of this being an enforcement issue (or perhaps you think
that non-localization is something that needs to be enforced?) I just
can't see how every string needs to carry around this kind of
world-view-altering context when 99% of programs that use string data
(even those that use mixed encodings) won't want to apply said context,
but rather perform all operations according to their locale. Am I wrong
about that?

One thing that was not answered, though is what happens in terms of
dominance. When sorting French and Norwegian Unicode strings, who loses
(wins?) when you try to compare them? Comparing across language
boundaries would be a monumental task, and would be instantly reviled as
wrong by every language purist in the world (to my knowledge no one has
ever published a uniform way to compare two words, much less arbitrary
text, unless you are willing to do so using the rules of one and only
one culture (and I say culture because often the rules of a culture are
mutually incompatible with those of any one source language's strict
rules)). So, if you have to convert in order to compare, whose language
do you do the comparison in? You can't really rely on LHS vs. RHS, since
a sort will reverse these many times (and C<$a cmp $b> had better be
C<-($b cmp $a)> or your sort may never terminate!)
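The antisymmetry requirement is automatic as soon as one culture's rules dominate the whole sort: derive a collation key for both operands with the same function and compare the keys. A Python sketch (the key function here is a stand-in, not a real locale collator):

```python
def cmp_in(lang_key, a, b):
    """Three-way compare under one culture's collation key.
    Because both sides go through the same key, cmp_in(k, a, b)
    is always -cmp_in(k, b, a), so sorting terminates."""
    ka, kb = lang_key(a), lang_key(b)
    return (ka > kb) - (ka < kb)

key = str.casefold  # stand-in for a real language-specific key
assert cmp_in(key, "abel", "\u00c1bel") == -cmp_in(key, "\u00c1bel", "abel")
```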

-- 
Aaron Sherman <[EMAIL PROTECTED]>
Senior Systems Engineer and Toolsmith
"It's the sound of a satellite saying, 'get me down!'" -Shriekback




Re: new libraries

2004-04-13 Thread Tim Bunce
On Sat, Apr 10, 2004 at 01:49:37PM +0300, Jarkko Hietaniemi wrote:
> > 
> > (We've learnt the hard way with Perl5 modules names that more words are good.
> 
> And more words that mean something... "Data" ranks right up there as the
> worst possible names for anything.

(Nah, "Sys" and "System" are at the top of the list :)

Anyone wanting to act as a guiding light for Perl6 module naming is
very welcome. I've been there and done that once. For ten years.
My time is up.

Tim.