Re: Garbage collection (was Re: JWZ on s/Java/Perl/)

2001-02-15 Thread Branden

Hong Zhang
  A deterministic finalization means we shouldn't need to force
  programmers to have good ideas. Make it easy, remember? :)

 I don't believe such an algorithm exists, unless you stick with
 reference count.


Either it doesn't exist, or it is more expensive than refcounting. I guess we
have to choose between deterministic finalization and not using refcounting
as GC, because the two together certainly don't exist.

And don't forget that if we stick with refcounting, we should try to find a
way to break circular references, too.

- Branden




Re: Garbage collection (was Re: JWZ on s/Java/Perl/)

2001-02-15 Thread Tim Bunce

On Thu, Feb 15, 2001 at 08:21:03AM -0300, Branden wrote:
 Hong Zhang
    A deterministic finalization means we shouldn't need to force
    programmers to have good ideas. Make it easy, remember? :)
 
   I don't believe such an algorithm exists, unless you stick with
   reference count.
 
 Either it doesn't exist, or it is more expensive than refcounting. I guess
 we have to choose between deterministic finalization and not using
 refcounting as GC, because the two together certainly don't exist.
 
 And don't forget that if we stick with refcounting, we should try to find a
 way to break circular references, too.

As a part of that, the weak reference concept, bolted recently into perl5,
could be made more central in perl6.

Around 92.769% of the time circular references are known to be circular
by the code that creates them (like a 'handy' ref back to a parent node).
Having a weakref, or similar, operator in the language would help greatly.
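
Something along these lines, say (a minimal sketch against the weaken
interface that Scalar::Util exposes; the data structure is invented):

    use Scalar::Util qw(weaken);

    my $parent = { name => 'root', children => [] };
    my $child  = { name => 'leaf', parent => $parent };
    push @{ $parent->{children} }, $child;

    # The back-ref is known to be circular by the code creating it,
    # so weaken it: it no longer counts toward $parent's refcount,
    # and the pair can be reclaimed by plain refcounting.
    weaken($child->{parent});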

Tim.



Re: Garbage collection (was Re: JWZ on s/Java/Perl/)

2001-02-15 Thread Branden

Tim Bunce wrote:
 On Thu, Feb 15, 2001 at 08:21:03AM -0300, Branden wrote:
  And don't forget that if we stick with refcounting, we should try to
  find a way to break circular references, too.

 As a part of that the weak reference concept, bolted recently into perl5,
 could be made more central in perl6.

 Around 92.769% of the time circular references are known to be circular
 by the code that creates them (like a 'handy' ref back to a parent node).
 Having a weakref, or similar, operator in the language would help greatly.

Do weakrefs really work in Perl 5? I know the referent's refcount isn't
incremented when a weakref is created, but isn't it decremented (possibly
trying to free the object) when the weakref goes out of scope? What happens
if the object goes out of scope before the variable holding the weakref does?
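
To make the question concrete (a sketch against the perl5
Scalar::Util::weaken interface):

    use Scalar::Util qw(weaken);

    my $weak;
    {
        my $obj = { data => 42 };
        $weak = $obj;
        weaken($weak);       # doesn't bump $obj's refcount
    }                        # last strong reference to $obj dies here

    # perl5 turns a weakref into undef when its referent is freed,
    # rather than leaving a dangling pointer:
    print defined $weak ? "still alive\n" : "referent gone\n";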

Weakrefs are probably enough to break circular references in 99% of the
cases. But we must make sure they work properly! And also that bugs in
their use don't dump core, but at most throw exceptions.

- Branden




Re: Garbage collection (was Re: JWZ on s/Java/Perl/)

2001-02-15 Thread Branden

Damien Neil wrote:
 On Thu, Feb 15, 2001 at 08:07:39AM -0300, Branden wrote:
  I think you just said all about why we shouldn't bother giving objects
  deterministic finalization, and I agree with you. If we explicitly want
  to free resources (files, database connections), then we explicitly
  call close. Otherwise, it will be called when DESTROY is eventually
  called.

 No, the question of whether Perl 6 wants deterministic finalization
 or not is a separate one.  If it doesn't have it, we will be losing
 a very common Perl idiom:

   {
  my $fh = IO::File->new("file");
 print $fh $data;
   }


Re-read what you wrote in
http:[EMAIL PROTECTED]/msg02468.html. I
think you've got to decide what you want. Do you want smart GC (without
deterministic finalization) and to free resources explicitly in the special
cases? Or do you want to keep the common Perl idioms (which probably means
refcounting)?

I would say I probably prefer refcounting (with some kind of algorithm for
breaking circular references), because I think the advantages it brings pay
for its price.



 It's nice to know that when the above block exits, $fh will be closed.
 Remember that "closed" doesn't just refer to freeing the resources
 associated with it -- it also includes flushing buffers and the like.


Just set autoflush, if you're lazy...



 Without deterministic finalization, you will almost always want to
 write the above to include an explicit $fh->close().

Exactly.
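
I.e. the idiom becomes something like (a sketch; the "> file" mode and
the error checks are mine):

    use IO::File;

    my $data = "some data\n";
    {
        my $fh = IO::File->new("> file") or die "open: $!";
        print $fh $data;
        # explicit close, since DESTROY may come late or not at all
        $fh->close() or die "close: $!";
    }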

 The problem is
 that you not only can't count on $fh's DESTROY being called at the end of
 the block, you often can't count on it ever happening.

Anyway, the file would be flushed and closed...

 Consider the
 case where the interpreter dies on a signal, for example -- DESTROY
 methods will quite possibly not be called.


Actually, I think this can be worked around. Can't the interpreter catch
signals?

 I'm not certain that Perl should lose deterministic finalization.  On
 the other hand, I really wish that Perl had a more modern GC scheme,
 if only so that circular structures could be properly collected.

Agree.

 In
 the end, however, I don't think that any of our opinions will decide
 this -- either Dan's forthcoming PDD will show how Perl 6 can have
 its cake and eat it too, or Larry will decide.


OK for me.

- Branden


 - Damien









Re: Garbage collection (was Re: JWZ on s/Java/Perl/)

2001-02-15 Thread Alan Burlison

Branden wrote:

 Just set autoflush, if you're lazy...

And say goodbye to performance...

  The problem is
  that you not only can't count on $fh's DESTROY being called at the end of
  the block, you often can't count on it ever happening.
 
 Anyway, the file would be flushed and closed...

That's not sufficient.  Without deterministic finalisation, what does
the following do?

  {
    my $fh = IO::File->new("file");
    print $fh "foo\n";
  }
  {
    my $fh = IO::File->new("file");
    print $fh "bar\n";
  }

At present "file" will contain "foo\nbar\n".  Without DF it could just
as well be "bar\nfoo\n".  Make no mistake, this is a major change to the
semantics of perl.

Alan Burlison



Re: Garbage collection (was Re: JWZ on s/Java/Perl/)

2001-02-15 Thread Hong Zhang

   {
 my $fh = IO::File->new("file");
 print $fh "foo\n";
   }
   {
 my $fh = IO::File->new("file");
 print $fh "bar\n";
   }
 
 At present "file" will contain "foo\nbar\n".  Without DF it could just
 as well be "bar\nfoo\n".  Make no mistake, this is a major change to the
 semantics of perl.
 
 Alan Burlison

This code should NEVER work, period. People will just ask for trouble
with this kind of code.

Full DF never exists, even with reference counting. Can anyone show me how
to deterministically collect a circular reference? The current semantics
of perl work most of the time, but not always.

What we are really talking about is "Shall Perl provide 90% or 99% of DF?"
The operating system provides 0% during runtime, 100% at process exit.

Hong






Re: Garbage collection (was Re: JWZ on s/Java/Perl/)

2001-02-15 Thread Alan Burlison

Hong Zhang wrote:

 This code should NEVER work, period. People will just ask for trouble
 with this kind of code.

Actually I meant to have specified ">>" as the mode, i.e. append, then
what I originally said holds true.  This behaviour is predictable and
dependable in the current perl implementation.  Without the ">>" the file
will contain just "bar\n".
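
In other words, the example should have read:

  {
    my $fh = IO::File->new(">> file");   # append
    print $fh "foo\n";
  }
  {
    my $fh = IO::File->new(">> file");
    print $fh "bar\n";
  }

With deterministic finalisation the first handle is flushed and closed
before the second open happens, which is what guarantees "foo\nbar\n".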

The point is that we have a stated goal of preserving the existing
semantics, and of allowing existing perl5 code to continue to work. 
Despite what some people seem to think this is *not* a clean slate
situation.  We may well have to deliberately carry over questionable but
depended-upon behaviour into perl6.

my $fh = do { local *FH; *FH; }

for example, better continue to work.

Alan Burlison



Re: Garbage collection (was Re: JWZ on s/Java/Perl/)

2001-02-15 Thread Hong Zhang

 Hong Zhang wrote:
 
  This code should NEVER work, period. People will just ask for trouble
  with this kind of code.
 
 Actually I meant to have specified ">>" as the mode, i.e. append, then
 what I originally said holds true.  This behaviour is predictable and
 dependable in the current perl implementation.  Without the ">>" the file
 will contain just "bar\n".

That was not what I meant. Your code already assumes the existence of
reference counting. It does not work well with any other kind of garbage
collection. If you translate the same code into C without putting in
the close(), the code will not work at all.

By the way, in order to use perl on real native-thread systems, we have
to use atomic operations to increment/decrement the reference count. On most
systems I have measured (PC and SPARC), an atomic operation takes about
0.1-0.3 microseconds, and it will be even worse on large SMP machines.
The latest garbage collection algorithms (parallel and concurrent) can
handle large memories pretty well. The cost is that you get less DF.

Hong




Re: Garbage collection (was Re: JWZ on s/Java/Perl/)

2001-02-15 Thread Alan Burlison

Hong Zhang wrote:

 That was not what I meant. Your code already assumes the existence of
 reference counting. It does not work well with any other kind of garbage
 collection. If you translate the same code into C without putting in
 the close(), the code will not work at all.

Wrong, it does *not* assume any such thing.  It assumes that when a
filehandle goes out of scope it is closed.  How that is achieved is a
detail of the implementation, and could be done in a number of ways.  It
could just as well be done by keeping the filehandle on a stack which
was cleared when the scope exits.  C++ does this for local variables
without requiring a refcount.  

 By the way, in order to use perl on real native-thread systems, we have
 to use atomic operations to increment/decrement the reference count. On most
 systems I have measured (PC and SPARC), an atomic operation takes about
 0.1-0.3 microseconds, and it will be even worse on large SMP machines.
 The latest garbage collection algorithms (parallel and concurrent) can
 handle large memories pretty well. The cost is that you get less DF.

I think you'll find that both GC *and* a reference counting scheme will
require the heavy use of mutexes in a MT program.

Alan Burlison



string encoding

2001-02-15 Thread Hong Zhang

Hi, All,

I want to give some of my thoughts about string encoding.

Personally I like the UTF-8 encoding. The problem of
variable length can be handled by a special (virtual)
function like

class String {
virtual UV iterate(/*inout*/ int* index);
};

So in typical string iteration, the code will look like
for (i = 0; i < size;) {
UV ch = s->iterate(&i);
/* do what u want */
}
instead of
for (i = 0; i < size; i++) {
uint32 ch = s->charAt(i);
/* be my guest */
}

The new style will be strange, but not very difficult to
use. It also hides the internal representation.

The UTF-32 suggestion is largely ignorant of internationalization.
Many user-perceived characters are composed of more than one Unicode
code point. If you consider Unicode normalization, canonical forms,
Hangul conjoining, Indic clusters, combining characters, virama,
collation, and locales, UTF-32 will not help you much, if at all.

Hong




Adoption ??: Rare Salt-Water Camel May Be Separate Species

2001-02-15 Thread John van V



  http://news.bbc.co.uk/hi/english/sci/tech/newsid_1156000/1156212.stm

This nuclear/dynamite stuff is making me sad.

Wanna contribute in the name of perl ??  Lets see... China + UN = $perl_revenue



Re: string encoding

2001-02-15 Thread Simon Cozens

On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote:
 Personally I like the UTF-8 encoding. The problem of
 variable length can be handled by a special (virtual)
 function like

I'm expecting that the virtual, internal representation will not
be in a UTF but will simply be an array of codepoints. Manipulating
UTF8 internally is horrible because it's a variable length encoding,
so you need to keep track of where you are both in terms of characters
and bytes. Yuck, yuck, yuck.

-- 
Calm down, it's *only* ones and zeroes.



Re: string encoding

2001-02-15 Thread Hong Zhang

 On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote:
  Personally I like the UTF-8 encoding. The problem of
  variable length can be handled by a special (virtual)
  function like
 
 I'm expecting that the virtual, internal representation will not
 be in a UTF but will simply be an array of codepoints. Manipulating
 UTF8 internally is horrible because it's a variable length encoding,
 so you need to keep track of where you are both in terms of characters
 and bytes. Yuck, yuck, yuck.

I am not sure if you have read through my email.

The concept of a character has nothing to do with codepoints.
Many characters are composed of more than one codepoint.

The concept of character position is completely useless in
many languages. Many languages just don't have the English-style
"character"; see collation, Hangul conjoining, combining characters.
There is just no easy way to keep track of character position.
What you really mean is probably the codepoint position.
The codepoint position is largely internal to the library.
As long as regular expressions can efficiently handle UTF-8
(as they do now), most people will feel just fine with it.

There are just not many people interested in the codepoint
position, if they have ever heard of it. They care more about
m// or s///.

Even if you want to keep track of character offsets, it is still much
easier than many of the other Unicode features I mentioned.

Hong




Re: string encoding

2001-02-15 Thread Jarkko Hietaniemi

On Thu, Feb 15, 2001 at 11:16:29PM +, Simon Cozens wrote:
 On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote:
  Personally I like the UTF-8 encoding. The problem of
  variable length can be handled by a special (virtual)
  function like
 
 I'm expecting that the virtual, internal representation will not
 be in a UTF but will simply be an array of codepoints. Manipulating
 UTF8 internally is horrible because it's a variable length encoding,
 so you need to keep track of where you are both in terms of characters
 and bytes. Yuck, yuck, yuck.

...and because of this you can't randomly access the string, you are
reduced to sequential access (*).  And here I thought we could have
left tape drives to the last millennium.

(*) Yes, of course you could cache your sequential access so you only
need to do it once, and build balanced trees and whatnot out of those
offsets to have random access emulated in O(lg n) per lookup, but as soon as
you update the string, you have to update the tree, or whatever data
structure you chose.  Pain, pain, pain.
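
(A toy perl5 sketch of that caching, with made-up names, just to show
where the invalidation lands:)

    package CachedString;

    sub new {
        my ($class, $buf) = @_;
        return bless { buf => $buf, off => undef }, $class;
    }

    # One sequential O(n) scan records the byte offset of each character
    # (assuming well-formed UTF-8, so we only ever land on lead bytes).
    sub _scan {
        my $self = shift;
        my @off;
        my $pos = 0;
        while ($pos < length $self->{buf}) {
            push @off, $pos;
            my $lead = ord substr $self->{buf}, $pos, 1;
            $pos += $lead < 0x80 ? 1 : $lead < 0xE0 ? 2
                  : $lead < 0xF0 ? 3 : 4;
        }
        $self->{off} = \@off;
    }

    # Random access is cheap once the offsets are cached...
    sub byte_offset {
        my ($self, $char) = @_;
        $self->_scan unless $self->{off};
        return $self->{off}[$char];
    }

    # ...but any update throws the whole cache away. Pain, pain, pain.
    sub replace_bytes {
        my ($self, $pos, $len, $new) = @_;
        substr($self->{buf}, $pos, $len) = $new;
        $self->{off} = undef;
    }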

 -- 
 Calm down, it's *only* ones and zeroes.

I wish more people would keep this in mind.

-- 
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
# It is 'dead'. -- Jack Cohen



Re: string encoding

2001-02-15 Thread Simon Cozens

On Thu, Feb 15, 2001 at 03:59:54PM -0800, Hong Zhang wrote:
 The concept of a character has nothing to do with codepoints.
 Many characters are composed of more than one codepoint.

This isn't true.

-- 
* DrForr digs around for a fresh IV drip bag and proceeds to hook up.
dngor Coffee port.
DrForr Firewalled, like everything else around here.



Re: string encoding

2001-02-15 Thread Hong Zhang

 On Thu, Feb 15, 2001 at 03:59:54PM -0800, Hong Zhang wrote:
  The concept of a character has nothing to do with codepoints.
  Many characters are composed of more than one codepoint.
 
 This isn't true.

What do you mean? Have you seen people using multi-byte encoding
in Japan/China/Korea?

Hong




Re: Garbage collection (was Re: JWZ on s/Java/Perl/)

2001-02-15 Thread Ken Fox

Alan Burlison wrote:
 I think you'll find that both GC *and* reference counting scheme will
 require the heay use of mutexes in a MT program.

There are several concurrent GC algorithms that don't use
mutexes -- but they usually depend on read or write barriers
which may be really hard for us to implement. Making them run
well always requires help from the OS memory manager and that
would hurt portability. (If we don't have OS support it means
auditing everybody's XS code to make sure they use wrappers
with barrier checks on all writes. Yuck.)

- Ken



Re: string encoding

2001-02-15 Thread Hong Zhang

 ...and because of this you can't randomly access the string, you are
 reduced to sequential access (*).  And here I thought we could have
 left tape drives to the last millennium.
 
 (*) Yes, of course you could cache your sequential access so you only
 need to do it once, and build balanced trees and whatnot out of those
 offsets to have random access emulated in O(lg n) per lookup, but as soon as
 you update the string, you have to update the tree, or whatever data
 structure you chose.  Pain, pain, pain.

People in Japan/China/Korea have been using multi-byte encodings for a
long time. I personally have used them for more than 10 years. I never
felt much of the "pain". Do you think I am using my computer in O(n)
while you are using yours in O(1)? There are 100 million people using
variable-length encodings!!!

Take this example: in Chinese every character has the same width, so
it is very easy to format paragraphs and lines. Most English web pages
are rendered using "Times New Roman", which is a variable-width font.
Do you think the English pages are rendered in O(n) while the Chinese
pages are rendered in O(1)?

As I said, there are many problems harder than UTF-8. If you want
to support i18n and l10n, you have to live with them. If not, just
forget about the whole thing.

Hong




Re: Please shoot down this GC idea...

2001-02-15 Thread Ken Fox

Damien Neil wrote:
DN {
DN     my $fh = IO::File->new("file");
DN     do_stuff($fh);
DN }
DN
DN sub do_stuff { ... }

Simon Cozens wrote:
SC No, it can't, but it can certainly put a *test* for not having
SC references there.

Dan Sugalski wrote:
DS Yes it can tell, actually--we do have the full bytecode to the sub
DS available to us ...

Dataflow can tell you a lot, but the garbage collector can provide
info too. An object can never point to an object younger than itself.
If the stack is the youngest generation, then whenever something on the
stack gets stored in an older object the stack object ages.

If we still have a young $fh when do_stuff() returns, then the object is
safe to collect as long as we know that the scope owning $fh isn't
returning it. We don't need dataflow for the functions we call; we just
need dataflow for the current scope. (We also have to run a normal
traversing collection on the stack -- it isn't good enough to just
$fh->DESTROY because $fh might be pointed to by another stack object.)
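
A toy model of the aging rule (the names and the %gen bookkeeping are
made up; a real collector would do this inside its write barrier):

    my %gen;   # stringified ref => generation (0 = stack, youngest)

    sub store_into {
        my ($container, $slot, $obj) = @_;
        my $cg = $gen{$container} || 0;
        my $og = $gen{$obj}       || 0;
        # an object may never point at a younger object, so storing
        # $obj into an older container ages $obj up to match
        $gen{$obj} = $cg if $cg > $og;
        $container->{$slot} = $obj;
    }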

By the way, this is also a way to make finalizers useful most of the
time. We can collect (which means finalize) portions of the youngest
generation at the end of every scope. The only time you'd get hit with
a non-deterministic finalizer is if you ever saved the object in an
old generation.

By the way, a lot of people are confusing a PMC object with a Blessed
Perl Object. To the perl internals everything is an object with a vtbl.
Only some of those objects will be Blessed Perl Objects.

- Ken



Re: PDD 2: sample add()

2001-02-15 Thread Ken Fox

David Mitchell wrote:
 To get my head round PDD 2, I've just written the outline
 for the body of the add() method for a hypothetical integer PMC class:

[... lots of complex code ...]

I think this example is a good reason to consider having only
one-argument math ops. Instead of dst->add(arg1, arg2) why not just
have dst->add(arg)? Then the PVM can generate code that does the right
thing considering the types of all values in an expression.

It doesn't affect the ability to overload at all -- we just move
overloading to an earlier stage of compilation, i.e. before we emit
PVM instructions.

Examples:

  Perl code:  $x = 1 + 2
  Parse tree: op_assign($x, op_add(1, 2))
  PVM code:   $x = 1; $x += 2

  Perl code:  $x = 1 + 2 + 3
  Parse tree: op_assign($x, op_add(op_add(1, 2), 3))
  PVM code:   new $t; $t = 1; $t += 2; $x = $t; $t += 3

It will be more work for the optimizer, but I think it will produce
much more understandable PMC objects.

- Ken



Re: Garbage collection (was Re: JWZ on s/Java/Perl/)

2001-02-15 Thread Ken Fox

Hong Zhang wrote:
 The memory barriers are always needed on SMP, whatever algorithm
 we are using.

I was just pointing out that barriers are an alternative to mutexes.
Ref count certainly would use mutexes instead of barriers.

 The memory barrier can be easily coded in assembly, or intrinsic
 function, such as __MB() on Alpha.

Perl ain't Java! We have to worry about XS code written in plain
old scary C. If we see some really amazing performance improvements
then I could imagine going with barriers, but I'm doubtful about
their portability and fragility.

Hmm. I just remembered the other GC technique that is very
fragile: ref counts. Maybe fragility isn't a problem after all. ;)

- Ken



Re: Garbage collection (was Re: JWZ on s/Java/Perl/)

2001-02-15 Thread Dan Sugalski

At 02:08 PM 2/15/2001 -0800, Hong Zhang wrote:
  Hong Zhang wrote:
 
   This code should NEVER work, period. People will just ask for trouble
   with this kind of code.
 
  Actually I meant to have specified ">>" as the mode, i.e. append, then
  what I originally said holds true.  This behaviour is predictable and
  dependable in the current perl implementation.  Without the ">>" the file
  will contain just "bar\n".

That was not what I meant. Your code already assumes the existence of
reference counting. It does not work well with any other kind of garbage
collection. If you translate the same code into C without putting in
the close(), the code will not work at all.

People are getting garbage collection and perl's "object going out of 
scope" behaviour confused. This is starting to annoy me.

Refcounts are not in any way required for perl's leaving scope behaviour. 
They're a convenient way to implement it, but it isn't the only way, and it 
isn't necessarily the best, either.

By the way, in order to use perl on real native-thread systems, we have
to use atomic operations to increment/decrement the reference count.

Only for shared variables. And an atomic operation is rather a fuzzy thing 
anyway. (With POSIX thread support, we can build some darned big atoms) We 
certainly aren't forced to use single machine instructions to do this.


Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: Garbage collection (was Re: JWZ on s/Java/Perl/)

2001-02-15 Thread Dan Sugalski

At 09:13 PM 2/15/2001 -0500, Ken Fox wrote:
Hong Zhang wrote:
  The memory barriers are always needed on SMP, whatever algorithm
  we are using.

I was just pointing out that barriers are an alternative to mutexes.
Ref count certainly would use mutexes instead of barriers.

Not really, they aren't. Barriers are an intrinsic part of most mutexes. 
POSIX ones at least, by definition. Pretty much everyone else's mutexes as 
well.

  The memory barrier can be easily coded in assembly, or intrinsic
  function, such as __MB() on Alpha.

Perl ain't Java! We have to worry about XS code written in plain
old scary C. If we see some really amazing performance improvements
then I could imagine going with barriers, but I'm doubtful about
their portability and fragility.

To some extent extensions are going to be on their own with respect to 
threads, and there's nothing we can do about that. (No matter how hard we 
try, we can't make Oracle's OCI interface do all our Weird Magic Stuff 
automatically) Writing threaded extension code shouldn't be that hard in 
the common case, so it'll be in our best interest to help, but there will 
be limits.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk




Re: string encoding

2001-02-15 Thread Dan Sugalski

At 05:09 PM 2/15/2001 -0800, Hong Zhang wrote:
  ...and because of this you can't randomly access the string, you are
  reduced to sequential access (*).  And here I thought we could have
  left tape drives to the last millennium.
 
  (*) Yes, of course you could cache your sequential access so you only
  need to do it once, and build balanced trees and whatnot out of those
  offsets to have random access emulated in O(lg n) per lookup, but as soon as
  you update the string, you have to update the tree, or whatever data
  structure you chose.  Pain, pain, pain.

People in Japan/China/Korea have been using multi-byte encodings for a
long time. I personally have used them for more than 10 years. I never
felt much of the "pain". Do you think I am using my computer in O(n)
while you are using yours in O(1)? There are 100 million people using
variable-length encodings!!!

Not at this level they aren't. The people actually writing the code do feel 
the pain, and you do pay a computational price. You can't *not* pay the price.

   substr($foo, 233253, 14)

is going to cost significantly more with variable-sized characters than
fixed-sized ones.
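
To make that concrete (a sketch with made-up helpers, assuming
well-formed UTF-8):

    # Fixed-width (UTF-32-ish): locating the substring is one multiply.
    sub substr_fixed {
        my ($buf, $char_pos, $char_len) = @_;
        return substr($buf, 4 * $char_pos, 4 * $char_len);    # O(1)
    }

    # Variable-width (UTF-8): we must walk 233253 characters first.
    sub substr_utf8 {
        my ($buf, $char_pos, $char_len) = @_;
        my $start = walk($buf, 0, $char_pos);                 # O(n)
        my $end   = walk($buf, $start, $char_len);
        return substr($buf, $start, $end - $start);
    }

    sub walk {   # advance $chars characters from byte offset $pos
        my ($buf, $pos, $chars) = @_;
        while ($chars-- > 0) {
            my $lead = ord substr $buf, $pos, 1;
            $pos += $lead < 0x80 ? 1 : $lead < 0xE0 ? 2
                  : $lead < 0xF0 ? 3 : 4;
        }
        return $pos;
    }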

Take this example: in Chinese every character has the same width, so
it is very easy to format paragraphs and lines. Most English web pages
are rendered using "Times New Roman", which is a variable-width font.
Do you think the English pages are rendered in O(n) while the Chinese
pages are rendered in O(1)?

You need a better example, since that one's rather muddy. It's a matter of 
characters per word, not pixels per character. But generally speaking, 
Chinese pages will be rendered with less computational cost associated with 
the layout than pages with variable-width characters.

As I said, there are many problems harder than UTF-8. If you want
to support i18n and l10n, you have to live with them.

No, we don't. We do *not* have to live with it at all. That UTF-8 is a 
variable-length representation is an implementation detail, and one we are 
not required to live with internally. If UTF-16 (which is also variable 
width, annoyingly) or UTF-32 (which doesn't officially exist as far as I 
can tell, but we can define by fiat) is better for us, then great. They're 
all just different ways of representing Unicode abstract characters. (I 
think--I'm only up to chapter 3 of the unicode 3.0 book)

Besides, I think you're arguing a completely different point, and I think 
it's been missed generally. Where we're going to get bit hard, and I can't 
see a way around, is combining characters. The individual Unicode abstract 
characters can have a fixed-width representation, but the number of Unicode 
characters per 'real' character is variable, and I can't see any way around 
that. (It looks like it's legal to stack four or six modifier characters on 
a base character, and I don't think I'm willing to go so far as to use 
UTF-128 internally. That's a touch much, even for me... :) Then there also 
seems to be metadata embedded in the Unicode standard--stuff like the 
bidirectional ordering and alternate formatting characters. Bleah.

It looks like, for us to do Unicode properly with all the world's 
languages, we might have to have a tagged text format like we've been 
talking about for other things (XML and suchlike stuff). And I'm not 
anywhere near sure what we should do for substitutions. If you have the 
sequence:

   LATIN SMALL LETTER A, COMBINING TILDE

and do a s/a/b/, should you then have

   LATIN SMALL LETTER B, COMBINING TILDE

and if not, if you do a s/SMALL LETTER A WITH TILDE/q/ on the sequence, 
should you end up with

   LATIN SMALL LETTER Q

or not? The original sequence was two separate characters and the match was 
one, but they are really the same thing.
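
In perl5-ish notation (a sketch; \x{0303} is COMBINING TILDE and
\x{00E3} the precomposed a-tilde):

    my $s = "a\x{0303}";   # LATIN SMALL LETTER A + COMBINING TILDE
    $s =~ s/a/b/;          # matches the bare base character
    # $s is now "b\x{0303}": LATIN SMALL LETTER B, COMBINING TILDE

    my $t = "a\x{0303}";
    $t =~ s/\x{00E3}/q/;   # precomposed a-tilde: does NOT match here
    # under raw codepoint semantics the two-codepoint sequence and the
    # precomposed character are different strings until normalized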

Unicode is making my head hurt. I do *not* have an appropriate language 
background to feel comfortable with this, even for the pieces that are 
relevant to the languages I have any familiarity with.

Dan

--"it's like this"---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
  teddy bears get drunk