Re: Garbage collection (was Re: JWZ on s/Java/Perl/)
Hong Zhang wrote:
> > A deterministic finalization means we shouldn't need to force programmers to have good ideas. Make it easy, remember? :)
> I don't believe such an algorithm exists, unless you stick with reference count.

Either it doesn't exist, or it is more expensive than refcounting. I guess we have to choose between deterministic finalization and not using refcounting as GC, because both together surely don't exist. And don't forget that if we stick with refcounting, we should try to find a way to break circular references, too.

- Branden
Re: Garbage collection (was Re: JWZ on s/Java/Perl/)
On Thu, Feb 15, 2001 at 08:21:03AM -0300, Branden wrote:
> Hong Zhang wrote:
> > > A deterministic finalization means we shouldn't need to force programmers to have good ideas. Make it easy, remember? :)
> > I don't believe such an algorithm exists, unless you stick with reference count.
> Either it doesn't exist, or it is more expensive than refcounting. I guess we have to choose between deterministic finalization and not using refcounting as GC, because both together surely don't exist. And don't forget that if we stick with refcounting, we should try to find a way to break circular references, too.

As a part of that, the weak reference concept, bolted recently into perl5, could be made more central in perl6. Around 92.769% of the time circular references are known to be circular by the code that creates them (like a 'handy' ref back to a parent node). Having a weakref, or similar, operator in the language would help greatly.

Tim.
Re: Garbage collection (was Re: JWZ on s/Java/Perl/)
Tim Bunce wrote:
> On Thu, Feb 15, 2001 at 08:21:03AM -0300, Branden wrote:
> > And don't forget that if we stick with refcounting, we should try to find a way to break circular references, too.
> As a part of that, the weak reference concept, bolted recently into perl5, could be made more central in perl6. Around 92.769% of the time circular references are known to be circular by the code that creates them (like a 'handy' ref back to a parent node). Having a weakref, or similar, operator in the language would help greatly.

Do weakrefs really work in Perl 5? I know the refcount isn't incremented when a weakref is created, but isn't it decremented (trying to free the object) when the weakref goes out of scope? What happens if the object goes out of scope before the variable holding the weakref does?

Weakrefs are probably enough to break circular references in 99% of the cases. But we must make sure they work properly! And also that bugs while using them don't dump core -- at most throw exceptions.

- Branden
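The semantics Branden is asking about can be illustrated with Python's weakref module, which behaves the way a safe weakref should: creating one doesn't bump the refcount, and dereferencing after the target has been freed yields a defined null value rather than a core dump. A sketch (Python used purely to illustrate the semantics, not Perl's implementation; the Node class and its parent backref are invented for the example):

```python
import weakref

class Node:
    """A toy tree node; a strong 'parent' backref would create a cycle."""
    def __init__(self, name, parent=None):
        self.name = name
        self.children = []
        if parent is not None:
            # Weak backref: the child does not keep the parent alive,
            # so parent<->child is no longer a refcount cycle.
            self.parent = weakref.ref(parent)
            parent.children.append(self)
        else:
            self.parent = None

root = Node("root")
child = Node("child", parent=root)

assert child.parent() is root   # dereferences to the live target

del root                        # drop the only strong reference
# Under refcounting (CPython) the parent is freed immediately; the
# stale weakref now dereferences to None instead of crashing.
assert child.parent() is None
```

This answers the "object dies before the weakref" case: the weakref outlives the object safely, it just starts returning the null value.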
Re: Garbage collection (was Re: JWZ on s/Java/Perl/)
Damien Neil wrote:
> On Thu, Feb 15, 2001 at 08:07:39AM -0300, Branden wrote:
> > I think you just said all about why we shouldn't bother giving objects deterministic finalization, and I agree with you. If we explicitly want to free resources (files, database connections), then we explicitly call close. Otherwise, it will be called when DESTROY is eventually called.
> No, the question of whether Perl 6 wants deterministic finalization or not is a separate one. If it doesn't have it, we will be losing a very common Perl idiom:
>
>     {
>         my $fh = IO::File->new("file");
>         print $fh $data;
>     }

Re-read what you wrote in http:[EMAIL PROTECTED]/msg02468.html. I think you've got to decide what you want. Do you want smart GC (without deterministic finalization) and to free resources explicitly in special cases? Or do you want to keep the common Perl idioms (which probably means ref-counting)? I would say I probably prefer refcounting (with some kind of algorithm for breaking circular references), because I see the advantages it brings as paying its price.

> It's nice to know that when the above block exits, $fh will be closed. Remember that "closed" doesn't just refer to freeing the resources associated with it -- it also includes flushing buffers and the like.

Just set autoflush, if you're lazy...

> Without deterministic finalization, you will almost always want to write the above to include an explicit $fh->close().

Exactly.

> The problem is that you not only can't count on $fh's DESTROY being called at the end of the block, you often can't count on it ever happening.

Anyway, the file would be flushed and closed...

> Consider the case where the interpreter dies on a signal, for example -- DESTROY methods will quite possibly not be called.

Actually, I think this can be worked around. Can't it catch signals?

> I'm not certain that Perl should lose deterministic finalization. On the other hand, I really wish that Perl had a more modern GC scheme, if only so that circular structures could be properly collected.

Agree.
> In the end, however, I don't think that any of our opinions will decide this -- either Dan's forthcoming PDD will show how Perl 6 can have its cake and eat it too, or Larry will decide.
> - Damien

OK for me.

- Branden
Re: Garbage collection (was Re: JWZ on s/Java/Perl/)
Branden wrote:
> Just set autoflush, if you're lazy...

And say goodbye to performance...

> > The problem is that you not only can't count on $fh's DESTROY being called at the end of the block, you often can't count on it ever happening.
> Anyway, the file would be flushed and closed...

That's not sufficient. Without deterministic finalisation, what does the following do?

    {
        my $fh = IO::File->new("file");
        print $fh "foo\n";
    }
    {
        my $fh = IO::File->new("file");
        print $fh "bar\n";
    }

At present "file" will contain "foo\nbar\n". Without DF it could just as well be "bar\nfoo\n". Make no mistake, this is a major change to the semantics of perl.

Alan Burlison
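Alan's point can be demonstrated outside Perl. The sketch below (Python, chosen only because its file objects buffer writes the same way) opens the same file in append mode -- which Alan clarifies downthread is the mode he meant -- from two handles, and shows that the on-disk order is decided by when each handle flushes its buffer, i.e. by finalization order:

```python
import os
import tempfile

fd, path = tempfile.mkstemp()  # scratch file for the demo
os.close(fd)

# Deterministic finalization: each handle is closed (and flushed)
# before the next one opens, so the writes land in program order.
with open(path, "a") as fh:
    fh.write("foo\n")
with open(path, "a") as fh:
    fh.write("bar\n")
assert open(path).read() == "foo\nbar\n"

# Non-deterministic finalization simulated: both handles stay open,
# both writes sit in userspace buffers, and whichever handle is
# closed *last* flushes last.
os.remove(path)
a = open(path, "a"); a.write("foo\n")
b = open(path, "a"); b.write("bar\n")
b.close()   # "bar\n" reaches the file first
a.close()   # "foo\n" is appended afterwards
assert open(path).read() == "bar\nfoo\n"
os.remove(path)
```

Same two writes, opposite file contents -- exactly the semantic change Alan is warning about.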
Re: Garbage collection (was Re: JWZ on s/Java/Perl/)
> {
>     my $fh = IO::File->new("file");
>     print $fh "foo\n";
> }
> {
>     my $fh = IO::File->new("file");
>     print $fh "bar\n";
> }
>
> At present "file" will contain "foo\nbar\n". Without DF it could just as well be "bar\nfoo\n". Make no mistake, this is a major change to the semantics of perl.
>
> Alan Burlison

This code should NEVER work, period. People are just asking for trouble with this kind of code. DF never fully exists, even with reference counting. Can anyone show me how to deterministically collect a circular reference? The current semantics of perl work most of the time, but not always. What we are really talking about is "Shall Perl provide 90% or 99% of DF?" The operating system provides 0% during runtime, and 100% at process exit.

Hong
Re: Garbage collection (was Re: JWZ on s/Java/Perl/)
Hong Zhang wrote:
> This code should NEVER work, period. People will just ask for trouble with this kind of code.

Actually I meant to have specified ">>" as the mode, i.e. append; then what I originally said holds true. This behaviour is predictable and dependable in the current perl implementation. Without the ">>" the file will contain just "bar\n".

The point is that we have a stated goal of preserving the existing semantics, and of allowing existing perl5 code to continue to work. Despite what some people seem to think, this is *not* a clean-slate situation. We may well have to deliberately carry over questionable but depended-upon behaviour into perl6.

    my $fh = do { local *FH; *FH; };

for example, had better continue to work.

Alan Burlison
Re: Garbage collection (was Re: JWZ on s/Java/Perl/)
> Hong Zhang wrote:
> > This code should NEVER work, period. People will just ask for trouble with this kind of code.
> Actually I meant to have specified ">>" as the mode, i.e. append; then what I originally said holds true. This behaviour is predictable and dependable in the current perl implementation. Without the ">>" the file will contain just "bar\n".

That was not what I meant. Your code already assumes the existence of reference counting. It does not work well with any other kind of garbage collection. If you translate the same code into C without putting in the close(), the code will not work at all.

By the way, in order to use perl on real native-thread systems, we have to use atomic operations to increment/decrement the reference count. On most systems I have measured (PC and SPARC), any atomic operation takes about 0.1-0.3 microseconds, and it will be even worse on large SMP machines. The latest garbage collection algorithms (parallel and concurrent) can handle large memories pretty well. The cost is less DF.

Hong
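The cost Hong is measuring is per incref/decref: in a threaded runtime every refcount update must be atomic or counts get lost. Python can't show the machine-level atomic instruction, but the invariant can be sketched -- here a mutex stands in for the `atomic_fetch_add` a C implementation might use, and all names are invented for the illustration:

```python
import threading

class RefCounted:
    """Toy refcounted object. The lock makes incref/decref atomic;
    paying for that lock (or an atomic instruction) on *every*
    reference operation is the overhead Hong is describing."""
    def __init__(self):
        self._count = 1
        self._lock = threading.Lock()
        self.destroyed = False

    def incref(self):
        with self._lock:
            self._count += 1

    def decref(self):
        with self._lock:
            self._count -= 1
            if self._count == 0:
                self.destroyed = True   # stand-in for free() + DESTROY

obj = RefCounted()

def worker():
    # Each thread repeatedly takes and drops a reference.
    for _ in range(10000):
        obj.incref()
        obj.decref()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

obj.decref()            # drop the original reference
assert obj.destroyed    # exactly one thread saw the count reach zero
```

With the lock removed (and without CPython's own interpreter lock masking the race), concurrent `_count += 1` updates could interleave and the object would be freed too early or never -- which is why the per-operation atomicity is not optional.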
Re: Garbage collection (was Re: JWZ on s/Java/Perl/)
Hong Zhang wrote:
> That was not what I meant. Your code already assumes the existence of reference counting. It does not work well with any other kind of garbage collection. If you translate the same code into C without putting in the close(), the code will not work at all.

Wrong, it does *not* assume any such thing. It assumes that when a filehandle goes out of scope it is closed. How that is achieved is a detail of the implementation, and could be done in a number of ways. It could just as well be done by keeping the filehandle on a stack which is cleared when the scope exits. C++ does this for local variables without requiring a refcount.

> By the way, in order to use perl on real native-thread systems, we have to use atomic operations to increment/decrement the reference count. On most systems I have measured (PC and SPARC), any atomic operation takes about 0.1-0.3 microseconds, and it will be even worse on large SMP machines. The latest garbage collection algorithms (parallel and concurrent) can handle large memories pretty well. The cost is less DF.

I think you'll find that both GC *and* reference counting schemes will require the heavy use of mutexes in an MT program.

Alan Burlison
string encoding
Hi, All,

I want to give some of my thoughts about string encoding. Personally I like the UTF-8 encoding. The variable-length problem can be handled by a special (virtual) function like

    class String {
        virtual UV iterate(/* inout */ int* index);
    };

So a typical string iteration will look like

    for (i = 0; i < size;) {
        UV ch = s->iterate(&i);
        /* do what u want */
    }

instead of

    for (i = 0; i < size; i++) {
        uint32 ch = s->charAt(i);
        /* be my guest */
    }

The new style will be strange, but not very difficult to use. It also hides the internal representation.

The UTF-32 suggestion is largely ignorant of internationalization. Many user characters are composed of more than one Unicode code point. If you consider Unicode normalization, canonical forms, hangul conjoining, Indic clusters, combining characters, virama, collation, and locales, UTF-32 will not help you much, if at all.

Hong
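Hong's iterate() interface can be made concrete. The function below (a Python sketch of his hypothetical C++ `iterate`; no error handling, valid UTF-8 assumed) decodes one codepoint from a UTF-8 byte string at a byte offset and returns the codepoint plus the *next* offset -- which is exactly why the loop cannot be a simple i++:

```python
def iterate(buf: bytes, i: int):
    """Decode the UTF-8 sequence starting at byte offset i.
    Returns (codepoint, next_offset) -- the variable-length step."""
    b = buf[i]
    if b < 0x80:                      # 1-byte sequence (ASCII)
        return b, i + 1
    elif b < 0xE0:                    # 2-byte sequence: 110xxxxx
        n, cp = 2, b & 0x1F
    elif b < 0xF0:                    # 3-byte sequence: 1110xxxx
        n, cp = 3, b & 0x0F
    else:                             # 4-byte sequence: 11110xxx
        n, cp = 4, b & 0x07
    for k in range(1, n):             # continuation bytes: 10xxxxxx
        cp = (cp << 6) | (buf[i + k] & 0x3F)
    return cp, i + n

s = "a\u00e9\u4e2d\U0001d11e".encode("utf-8")  # 1-, 2-, 3- and 4-byte chars
i, cps = 0, []
while i < len(s):                     # Hong's loop shape: no i++
    cp, i = iterate(s, i)
    cps.append(cp)
assert cps == [0x61, 0xE9, 0x4E2D, 0x1D11E]
```

The loop works without the caller ever knowing the storage is variable-width, which is Hong's point about hiding the internal representation.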
Adoption ??: Rare Salt-Water Camel May Be Separate Species
http://news.bbc.co.uk/hi/english/sci/tech/newsid_1156000/1156212.stm This nuclear/dynamite stuff is making me sad. Wanna contribute in the name of perl ?? Lets see... China + UN = $perl_revenue
Re: string encoding
On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote: Personally I like the UTF-8 encoding. The solution to the variable length can be handled by a special (virtual) function like I'm expecting that the virtual, internal representation will not be in a UTF but will simply be an array of codepoints. Manipulating UTF8 internally is horrible because it's a variable length encoding, so you need to keep track of where you are both in terms of characters and bytes. Yuck, yuck, yuck. -- Calm down, it's *only* ones and zeroes.
Re: string encoding
On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote:
> > Personally I like the UTF-8 encoding. The solution to the variable length can be handled by a special (virtual) function like
> I'm expecting that the virtual, internal representation will not be in a UTF but will simply be an array of codepoints. Manipulating UTF8 internally is horrible because it's a variable length encoding, so you need to keep track of where you are both in terms of characters and bytes. Yuck, yuck, yuck.

I am not sure you have read through my email. The concept of characters has nothing to do with codepoints. Many characters are composed of more than one codepoint. The concept of character position is completely useless in many languages. Many languages just don't have the English-style "character"; see collation, hangul conjoining, combining characters. There is just no easy way to keep track of character position. What you really meant was probably the codepoint position. The codepoint position is largely internal to the library. As long as regular expressions can efficiently handle UTF-8 (as they do now), most people will feel just fine with it. There are just not many people interested in the codepoint position, if they have ever heard of it. They care more about m// or s///. Even if you want to keep track of character offsets, it is still much easier than many other Unicode features I mentioned.

Hong
Re: string encoding
On Thu, Feb 15, 2001 at 11:16:29PM +, Simon Cozens wrote: On Thu, Feb 15, 2001 at 02:31:03PM -0800, Hong Zhang wrote: Personally I like the UTF-8 encoding. The solution to the variable length can be handled by a special (virtual) function like I'm expecting that the virtual, internal representation will not be in a UTF but will simply be an array of codepoints. Manipulating UTF8 internally is horrible because it's a variable length encoding, so you need to keep track of where you are both in terms of characters and bytes. Yuck, yuck, yuck. ...and because of this you can't randomly access the string, you are reduced to sequential access (*). And here I thought we could have left tape drives to the last millennium. (*) Yes, of course you could cache your sequential access so you only need to do it once, and build balanced trees and whatnot out of those offsets to have random access emulated in O(n lg n), but as soon as you update the string, you have to update the tree, or whatever data structure you chose. Pain, pain, pain. -- Calm down, it's *only* ones and zeroes. I wish more people would keep this in mind. -- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen
Re: string encoding
On Thu, Feb 15, 2001 at 03:59:54PM -0800, Hong Zhang wrote: The concept of characters have nothing to do with codepoints. Many characters are composed by more than one codepoints. This isn't true. -- * DrForr digs around for a fresh IV drip bag and proceeds to hook up. dngor Coffee port. DrForr Firewalled, like everything else around here.
Re: string encoding
On Thu, Feb 15, 2001 at 03:59:54PM -0800, Hong Zhang wrote: The concept of characters have nothing to do with codepoints. Many characters are composed by more than one codepoints. This isn't true. What do you mean? Have you seen people using multi-byte encoding in Japan/China/Korea? Hong
Re: Garbage collection (was Re: JWZ on s/Java/Perl/)
Alan Burlison wrote: I think you'll find that both GC *and* reference counting scheme will require the heay use of mutexes in a MT program. There are several concurrent GC algorithms that don't use mutexes -- but they usually depend on read or write barriers which may be really hard for us to implement. Making them run well always requires help from the OS memory manager and that would hurt portability. (If we don't have OS support it means auditing everybody's XS code to make sure they use wrappers with barrier checks on all writes. Yuck.) - Ken
Re: string encoding
> ...and because of this you can't randomly access the string, you are reduced to sequential access (*). And here I thought we could have left tape drives to the last millennium.
>
> (*) Yes, of course you could cache your sequential access so you only need to do it once, and build balanced trees and whatnot out of those offsets to have random access emulated in O(n lg n), but as soon as you update the string, you have to update the tree, or whatever data structure you chose. Pain, pain, pain.

People in Japan/China/Korea have been using multi-byte encodings for a long time. I personally have used them for more than 10 years, and I never felt much of the "pain". Do you think I am using my computer in O(n) while you are using yours in O(1)? There are 100 million people using variable-length encodings!!!

Take this example: in Chinese every character has the same width, so it is very easy to format paragraphs and lines. Most English web pages are rendered using "Times New Roman", which is a variable-width font. Do you think the English pages are rendered in O(n) while the Chinese pages are rendered in O(1)?

As I said, there are many harder problems than UTF-8. If you want to support i18n and l10n, you have to live with it. If not, just forget about the whole thing.

Hong
Re: Please shoot down this GC idea...
Damien Neil wrote:
DN> {
DN>     my $fh = IO::File->new("file");
DN>     do_stuff($fh);
DN> }
DN>
DN> sub do_stuff { ... }

Simon Cozens wrote:
SC> No, it can't, but it can certainly put a *test* for not having
SC> references there.

Dan Sugalski wrote:
DS> Yes it can tell, actually--we do have the full bytecode to the sub
DS> available to us ...

Dataflow can tell you a lot, but the garbage collector can provide info too. An object can never point to an object younger than itself. If the stack is the youngest generation, then whenever something on the stack gets stored in an older object, the stack object ages. If we still have a young $fh when do_stuff() returns, then the object is safe to collect as long as we know that the scope owning $fh isn't returning it. We don't need dataflow for the functions we call; we just need dataflow for the current scope. (We also have to run a normal traversing collection on the stack -- it isn't good enough to just $fh->DESTROY, because $fh might be pointed to by another stack object.)

By the way, this is also a way to make finalizers useful most of the time. We can collect (which means finalize) portions of the youngest generation at the end of every scope. The only time you'd get hit with a non-deterministic finalizer is if you ever saved the object in an old generation.

By the way, a lot of people are confusing a PMC object with a Blessed Perl Object. To the perl internals everything is an object with a vtbl. Only some of those objects will be Blessed Perl Objects.

- Ken
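Ken's scheme -- collect (and finalize) the un-escaped part of the youngest generation at every scope exit, deferring finalization only for objects saved into an older generation -- can be sketched with a hypothetical Scope helper. Every name here is invented for the illustration; this is the shape of the idea, not Parrot code:

```python
class Obj:
    """Toy collectable object with an observable finalizer."""
    def __init__(self, name):
        self.name = name
        self.finalized = False

    def DESTROY(self):
        self.finalized = True

class Scope:
    """Tracks objects allocated in this scope. At exit, finalizes any
    object that did not escape into an older generation."""
    def __init__(self):
        self.young = []
        self.escaped = set()

    def alloc(self, name):
        o = Obj(name)
        self.young.append(o)   # newborn objects live in the youngest gen
        return o

    def store_in_older(self, o):
        # Storing into an older object "ages" o: it must survive the scope,
        # so its finalizer becomes non-deterministic.
        self.escaped.add(id(o))

    def exit(self):
        for o in self.young:
            if id(o) not in self.escaped:
                o.DESTROY()    # scope-exit finalization, Ken's common case

scope = Scope()
fh = scope.alloc("fh")          # like: my $fh = IO::File->new("file")
kept = scope.alloc("kept")
scope.store_in_older(kept)      # saved into an old generation
scope.exit()

assert fh.finalized             # young and un-escaped: finalized at scope exit
assert not kept.finalized       # escaped: finalization deferred
```

The real version would of course need the traversing collection Ken mentions (an object can be kept alive by another stack object), but the escape set plays the same role as his "object ages when stored in something older" rule.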
Re: PDD 2: sample add()
David Mitchell wrote:
> To get my head round PDD 2, I've just written the outline for the body of the add() method for a hypothetical integer PMC class: [... lots of complex code ...]

I think this example is a good reason to consider having only one-argument math ops. Instead of dst->add(arg1, arg2), why not just have dst->add(arg)? Then the PVM can generate code that does the right thing considering the types of all values in an expression. It doesn't affect the ability to overload at all -- we just move overloading to an earlier stage of compilation, i.e. before we emit PVM instructions. Examples:

    Perl code:  $x = 1 + 2
    Parse tree: op_assign($x, op_add(1, 2))
    PVM code:   $x = 1; $x += 2

    Perl code:  $x = 1 + 2 + 3
    Parse tree: op_assign($x, op_add(op_add(1, 2), 3))
    PVM code:   new $t; $t = 1; $t += 2; $t += 3; $x = $t

It will be more work for the optimizer, but I think it will produce much more understandable PMC objects.

- Ken
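Ken's transformation -- flattening a nested op_add tree into a sequence of destination-updating, one-argument instructions -- can be sketched as a tiny tree-to-instruction compiler. The opcode names and tuple encoding are invented for the illustration, and only the constant-operand case from his examples is handled:

```python
def compile_add_chain(dst, tree, code):
    """Compile op_assign(dst, tree), where tree is either ('const', n)
    or ('add', left, ('const', n)), into set/inc instructions."""
    if tree[0] == "const":
        code.append(("set", dst, tree[1]))     # dst = n
    else:                                      # ("add", left, ("const", n))
        compile_add_chain(dst, tree[1], code)  # evaluate left side into dst
        code.append(("inc", dst, tree[2][1]))  # dst += n  (one-argument add)

def run(code):
    """Trivial interpreter for the generated instructions."""
    regs = {}
    for op, dst, n in code:
        regs[dst] = n if op == "set" else regs[dst] + n
    return regs

# $x = 1 + 2 + 3  ==>  parse tree op_assign($x, op_add(op_add(1, 2), 3))
tree = ("add", ("add", ("const", 1), ("const", 2)), ("const", 3))
code = []
compile_add_chain("$x", tree, code)

assert code == [("set", "$x", 1), ("inc", "$x", 2), ("inc", "$x", 3)]
assert run(code)["$x"] == 6
```

Overloading would slot in where the `inc` opcode is chosen: the compiler, knowing the operand types, picks the right one-argument op before any PVM instruction is emitted.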
Re: Garbage collection (was Re: JWZ on s/Java/Perl/)
Hong Zhang wrote:
> The memory barriers are always needed on SMP, whatever algorithm we are using.

I was just pointing out that barriers are an alternative to mutexes. Ref count certainly would use mutexes instead of barriers.

> The memory barrier can be easily coded in assembly, or as an intrinsic function, such as __MB() on Alpha.

Perl ain't Java! We have to worry about XS code written in plain old scary C. If we see some really amazing performance improvements then I could imagine going with barriers, but I'm doubtful about their portability and fragility.

Hmm. I just remembered the other GC technique that is very fragile: ref counts. Maybe fragility isn't a problem after all. ;)

- Ken
Re: Garbage collection (was Re: JWZ on s/Java/Perl/)
At 02:08 PM 2/15/2001 -0800, Hong Zhang wrote:
> > > This code should NEVER work, period. People will just ask for trouble with this kind of code.
> > Actually I meant to have specified ">>" as the mode, i.e. append; then what I originally said holds true. This behaviour is predictable and dependable in the current perl implementation. Without the ">>" the file will contain just "bar\n".
> That was not what I meant. Your code already assumes the existence of reference counting. It does not work well with any other kind of garbage collection. If you translate the same code into C without putting in the close(), the code will not work at all.

People are getting garbage collection and perl's "object going out of scope" behaviour confused. This is starting to annoy me. Refcounts are not in any way required for perl's leaving-scope behaviour. They're a convenient way to implement it, but it isn't the only way, and it isn't necessarily the best, either.

> By the way, in order to use perl in real native thread systems, we have to use atomic operations for increment/decrement of the reference count.

Only for shared variables. And an atomic operation is rather a fuzzy thing anyway. (With POSIX thread support, we can build some darned big atoms.) We certainly aren't forced to use single machine instructions to do this.

Dan

--"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Garbage collection (was Re: JWZ on s/Java/Perl/)
At 09:13 PM 2/15/2001 -0500, Ken Fox wrote:
> Hong Zhang wrote:
> > The memory barriers are always needed on SMP, whatever algorithm we are using.
> I was just pointing out that barriers are an alternative to mutexes. Ref count certainly would use mutexes instead of barriers.

Not really they aren't. Barriers are an intrinsic part of most mutexes -- POSIX ones at least, by definition. Pretty much everyone else's mutexes as well.

> > The memory barrier can be easily coded in assembly, or as an intrinsic function, such as __MB() on Alpha.
> Perl ain't Java! We have to worry about XS code written in plain old scary C. If we see some really amazing performance improvements then I could imagine going with barriers, but I'm doubtful about their portability and fragility.

To some extent extensions are going to be on their own with respect to threads, and there's nothing we can do about that. (No matter how hard we try, we can't make Oracle's OCI interface do all our Weird Magic Stuff automatically.) Writing threaded extension code shouldn't be that hard in the common case, so it'll be in our best interest to help, but there will be limits.

Dan

--"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: string encoding
At 05:09 PM 2/15/2001 -0800, Hong Zhang wrote:
> > ...and because of this you can't randomly access the string, you are reduced to sequential access (*). And here I thought we could have left tape drives to the last millennium.
> >
> > (*) Yes, of course you could cache your sequential access so you only need to do it once, and build balanced trees and whatnot out of those offsets to have random access emulated in O(n lg n), but as soon as you update the string, you have to update the tree, or whatever data structure you chose. Pain, pain, pain.
> People in Japan/China/Korea have been using multi-byte encoding for long time. I personally have used it for more 10 years. I never feel much of the "pain". Do you think I are using my computer with O(n) while you are using it with O(1)? There are 100 million people using variable-length encoding!!!

Not at this level they aren't. The people actually writing the code do feel the pain, and you do pay a computational price. You can't *not* pay the price. substr($foo, 233253, 14) is going to cost significantly more with variable-sized characters than fixed-sized ones.

> Take this example, in Chinese every character has the same width, so it is very easy to format paragraphs and lines. Most English web pages are rendered using "Times New Roman", which is a variable-width font. Do you think the English pages are rendered O(n) while Chinese page are rendered O(1)?

You need a better example, since that one's rather muddy. It's a matter of characters per word, not pixels per character. But generally speaking, Chinese pages will be rendered with less computational cost associated with the layout than pages with variable-width characters.

> As I said there are many more hard problems than UTF-8. If you want to support i18n and l10n, you have to live with it.

No, we don't. We do *not* have to live with it at all. That UTF-8 is a variable-length representation is an implementation detail, and one we are not required to live with internally.
If UTF-16 (which is also variable-width, annoyingly) or UTF-32 (which doesn't officially exist as far as I can tell, but we can define by fiat) is better for us, then great. They're all just different ways of representing Unicode abstract characters. (I think--I'm only up to chapter 3 of the Unicode 3.0 book.)

Besides, I think you're arguing a completely different point, and I think it's been missed generally. Where we're going to get bit hard, and I can't see a way around it, is combining characters. The individual Unicode abstract characters can have a fixed-width representation, but the number of Unicode characters per 'real' character is variable, and I can't see any way around that. (It looks like it's legal to stack four or six modifier characters on a base character, and I don't think I'm willing to go so far as to use UTF-128 internally. That's a touch much, even for me... :)

Then there also seems to be metadata embedded in the Unicode standard--stuff like the bidirectional ordering and alternate formatting characters. Bleah. It looks like, for us to do Unicode properly with all the world's languages, we might have to have a tagged text format like we've been talking about for other things (XML and suchlike stuff).

And I'm not anywhere near sure what we should do for substitutions. If you have the sequence:

    LATIN SMALL LETTER A, COMBINING TILDE

and do a s/a/b/, should you then have

    LATIN SMALL LETTER B, COMBINING TILDE

and if not, if you do a s/SMALL LETTER A WITH TILDE/q/ on the sequence, should you end up with LATIN SMALL LETTER Q or not? The original sequence was two separate characters and the match was one, but they are really the same thing.

Unicode is making my head hurt. I do *not* have an appropriate language background to feel comfortable with this, even for the pieces that are relevant to the languages I have any familiarity with.

Dan

--"it's like this"--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
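Dan's substr() cost argument can be made concrete: with fixed-width storage the start of character N is a multiplication, while with UTF-8 every preceding character must be stepped over. A sketch (Python; helper names invented, valid UTF-8 assumed, and codepoints -- not combining sequences -- treated as the unit, which is exactly the simplification Dan is questioning):

```python
def utf8_byte_offset(buf: bytes, char_index: int) -> int:
    """Byte offset of the char_index-th codepoint in UTF-8 data.
    O(n): every preceding codepoint must be scanned past."""
    i = 0
    for _ in range(char_index):
        b = buf[i]
        # Lead byte tells us how far to jump for this codepoint.
        i += 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
    return i

def utf32_byte_offset(char_index: int) -> int:
    """Fixed-width (UTF-32) storage: O(1), just index * 4."""
    return char_index * 4

s = "na\u00efve\u4e2d\u6587"           # mixed 1-, 2- and 3-byte characters
u8 = s.encode("utf-8")
assert utf8_byte_offset(u8, 3) == len("na\u00ef".encode("utf-8"))
assert utf32_byte_offset(3) == 12
# substr($foo, 233253, 14) over UTF-8 pays the scan on every call;
# over UTF-32 the start offset is a single multiplication.
```

Combining characters then undercut even the UTF-32 version, since "one codepoint" stops being "one character" -- which is where Dan's head starts hurting.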