Q: MMD and non PMC value (was: keyed vtables and mmd)
Dan Sugalski [EMAIL PROTECTED] wrote: ... And... we move *all* the operator functions out of the vtable and into the MMD system. All of it. This *all* includes vtable functions like add_int() or add_float() too, I presume. For these we have left argument dispatch only. But what is the right argument? A PerlInt, TclInt, PyInt (or ..Float)? Or is it assumed to be the same as the left argument type? leo
Re: Bit ops on strings
The bitshift operations on S-register contents are valid, so long as the thing hanging off the register supports it. Binary data ought to allow this. Most 8-bit string encodings will have to support it whether it's a good idea or not, since you can do it now. If Jarkko tells me you can do bitwise operations with unicode text now in Perl 5, well... we'll support it there, too, though we shan't like it at all. We can, and I don't like it at all :-) What they basically operate on are the internal UTF-8 bit patterns, in other words utter crapola from the viewpoint of traditional bit strings. Especially fun was getting the semantics of ~ to make any sense whatsoever. None of it is anything I want to propagate anywhere. I *think* most of the variable-width encodings, and the character sets that sit on top of them, can reasonably forbid this.
RE: Bit ops on strings
On Fri, 2004-04-30 at 13:53, Dan Sugalski wrote: Parrot, at the very low levels, makes no distinction between strings and buffers--as far as it's concerned they're the same thing, and either can hang off an S register. (Ultimately, when *I* talk of strings I mean "a thing I can hang off an S register", though I'm in danger of turning into Humpty Dumpty here) That's part of the problem. There are already bitwise operations on S-register things in the core, which is OK. Ahhh, now things are beginning to make a little more sense. Bear with me for a question or two more. The bitshift operations on S-register contents are valid, so long as the thing hanging off the register supports it. Binary data ought to allow this. Most 8-bit string encodings will have to support it whether it's a good idea or not, since you can do it now. If Jarkko tells me you can do bitwise operations with unicode text now in Perl 5, well... we'll support it there, too, though we shan't like it at all. I *think* most of the variable-width encodings, and the character sets that sit on top of them, can reasonably forbid this. mode=dave barry Since text strings are a proper subset of a binary buffer, which is what the string registers really are, what we've logically got is this: /mode

    LAYER    1          2              3

             +-- Text Ops --- (Hosted Language)
    SREG --+
             +-- Bin Ops  --- (Hosted Language)

or maybe this:

    SREG --- Bin Ops --- (Hosted Language)
                      +-- Text Ops --- (Hosted Language)

where semantics are found in Layers 2 and 3. (Layer 3 could also be merged.) Now I think that's more or less what Parrot has, right? Except that the Layer 2 semantics are tracked (and locked in?) at Layer 1? (To prevent the aforementioned bit-shifting of WTF strings.) -- Bryan C. Warnock bwarnock@(gtemail.net|raba.com)
RE: Bit ops on strings
On Fri, 2004-04-30 at 15:34, Dan Sugalski wrote: If you want, you could think of the S-register strings as mini-PMCs. The encoding and charset stuff (we'll ignore language semantics for the moment) are essentially small vtables that hang off the string, and whatever we do with it mostly goes through those vtable functions. Yeah, I was thinking that perhaps all the non-buffer semantics should have been a PMC (that then wrapped and used the SREG with its byte-buffer semantics). The PMCs would then be as lax or strict with text semantics as they needed to be, without causing semantic interference with different languages' needs. Everything's possible with enough abstraction, and all that. Slow and bulky, though. Oh, well, two years late and a dollar short. Which sort of argues for putting the bitstring stuff in there somewhere as well. (And may well argue for MMD on string operations, but I think that makes my head hurt so I'm not going there right now) Good 'nuff. Thanks, -- Bryan C. Warnock bwarnock@(gtemail.net|raba.com)
MMD syntax in PMCs (was: keyed vtables and mmd)
Dan Sugalski [EMAIL PROTECTED] wrote: ... We rework the current pmc processor to take the entries that are getting tossed and automatically add them to the MMD tables on PMC load instead. I've now implemented MMD for PerlInt's bitwise_xor as a test case. Syntax looks like this:

    void bitwise_xor(PMC* value, PMC* dest) {
        MMD_PerlInt: {
            VTABLE_set_integer_native(INTERP, dest,
                PMC_int_val(SELF) ^ PMC_int_val(value));
        }
        MMD_DEFAULT: {
            VTABLE_set_integer_native(INTERP, dest,
                PMC_int_val(SELF) ^ VTABLE_get_integer(INTERP, value));
        }
    }

This creates two functions:

    Parrot_PerlInt_bitwise_xor()
    Parrot_PerlInt_bitwise_xor_PerlInt()

with the body parts from above and these initializer code snippets:

    { MMD_BXOR, enum_class_PerlInt, 0,
      (funcptr_t) Parrot_PerlInt_bitwise_xor },
    { MMD_BXOR, enum_class_PerlInt, enum_class_PerlInt,
      (funcptr_t) Parrot_PerlInt_bitwise_xor_PerlInt }

leo
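[For readers following along: here is a minimal, self-contained sketch of how a two-argument dispatch table with a right-operand "default" entry (right type 0) can be looked up, mirroring the two generated initializers above. This is not Parrot's actual code; the table layout, type ids, and mmd_lookup name are invented for illustration.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type ids, standing in for enum_class_PerlInt etc. */
enum { TYPE_DEFAULT = 0, TYPE_PERLINT = 1, TYPE_INTEGER = 2 };

typedef int (*mmd_func_t)(int left, int right);

/* One row per registered (left type, right type, function) triple;
 * right_type 0 means "any right operand" -- the default case. */
typedef struct {
    int left_type;
    int right_type;
    mmd_func_t func;
} mmd_entry_t;

static int bxor_perlint_perlint(int l, int r) { return l ^ r; }
static int bxor_perlint_default(int l, int r) { return l ^ r; }

static const mmd_entry_t bxor_table[] = {
    { TYPE_PERLINT, TYPE_PERLINT, bxor_perlint_perlint },
    { TYPE_PERLINT, TYPE_DEFAULT, bxor_perlint_default },
};

/* Try an exact (left, right) match first, then fall back to the
 * (left, default) entry, as the generated initializers suggest. */
static mmd_func_t mmd_lookup(const mmd_entry_t *tbl, size_t n,
                             int left, int right)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (tbl[i].left_type == left && tbl[i].right_type == right)
            return tbl[i].func;
    for (i = 0; i < n; i++)
        if (tbl[i].left_type == left && tbl[i].right_type == TYPE_DEFAULT)
            return tbl[i].func;
    return NULL;
}
```

A PerlInt ^ PerlInt pair finds the specific entry; PerlInt ^ anything-else falls through to the default entry, which is exactly the two-function split the PMC processor generates.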
Re: Bit ops on strings
On Sat, 2004-05-01 at 04:57, Jarkko Hietaniemi wrote: If Jarkko tells me you can do bitwise operations with unicode text now in Perl 5, well... we'll support it there, too, though we shan't like it at all. We can and I don't like it at all [...] None of it is anything I want to propagate anywhere. Please correct me if I'm wrong here, but I'm going to lay out my understanding as a set of assertions:

    * Parrot will be able to convert any encoding to any other encoding
    * though some conversions will result in an exception, that's still defined behavior
    * We've agreed that only raw binary 8-bit strings make sense for bit vector operations

So it seems to me that the obvious way to go is to have all bit-string operations first convert to raw bytes (possibly throwing an exception) and then proceed to do their work. This means that UTF-8 strings will be handled just fine, and (as I understand it) some subset of Unicode-at-large will be handled as well. In other words, the burden goes on the conversion functions, not on the bit ops. It's not that it's going to be meaningful in the general case, but if you have code like:

    sub foo() { return "\x01" +| "\x02" }

I would expect to get the bit-string "\x03" back, even though strings may default to Unicode in Perl 6. You could put this on the shoulders of the client language (by saying that the operands must be pre-converted), but that seems to be contrary to Parrot's usual MO. Let me know. I'm happy to do it either way, and I'll look at modifying the other bit-string operators if they don't conform to the decision. -- Aaron Sherman [EMAIL PROTECTED] Senior Systems Engineer and Toolsmith It's the sound of a satellite saying, 'get me down!' -Shriekback signature.asc Description: This is a digitally signed message part
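[The convert-then-operate proposal above can be sketched in a few lines of C. This is illustrative only: the function name is invented, and where it returns an error code, the real conversion layer would throw a Parrot exception instead.]

```c
#include <assert.h>
#include <stddef.h>

/* Bit ops only ever see raw bytes: code points are first "converted
 * down", and anything above 0xff fails the conversion -- standing in
 * for the conversion throwing an exception.
 * Returns 0 on success, -1 if a code point does not fit in a byte. */
static int bitwise_or_codepoints(const unsigned int *a,
                                 const unsigned int *b,
                                 unsigned char *out, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (a[i] > 0xff || b[i] > 0xff)
            return -1;               /* \x{100} and up: exception */
        out[i] = (unsigned char)(a[i] | b[i]);
    }
    return 0;
}
```

With this shape, "\x01" +| "\x02" yields "\x03" regardless of how the operands happened to be stored, and only strings containing code points beyond \x{ff} ever reach the error path.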
Re: Bit ops on strings
So it seems to me that the obvious way to go is to have all bit-string operations first convert to raw bytes (possibly throwing an exception) and then proceed to do their work. If these conversions croak if there are code points beyond \x{ff}, I'm fine with it. But trying to mix in \x{100} or higher just leads to silly discontinuities (basically we would need to decide on a word width, and I think that would be a silly move). This means that UTF-8 strings will be handled just fine, and (as I Please don't mix encodings and code points. That strings might be serialized or stored as UTF-8 should have no consequences for bitops. understand it) some subset of Unicode-at-large will be handled as well. In other words, the burden goes on the conversion functions, not on the bit ops. It's not that it's going to be meaningful in the general case, but if I'd rather have meaningful results. you have code like: sub foo() { return "\x01" +| "\x02" } Please consider what happens when the operands have code points beyond 0xff. I would expect to get the bit-string "\x03" back, even though strings may default to Unicode in Perl 6. Of course. But I would expect a horrible flaming death for "\x{100}" +| "\x02". You could put this on the shoulders of the client language (by saying that the operands must be pre-converted), but that seems to be contrary to Parrot's usual MO. Let me know. I'm happy to do it either way, and I'll look at modifying the other bit-string operators if they don't conform to the decision. -- Jarkko Hietaniemi [EMAIL PROTECTED] http://www.iki.fi/jhi/ There is this special biologist word we use for 'stable'. It is 'dead'. -- Jack Cohen
Re: Bit ops on strings
On Sat, 2004-05-01 at 11:26, Jarkko Hietaniemi wrote: As for code points outside of \x00-\xff, I vote exception. I don't think there's any other logical choice, but I think it's just an encoding conversion exception, not a special bit-op exception (that's arm-waving; I have not looked at Parrot's exception model yet... miles to go...) This means that UTF-8 strings will be handled just fine, and (as I Please don't mix encodings and code points. That strings might be serialized or stored as UTF-8 should have no consequences for bitops. What I meant was that UTF-8 IS going to be represented in a way that will guarantee you won't get an exception when trying to do bit-ops. All bets are off for many other encodings. While you're right that you might get lucky, that wasn't really the point I was making. Many languages (Perl included, I think) are going to encode strings as UTF-8 by default, and this means that in the general case, we should not expect exceptions to be thrown around any time we do a bit-op, and 'A'|'B' will still be 'C' :-) Of course. But I would expect a horrible flaming death for "\x{100}" +| "\x02". Well, if you consider a string conversion exception to be horrible flaming death, then I'd hate to see what you do with a divide-by-zero ;-) None of your response sounds overly scary to me, so I'll start looking at what Parrot does NOW for bit-string-ops and see if it needs to mutate to fit this model. Then I'll add in the rest. Then I get to see what evil Dan and Leo perform upon my patch ;-) -- Aaron Sherman [EMAIL PROTECTED] Senior Systems Engineer and Toolsmith It's the sound of a satellite saying, 'get me down!' -Shriekback signature.asc Description: This is a digitally signed message part
Re: MMD performance
Leopold Toetsch [EMAIL PROTECTED] wrote: [ another MMD performance compare ] Just an update. The last benchmark still called MMD via the vtable. Here is now a compare of calling MMD from the run loop:

    $ parrot -C mmd-bench.imc
    vtbl add  PerlInt PerlInt    1.072931
    vtbl add  PerlInt Integer    1.085116
    MMD  bxor PerlInt PerlInt    0.849723
    MMD  bxor PerlInt Integer    0.989387

    $ parrot -j mmd-bench.imc
    vtbl add  PerlInt PerlInt    0.685505
    vtbl add  PerlInt Integer    0.692237
    MMD  bxor PerlInt PerlInt    0.628078
    MMD  bxor PerlInt Integer    0.790955

The JITed vtable add calls directly into the vtable, while the MMD bxor is still a function that calls mmd_dispatch. Compiled with -O3, 5 million operations on an Athlon 800. leo
Re: Bit ops on strings
On May 1, 2004, at 8:26 AM, Jarkko Hietaniemi wrote: So it seems to me that the obvious way to go is to have all bit-string operations first convert to raw bytes (possibly throwing an exception) and then proceed to do their work. If these conversions croak if there are code points beyond \x{ff}, I'm fine with it. But trying to mix in \x{100} or higher just leads to silly discontinuities (basically we would need to decide on a word width, and I think that would be a silly move). Just FYI, the way I implemented bitwise-not so far was to bitwise-not code points 0x{00}-0x{FF} as uint8-sized things, 0x{100}-0x{FFFF} as uint16-sized things, and 0x{10000} and above as uint32-sized things (but then bit-masking the result to make sure that it fell into a valid code point range). That's pretty arbitrary, but if you bitwise-not as though everything were 32 bits wide, you'll end up with a string containing no assigned code points at all (they'll all be above 0x10FFFF). But from a text point of view, bitwise-not on a string isn't a sensible operation no matter how you slice it (that is, even for 0x{00}-0x{FF}), so one flavor of arbitrary is just about as good as any other. We could also make anything above 0x{FF} map to either 0x{00} or 0x{FF}, or mask it with 0xFF to push it into that range. It's all pretty meaningless, as text transformations go, and I can't imagine anyone using it for anything, except maybe weak encryption. This means that UTF-8 strings will be handled just fine, and (as I Please don't mix encodings and code points. That strings might be serialized or stored as UTF-8 should have no consequences for bitops. Exactly. And also realize that if you bitwise-not (or shift or something similar) the bytes of a UTF-8 serialization of something, the result isn't going to be valid UTF-8, so you'd be hard-pressed to lay text semantics down on top of it. understand it) some subset of Unicode-at-large will be handled as well. In other words, the burden goes on the conversion functions, not on the bit ops. 
It's not that it's going to be meaningful in the general case, but if I'd rather have meaningful results. Exactly--and, meaningful operations to begin with. I'm beginning to wonder if we're going to be square-rooting strings, and taking the array-th root of a hash :) JEff
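[Jeff's width-dependent bitwise-not is straightforward to sketch. Note that the clamping mask for the 32-bit case is an assumption on my part -- the exact constant was garbled in the archived post -- so this clamps into Unicode's code point range (<= 0x10FFFF) as one plausible reading.]

```c
#include <assert.h>
#include <stdint.h>

/* Negate a code point at the smallest of 8/16/32 bits that holds it,
 * then (for the 32-bit case) clamp back into a valid Unicode code
 * point range. The 0x10FFFF mask is an assumed stand-in for the
 * original post's lost constant. */
static uint32_t codepoint_not(uint32_t cp)
{
    if (cp <= 0xFFu)
        return (uint8_t)~(uint8_t)cp;        /* uint8-sized not */
    if (cp <= 0xFFFFu)
        return (uint16_t)~(uint16_t)cp;      /* uint16-sized not */
    return (~cp) & 0x10FFFFu;                /* uint32, clamped */
}
```

This makes Jeff's discontinuity concrete: ~0x41 stays in the Latin-1 range (0xBE), but ~0x100 jumps to 0xFEFF, and without the clamp every 32-bit result would land above 0x10FFFF, outside any assigned code point.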
Re: Bit ops on strings
On Sat, 2004-05-01 at 14:18, Jeff Clites wrote: On May 1, 2004, at 8:26 AM, Jarkko Hietaniemi wrote: Just FYI, the way I implemented bitwise-not so far was to bitwise-not code points 0x{00}-0x{FF} as uint8-sized things, 0x{100}-0x{FFFF} as uint16-sized things, and 0x{10000} and above as uint32-sized things (but then bit-masking the result to make sure that it fell into a valid code point range). That's pretty arbitrary, but if you bitwise-not as though everything were 32 bits wide, you'll end up with a string containing no assigned code points at all (they'll all be above 0x10FFFF). But from a text point of view, bitwise-not on a string isn't a sensible operation no matter how you slice it (that is, even for 0x{00}-0x{FF}), so one flavor of arbitrary is just about as good as any other. We could also make anything above 0x{FF} map to either 0x{00} or 0x{FF}, or mask it with 0xFF to push it into that range. It's all pretty meaningless, as text transformations go, and I can't imagine anyone using it for anything, except maybe weak encryption. I think Dan and I were both thinking in terms of bit-vector operations on byte-streams for any purpose that would require such a beast. In Perl, you have the vec function to make this slightly easier. This is one of those places where thinking about strings as text is highly misleading. They're used for an awful lot more. Exactly. And also realize that if you bitwise-not (or shift or something similar) the bytes of a UTF-8 serialization of something, the result isn't going to be valid UTF-8, so you'd be hard-pressed to lay text semantics down on top of it. How are you defining valid UTF-8? Is there a codepoint in UTF-8 between \x00 and \xff that isn't valid? Is there a reason to ever do bitwise operations on anything other than 8-bit codepoints? 
I'm beginning to wonder if we're going to be square-rooting strings, and taking the array-th root of a hash :) Strings are not numbers, but there's a heck of a lot of code out there that treats existing strings as bit-vectors (note: bit vectors are not numbers either), and that code needs to be supported, no? Now, shift operations aren't usually part of the package, but I figured that as long as we were going to have the rest of the bit-manipulators, finishing off the set would be of value. More to the point, I said all of this at the beginning of this thread. You should not, at this point, be confused about the scope of what I want to do, as it was very narrowly and clearly defined up-front. -- Aaron Sherman [EMAIL PROTECTED] Senior Systems Engineer and Toolsmith It's the sound of a satellite saying, 'get me down!' -Shriekback signature.asc Description: This is a digitally signed message part
Re: Bit ops on strings
How are you defining valid UTF-8? Is there a codepoint in UTF-8 between \x00 and \xff that isn't valid? Is there a reason to ever do Like, half of them? \x80 .. \xff are all invalid as UTF-8. bitwise operations on anything other than 8-bit codepoints? I am very confused. THIS IS WHAT WE ALL SEEM TO BE SAYING. BITOPS ONLY ON EIGHT-BIT DATA. AM I WRONG?
Re: Bit ops on strings
On May 1, 2004, at 12:00 PM, Aaron Sherman wrote: On Sat, 2004-05-01 at 14:18, Jeff Clites wrote: Exactly. And also realize that if you bitwise-not (or shift or something similar) the bytes of a UTF-8 serialization of something, the result isn't going to be valid UTF-8, so you'd be hard-pressed to lay text semantics down on top of it. How are you defining valid UTF-8? Is there a codepoint in UTF-8 between \x00 and \xff that isn't valid? Is there a reason to ever do bitwise operations on anything other than 8-bit codepoints? If you're dealing in terms of code points, then the UTF-8 encoding (or any other) has nothing to do with it. If you are dealing in terms of bytes, then there are byte sequences which don't encode any code point in the UTF-8 encoding. By valid UTF-8, I'm referring to the definition of that encoding (and I should have said, well-formed)--see section 3.9, item D36 of the Unicode Standard. In particular, bytes 0xC0, 0xC1, and 0xF5-0xFF cannot occur in UTF-8. But if you're speaking in terms of code points, that's not relevant, but then neither is the encoding. More to the point, I said all of this at the beginning of this thread. You should not, at this point, be confused about the scope of what I want to do, as it was very narrowly and clearly defined up-front. And yet, I am confused. You said near the beginning of the thread: On Fri, 2004-04-30 at 10:42, Dan Sugalski wrote: Bitstring operations ought only be valid on binary data, though, unless someone can give me a good reason why we ought to allow bitshifting on Unicode. (And then give me a reasoned argument *how*, too) 100% agree. If you want to play games with any other encoding, you may proceed to write your own damn code ;-) Given that, I'm not sure how UTF-8 is coming into the picture. JEff
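[The byte-level constraint Jeff cites can be captured in a few lines. This is illustrative only -- real well-formedness checking must also validate sequence structure (lead/continuation byte ordering, overlong forms, surrogate ranges), not just individual byte values.]

```c
#include <assert.h>

/* Bytes that can never appear anywhere in well-formed UTF-8, per the
 * Unicode Standard, section 3.9: 0xC0 and 0xC1 (would only start
 * overlong encodings) and 0xF5-0xFF (would encode beyond U+10FFFF).
 * Every other byte value can occur in *some* well-formed sequence. */
static int byte_possible_in_utf8(unsigned char b)
{
    if (b == 0xC0 || b == 0xC1)
        return 0;
    if (b >= 0xF5)
        return 0;
    return 1;
}
```

Note the contrast with Jarkko's point: a byte like 0x80 can occur in well-formed UTF-8 (as a continuation byte), but never on its own -- which is why bitwise-mangling a UTF-8 serialization so easily produces ill-formed output.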
[perl #29299] [PATCH] MSWin32 Fix spawn stdout handling and PerlNum.get_string()
# New Ticket Created by Ron Blaschke # Please include the string: [perl #29299] # in the subject line of all future correspondence about this issue. # URL: http://rt.perl.org:80/rt3/Ticket/Display.html?id=29299 spawn on win32 should let the child process inherit the filehandles, because the child is supposed to write on the parent's stdout. (t/pmc/sys.t#1) config/gen/platform/win32/exec.c PerlNum.get_string() should print -0.00 for the value -0.0, but prints 0.00 on win32. get_string() now prints the sign symbol itself, instead of relying on sn?printf. (t/pmc/perlnum.t#36) classes/perlnum.pmc mswin32_spawn_and_perlnum.patch Description: Binary data
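[The -0.0 issue comes down to printf's handling of negative zero varying across C runtimes (the win32 runtime of the day printed "0.00"). A hedged sketch of the print-the-sign-yourself approach the patch describes -- format_num is an invented name, not the actual patch code:]

```c
#include <math.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>  /* for the usage checks below */

/* Emit the sign explicitly via signbit(), which sees -0.0's sign bit
 * even though -0.0 == 0.0 compares true, then format the magnitude.
 * This gives "-0.00" for -0.0 regardless of the platform's printf. */
static void format_num(double d, char *buf, size_t buflen)
{
    if (signbit(d))
        snprintf(buf, buflen, "-%.2f", fabs(d));
    else
        snprintf(buf, buflen, "%.2f", d);
}
```

The key detail is that a comparison like `d < 0` cannot detect negative zero; only signbit() (or inspecting the bit pattern) can.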
Re: Win32 build fails on src\interpreter.str
On Tue, 27 Apr 2004 10:09:43 +0200, Leopold Toetsch wrote: Does anyone need the Edit and Continue feature? If yes, it can be easily turned on in the local Makefile. Just a final remark that just popped up: Since parrot doesn't compile with -ZI (because of __LINE__), it would make little sense to enable it in the Makefile. For me that's ok, as I have never used it anyway. ;-) Ron
[perl #29300] [PATCH] MSWin32 passing libnci.def to linker
# New Ticket Created by Ron Blaschke # Please include the string: [perl #29300] # in the subject line of all future correspondence about this issue. # URL: http://rt.perl.org:80/rt3/Ticket/Display.html?id=29300 link needs to be told that libnci.def is a module definition file, via the -def: flag. The patch changes libnci.def to -def:libnci.def. config/gen/makefiles/root.in mswin32_libnci_flag.patch Description: Binary data
[perl #29302] [PATCH] Invalid HTML doc links for Win32 Firefox
# New Ticket Created by Philip Taylor # Please include the string: [perl #29302] # in the subject line of all future correspondence about this issue. # URL: http://rt.perl.org:80/rt3/Ticket/Display.html?id=29302 On a Windows system, File::Spec returns paths with backslashes. The HTML documentation generator (write_docs.pl etc) uses these paths in the HTML code, resulting in links like

    <a href="docs\pdds\pdd00_pdd.pod.html">...</a>

IE handles these with no problems. Firefox (0.8) follows the link to the right place, but then refers to itself as something like file:///e:/parrot/cvs/parrot/docs/html/docs%5Cpdds%5Cpdd00_pdd.pod.html (apparently forgetting that it used to think \ was a path delimiter and now considering it part of the filename) and so all the relative links in that page, like

    <a href="..\..\../html/index.html">Contents</a>

(as well as all the images and stylesheets) are incorrect. All appears to work (in IE and Firefox) after changing relative_path() in lib/Parrot/IO/Directory.pm to replace backslashes with forward slashes before returning. (The same can be achieved by altering the two link-generating bits in lib/Parrot/Docs/Item.pm, but I have no idea whether that would be a better place to do it.) -- Philip Taylor [EMAIL PROTECTED]

    Index: parrot/lib/Parrot/IO/Directory.pm
    ===================================================================
    RCS file: /cvs/public/parrot/lib/Parrot/IO/Directory.pm,v
    retrieving revision 1.9
    diff -u -b -r1.9 Directory.pm
    --- parrot/lib/Parrot/IO/Directory.pm   27 Mar 2004 22:22:54 -  1.9
    +++ parrot/lib/Parrot/IO/Directory.pm   1 May 2004 17:00:40 -
    @@ -161,7 +161,9 @@
         $path = $path->path if ref $path;
    
    -    return File::Spec->abs2rel($path, $self->path);
    +    my $rel_path = File::Spec->abs2rel($path, $self->path);
    +    $rel_path =~ tr~\\~/~;
    +    return $rel_path;
     }
    
     =item C<parent()>
Re: Outstanding parrot issues?
On 30 Apr 2004, at 12:54, Leopold Toetsch wrote: ... Would it be possible for parrot to provide an embedder's interface to all the (exported) functions that checks whether the stack top pointer is set, and if not (i.e. NULL) pulls the address of a local variable into it This doesn't work:

    {
        PMC *x = pmc_new(..);
        {
            some_parrot_func();
        }
    }

C<x> would be outside of the visible range of stack items. The braces do of course indicate stack frames. Since in this case I am outside of parrot and have chosen to use the interface, I'd better use register_pmc, and if I did, then this scheme would work? Arthur
Re: Outstanding parrot issues?
On 30 Apr 2004, at 19:30, Leopold Toetsch wrote: Like it or not, DOD/GC has different impacts on the embedder. The above rules are simple. There is no "when the PMC isn't used any more, decrement a refcount" and "when you do that and that, then increment a refcount" or some such, like in XS. This is really simple. Simplest is to just set the top of stack. I am now going to be impolite. THERE ARE CASES WHERE YOU CAN NOT SET A TOP OF STACK, FOR EXAMPLE IF YOU ARE WRITING A PLUGIN TO A BINARY ONLY APPLICATION LIKE INTERNET EXPLORER, OR WRITING AN APACHE2 SHARED LIBRARY THAT IS SUPPOSED TO WORK WITH PRE-COMPILED BINARIES, NOT TO MENTION A LOT OF APPLICATIONS THAT MIGHT WANT TO EMBED PARROT AS AN OPTION MIGHT FEEL IT IS A TAD FUCKING UNCLEAN TO RUN THEIR ENTIRE APPLICATION THROUGH PARROT (THINK OPENOFFICE). I am amazed by the fact that parrot seems determined to repeat the same mistakes perl5 did. Arthur
Re: Outstanding parrot issues?
Arthur Bergman wrote: I am now going to be impolite. Meh... Leo: There are some embedding applications where it's simply not possible to get the top of the stack. For example, let's say I want to write a Parrot::Interp module for Perl 5 (on a non-Ponie core):

    my $i = new Parrot::Interp;
    my $argv = $i->new_pmc('PerlArray');
    $argv->push($i->new_pmc('PerlString')->set_string('foo'));
    $i->load_bytecode("foo.pbc");
    $i->run_bytecode($argv);

Now, theoretically Parrot::Interp::new should capture the top of the C stack, but there's no way it could do so. If it captured an auto variable in its own body, that variable might not even be part of the stack by the time run_bytecode is invoked. Having said that, the PMC registration technique ought to be good enough for this particular application. Arthur: Embedding Parrot will never be quite as simple conceptually as embedding Perl. The garbage collection system ensures that. Even so, there does need to be a way to embed Parrot without having it take over your program--and it appears that PMC registration and other alternative methods of dealing with the GC will do that. There's no need to disable the GC outside of a runloop, and in fact I could easily imagine someone using Parrot buffers and the GC system without using the runloop itself, as a convenient memory management system for an application otherwise written in straight C. (Not to mention that Parrot I/O and strings should be a lot nicer than the straight C equivalents...) Parrot must be embeddable in virtually any environment Perl can be. That doesn't mean it has to be as easy, but it has to be possible. If it isn't, we might as well give up on the embedding interface altogether. -- Brent Dax Royal-Gordon [EMAIL PROTECTED] Perl and Parrot hacker Oceania has always been at war with Eastasia.
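[The PMC-registration idea -- explicit roots instead of C-stack scanning -- can be sketched generically. None of these names are Parrot's real API (its equivalent at the time was along the lines of register_pmc); this just shows the mechanism a collector needs: a registry the embedder adds to and removes from, which the GC treats as part of its root set.]

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical root registry: objects an embedder holds outside any
 * scannable stack.  A mark phase would trace everything reachable
 * from root_registry[0..root_count-1] in addition to its other roots. */
#define MAX_ROOTS 64

static void *root_registry[MAX_ROOTS];
static size_t root_count;

static int gc_register(void *obj)          /* pin: keep obj alive */
{
    if (root_count >= MAX_ROOTS)
        return -1;
    root_registry[root_count++] = obj;
    return 0;
}

static int gc_unregister(void *obj)        /* unpin: collectible again */
{
    size_t i;
    for (i = 0; i < root_count; i++) {
        if (root_registry[i] == obj) {
            /* order doesn't matter, so swap in the last entry */
            root_registry[i] = root_registry[--root_count];
            return 0;
        }
    }
    return -1;
}

static int gc_is_rooted(void *obj)         /* would the mark phase see it? */
{
    size_t i;
    for (i = 0; i < root_count; i++)
        if (root_registry[i] == obj)
            return 1;
    return 0;
}
```

This is exactly why the technique works inside an IE plugin or an Apache module: the embedder never has to know where the host application's C stack begins, at the cost of having to bracket every held object with register/unregister calls.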
Re: Bit ops on strings
It's been said that what the masses think of as "binary data" is outside the concept of a string, and this lurker just doesn't see that. A binary string is a string over a character set of size two, just like an ASCII string is a string over a character set of size 128. [Like character strings, so-called binary data can even have different encodings besides the usual eight bits packed into a byte, e.g. Base64 or 7E1 (7 bits, even parity, 1 stop bit).] And shifting is not at all limited to bit strings. If I have the bit string of length 5 (1, 0, 0, 0, 1), or 17 for short, and the text string "Hello", then I can shift the second left by two to get "llo" as easily as I can shift the first left by two to get 4. (The choice of fill character is of course up for debate.) Or arithmetically shift the second right by two to form "HHHel", analogous to an ASR of the bit string to yield 28. AND, XOR, etc. are less obvious, because there are multiple ways to define such operations when there are more than two truth values. But they can, at any rate, be defined: you can apply a function of two arguments element-wise to two character strings of equal length to produce a third character string of the same length. It seems right to leave these ops undefined by default for non-binary strings (since there is definitely no one right definition), but the prevailing notion that they *can't* be applied to, say, Unicode text without making a horrible mess is just wrong. Dan Sugalski [EMAIL PROTECTED] 04/30/04 10:25 PM At 7:07 PM -0700 4/30/04, Jeff Clites wrote: On Apr 30, 2004, at 10:22 AM, Dan Sugalski wrote: At 2:57 AM +1000 5/1/04, Andre Pang wrote: Of course Parrot should have a function to reinterpret something of a string type as raw binary data and vice versa, but don't mix binary data with strings: they are completely different types, and raw binary data should never be able to be put into a string register. 
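[The lurker's character-string shifts are easy to make concrete. A small sketch (function names are mine): the left shift drops leading elements and pads on the right with a caller-chosen fill character, and the "arithmetic" right shift replicates the first element the way ASR replicates the sign bit.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Shift a NUL-terminated string left by n, padding with `fill`:
 * "Hello" << 2 with fill ' '  ->  "llo  " */
static void str_shl(char *s, size_t n, char fill)
{
    size_t len = strlen(s);
    if (n > len) n = len;
    memmove(s, s + n, len - n);
    memset(s + len - n, fill, n);
}

/* "Arithmetic" right shift by n, replicating the first element the
 * way ASR replicates the sign bit:  "Hello" >> 2  ->  "HHHel" */
static void str_asr(char *s, size_t n)
{
    size_t len = strlen(s);
    if (len == 0) return;
    if (n > len) n = len;
    memmove(s + n, s, len - n);
    memset(s, s[n], n);   /* s[n] is the original first element */
}
```

Run on the digit string "10001" with fill '0', str_shl by two gives "00100" -- the same answer as shifting the 5-bit string 17 left by two to get 4, which is the lurker's point: the operation is element-wise and doesn't care what the alphabet is.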
Maybe some blurring of binary data/strings should happen at the Perl layer, but Parrot should keep them as distinct as possible, IMHO. I'm trying to make sure that keeping them separate is possible, but it's important for everyone to remember that we're limited in what we can do. Parrot *can't* dictate semantics. That's not what we get to do. But your plan seems to be very much dictating semantics--treating a whole class of reasonable string operations as "in that case, punt and throw an exception". That's why it's overridable. I fully expect most languages will do so by default, but with the option to leave the exceptions on as a debugging aid. And it's not clear that the semantics it is dictating in fact match any of the target languages (or in fact, any existing language at all). The at-runtime association of character set/encoding/language, and the semantics it implies, is what I'm referring to here. Yep, but with the exceptions disabled things'll act the way they should. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Bit ops on strings
On Sat, 2004-05-01 at 15:09, Jarkko Hietaniemi wrote: How are you defining valid UTF-8? Is there a codepoint in UTF-8 between \x00 and \xff that isn't valid? Is there a reason to ever do Like, half of them? \x80 .. \xff are all invalid as UTF-8. Heh, damn Ken Thompson and his placemat! I am too new to UCS and UTF-8, and had thought it was always 8-bit. I stand corrected, having read up on the UTF-8 and Unicode FAQ. Jeff, yeah, I have to take back my statement. If Perl defaults to UTF-8, then it's not a valid assumption that a UTF-8 input string won't throw an exception. I still think that's OK, and better than representation-expanding to the larger representation and doing the bit-op in that, since that would mean that bit-vectors would have to be valid in enum_stringrep_one, _two and _four as sort of alternate data structures. I don't think we want to go there. For everything else, as Jeff correctly points out, this has nothing to do with encoding. Only in the sense that the default encoding in a language like (only one example) Perl 6 dictates what representation you will have to expect to be the common case. bitwise operations on anything other than 8-bit codepoints? I am very confused. THIS IS WHAT WE ALL SEEM TO BE SAYING. BITOPS ONLY ON EIGHT-BIT DATA. AM I WRONG? No, it's not, and could you please not get emotional about this? It's what you, Dan and I have been saying, but I was responding to Jeff, who said: Just FYI, the way I implemented bitwise-not so far was to bitwise-not code points 0x{00}-0x{FF} as uint8-sized things, 0x{100}-0x{FFFF} as uint16-sized things, and 0x{10000} and above as uint32-sized things (but then bit-masking the result to make sure that it fell into a valid code point range). It was kind of important that I deal with the fact that I was proposing a very different behavior for bit-shifting than exists currently for boolean operations, I thought. 
The question becomes: should I CHANGE the existing bit-ops so that they don't work on two- or four-byte representations, for symmetry? If this continues to be so contentious, I'm tempted to agree with the nay-sayers and say that Parrot shouldn't do bit-vectors on strings, and we should just implement a bit-vector class later on. Perl will just have to suffer the overhead of translation. This just IS NOT important enough to waste this many brain cells on. -- Aaron Sherman [EMAIL PROTECTED] Senior Systems Engineer and Toolsmith It's the sound of a satellite saying, 'get me down!' -Shriekback signature.asc Description: This is a digitally signed message part
Re: Strings Manifesto
[Finishing this discussion on p6i, since it began here.] On Apr 28, 2004, at 5:05 PM, Larry Wall wrote: On Wed, Apr 28, 2004 at 03:30:07PM -0700, Jeff Clites wrote: : Outside. Conceptually, JPEG isn't a string any more than an XML : document is an MP3. I'm not vehemently opposed to redefining the meaning of string this way, but I would like to point out that the term used to have a more general meaning. Witness terms like "bit string". Good point. However, the more general usage seems to have largely fallen out of use (to the extent that I'd forgotten about it until now). For instance, the Java String class lacks this generality. Additionally, ObjC's NSString and (from what I can tell) Python and Ruby conceive of strings as textual. [And of course, it would be permissible in terms of English usage to say that a bit string isn't a string, much like a fire house isn't a house, a suspected criminal isn't necessarily a criminal, and melted ice isn't ice.] : Some languages make this very clear by providing a separate data type : to hold a blob of bytes. Java uses a byte[] for this (an array of : bytes), rather than a String. And Objective-C (via the Foundation : framework) has an NSData class for this (whereas strings are : represented via NSString). Another approach is to say that (in general) strings are sequences of abstract integers, and byte strings (and their ilk) impose a size constraint, while text strings impose various semantic constraints. This is more in line with the historical usage of "string". Yes, though I think that this diverges from current usage (in general programming contexts), and more importantly promotes the confusion that text is inherently byte-based (or even, semantically, number-based). The parenthesized point there is that a representation of text as a sequence of numbers is an implementation detail--it's not inherent in the notion of text. 
The semantics of text do not imply that it is a semantic constraint layered on top of a sequence of numbers.

In the vein of the Perl philosophy of making different things look different, I think it's important to linguistically distinguish between the two. Many programming languages do that, and users of those languages suffer less confusion in this area. The key point is that text and uninterpreted byte sequences are semantically oceans apart. I'd say that as data types, byte sequences are semantically much simpler than hashes (for instance), and strings-as-text are much more complex. It makes little sense to bitwise-not text, or to uppercase bytes.

: (And it implies that you can uppercase a JPEG, for instance.)
: Only some encodings let you get away with this--for example, not every
: byte sequence is valid UTF-8, so an arbitrary byte blob likely wouldn't
: decode if you tried to pretend that it was the UTF-8-encoded version of
: something. The major practical downside of doing something like this is
: that it leads to confusion, and propagates the viewpoint that a string
: is just a blob of bytes. And the conceptual downside is that if a
: string is fundamentally intended to represent textual data, then it
: doesn't make much sense to use it to represent something non-textual.

I think of a string as a fundamental data type that can be *used* to represent text when properly typed. But strings are more fundamental than text--you can have a string of tokens, for instance. Just because various string types were confused in the past is no reason to settle on a single string type as the only true string. If you can do it, fine, but you'll have to come up with a substitute name for the more general concept, or you're going to be fighting the culture continually from here on out.

I don't like culture wars... I think the more general concept is "array". The major problem with using "string" for the more general concept is confusion.
People do tend to get really confused here. If you define "string of blahs" to mean "sequence of blahs" (to match the historical usage), that's on its face reasonable. But people jump to the conclusion that a string-as-bytes is re-interpretable as a string-as-text (and vice versa) via something like a cast--a reinterpretation of the bytes of some in-memory representation. As a general sequence, one wouldn't be tempted to think that a string-of-quaternions was necessarily re-interpretable as a string-of-PurchaseOrders. I don't think it's culturally possible to shake this text-is-really-just-bytes view without using distinct terminology.

I'm not vehemently opposed to jettisoning the word "string" entirely, and instead using "Text" and "Sequence" for the above concepts--that's the usual way to deal with an ambiguous term. But the downside is that it forms a learning barrier for people coming from other languages. I think that "string" meaning text, and "array" meaning general sequence, would be the most consistent with current general usage. But my main concern is that we distinguish the two.
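The earlier point that an arbitrary byte blob usually won't decode as UTF-8 is easy to demonstrate. This Python sketch uses a made-up four-byte blob standing in for binary data such as a JPEG fragment:

```python
blob = b"\xff\xfe\x00\x80"  # arbitrary binary data, not text

# 0xFF is never a valid UTF-8 lead byte, so pretending this blob is
# UTF-8-encoded text fails immediately rather than yielding garbage.
try:
    blob.decode("utf-8")
    print("decoded as text")
except UnicodeDecodeError as e:
    print("not text:", e.reason)
```

This is the cast-style reinterpretation failing at runtime: the bytes were never a representation of text, so no decoding rule can recover text from them.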
Re: Outstanding parrot issues?
Arthur Bergman [EMAIL PROTECTED] wrote:

THERE ARE CASES

Arthur, please let's quietly talk about possible issues.

Many libraries that you want to use demand that you call The_lib_init(bla). This isn't inappropriate, it's a rule. (dot). Parrot is GC based. (dot). This imposes different semantics for embedders. I've listed four different, very simple ways to not get your PMC collected too early.

GC and refcounting are different schemes to achieve the same thing. You know that. But nevertheless you have to follow these GC-specific rules.

leo
Re: Outstanding parrot issues?
Brent 'Dax' Royal-Gordon [EMAIL PROTECTED] wrote:

Arthur Bergman wrote:

I am now going to be impolite.

Meh...

Leo: There are some embedding applications where it's simply not possible to get the top of the stack.

Not possible, or some of ... just don't like that ;) Write a Parrot::Interp module for Perl 5 (on a non-Ponie core):

my $i = new Parrot::Interp;
my $argv = $i->new_pmc('PerlArray');

If there is such an interface, it's responsible for anchoring the PMC. This is *one* simple rule.

Shit: really. I don't get it. Please read (again): $ perldoc perlguts /increment

... and does not increment ... do not increment the reference count ... As a side effect, it increments ... has been incremented to two. ... If it is not the same as the sv argument, the reference count of the obj object is incremented. If it is the same, or if the how argument is PERL_MAGIC_arylen, or if it is a NULL pointer, then obj is merely stored, without the reference count being incremented.

*That could make me cry*

Think different, have fun, leo
Re: Outstanding parrot issues?
On 2 May 2004, at 00:20, Leopold Toetsch wrote:

Arthur Bergman [EMAIL PROTECTED] wrote: THERE ARE CASES

Arthur, please let's quietly talk about possible issues. Many libraries that you want to use demand that you call The_lib_init(bla). This isn't inappropriate, it's a rule. (dot). Parrot is GC based. (dot).

Yes, but they don't demand that at the top level. By demanding that at the top level you cut out all non-open-source applications with a plugin-based API; if this is your goal then I am going to stop playing right now.

This imposes different semantics for embedders. I've listed four different, very simple ways to not get your PMC collected too early. GC and refcounting are different schemes to achieve the same thing. You know that. But nevertheless you have to follow these GC-specific rules.

Leo, I am not an idiot, please do not treat me like one. I fail to see how the register/unregister PMC issue is semantically different from a reference count. All I want to do is:

1) create a parrot interpreter
2) create some pmcs
3) call some code inside parrot with those pmcs

Now I am fine registering those PMCs that I create and unregistering them afterwards, but inside the call to parrot everything should behave as normal! Currently there is no easy way to do this.

The obvious answer seems to be to have the embedding interface set the top of stack in each embedding function if it is not set. This would do the right thing and make it easy to embed parrot.

Arthur
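The register/unregister rule being argued over here is the generic GC-rooting problem, and it can be illustrated in any collected language. The following Python analogy is not Parrot's embedding API; the `PMC` class and `anchored` list are stand-ins for a PMC and the interpreter's root set. It shows why an object you created but never anchored can vanish, while one registered with a root survives until unregistered:

```python
import weakref

class PMC:
    """Stand-in for a Parrot PMC; any garbage-collected object will do."""
    pass

anchored = []  # plays the role of the interpreter's root set

def make_unanchored():
    obj = PMC()
    return weakref.ref(obj)   # only a weak reference survives: no GC root

def make_anchored():
    obj = PMC()
    anchored.append(obj)      # "register" the PMC: now reachable from a root
    return weakref.ref(obj)

print(make_unanchored()() is None)  # True: collected once unreachable
print(make_anchored()() is None)    # False: the registry keeps it alive
```

Arthur's proposal amounts to having each embedding entry point establish the root (the top of stack) itself, so that inside a call into Parrot the embedder never has to think about this at all.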