> : Hello everybody,
> :
> : I'm about to learn myself perl6 (after using perl5 for some time).
>
> I'm also trying to learn perl6 after using perl5 for some time. :-)

I wouldn't even try to compare you and me .... :-)

> Pretty close. The way it's set up currently, $len is a reference
> to a variable external to the rule, so $len is likely to fail under
> stricture unless you've declared "my $len" somewhere. To make the
> variable automatically scope to the rule, you have to use $?len
> these days.

ok.

> : And furthermore is perl6 said to be unicode-ready.
> : So I put the :u0-modifier in the data-regex; will that DWIM if I try to
> : match a unicode-string with that rule?
>
> It should. However (and this is a really big however), you'll have
> to be very careful that something earlier hasn't converted one form
> of Unicode to another on you. For instance, if your string came in
> as UTF-8, and your I/O layer translated it internally to UTF-32 or
> some such, you're just completely hosed. When you're working at the
> bytes level, you must know the encoding of your string.
>
> So the natural reaction is to open your I/O handle :raw to get binary
> data into your string. Then you try to match Unicode graphemes with [
> :u2 . ] and discover that *that* doesn't work. Which is obvious when
> you consider that Perl has no way of knowing which Unicode encoding
> the binary data is in, so it's gonna consider it to be something like
> Latin-1 unless you tell it otherwise. So you'll probably have to
> cast the binary string to whatever its actual encoding is (potentially
> lying about the binary parts, which we may or may not get away with,
> depending on who validates the string when), or maybe we just need
> to define rules like <utf16be_codepoint> and <utf8_grapheme> for use
> under the :u0 regime.

Of course the file must be opened in binary mode - else the line-endings etc.
can be destroyed in the binary data, which is bad.

So Perl/Parrot can't autodetect the kind of encoding.
But maybe it should be possible to do something like
    [:utf16be_codepoint]?
    Len: $?len:=(\d+) \n
    $?data:=([:raw .]<$len>) \n
ie. say that the conversion to unicode is optional??

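To make the point about working at the bytes level concrete, here is a
minimal sketch in Python rather than Perl 6, since the rule syntax above is
still in flux. It assumes a hypothetical record layout of "Len: <n>\n"
followed by n raw bytes and a newline; the names read_record and encoding
are made up for the illustration. The header is matched on raw bytes, and
the payload is decoded only when the caller names the encoding.

```python
import re
from typing import Optional, Union

# Bytes pattern: matches the raw header without decoding anything.
HEADER = re.compile(rb"Len: (\d+)\n")


def read_record(buf: bytes, encoding: Optional[str] = None) -> Union[bytes, str]:
    """Return the payload of one record (hypothetical format).

    The payload stays as raw bytes unless an encoding is given explicitly.
    """
    m = HEADER.match(buf)
    if m is None:
        raise ValueError("missing Len: header")
    n = int(m.group(1))
    start = m.end()
    payload = buf[start:start + n]
    # The length counts bytes, not characters, and must be followed by "\n".
    if len(payload) != n or buf[start + n:start + n + 1] != b"\n":
        raise ValueError("truncated record")
    return payload if encoding is None else payload.decode(encoding)


text = "größe"                      # 5 characters
data = text.encode("utf-8")         # 7 bytes in UTF-8
record = b"Len: %d\n%s\n" % (len(data), data)

raw = read_record(record)                 # untouched binary payload
good = read_record(record, "utf-8")       # correct encoding named by the caller
wrong = read_record(record, "latin-1")    # wrong guess: one character per byte
```

With the wrong guess the payload still decodes, but into 7 characters
instead of 5, which is the "something like Latin-1" failure described above.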
> : Is anything known about the internals of pattern matching whether the
> : hypothetical variables will consume (double) space?
> : I'm asking because I imagine getting a tag like "Len: 200000000" and then
> : having problems with 256MB RAM. Matching shouldn't be a problem according
> : to apo 5 (see the chapter "RFC 093: Regex: Support for incremental
> : pattern matching") but I'll maybe have troubles using the matched data?
>
> My understanding is that Parrot implements copy-on-write, so you should
> be okay there.

ok, thank you.

> Even the late ones? :-)

even them - this is the *only* answer I received.

Again:
> : Thank you for all answers!

> Larry

Phil