Re: question regarding rules and bytes vs characters

Ph. Marek Sun, 11 Jul 2004 23:03:46 -0700

> : Hello everybody,
> :
> : I'm about to teach myself perl6 (after using perl5 for some time).
>
> I'm also trying to learn perl6 after using perl5 for some time.  :-)
I wouldn't even try to compare you and me .... :-)


> Pretty close.  The way it's set up currently, $len is a reference
> to a variable external to the rule, so $len is likely to fail under
> stricture unless you've declared "my $len" somewhere.  To make the
> variable automatically scope to the rule, you have to use $?len
> these days.
ok.

> : Furthermore, perl6 is said to be Unicode-ready.
> : So I put the :u0-modifier in the data-regex; will that DWIM if I try to
> : match a Unicode string with that rule?
>
> It should.  However (and this is a really big however), you'll have
> to be very careful that something earlier hasn't converted one form
> of Unicode to another on you.  For instance, if your string came in
> as UTF-8, and your I/O layer translated it internally to UTF-32 or
> some such, you're just completely hosed.  When you're working at the
> bytes level, you must know the encoding of your string.
>
> So the natural reaction is to open your I/O handle :raw to get binary
> data into your string.  Then you try to match Unicode graphemes with
> [ :u2 . ] and discover that *that* doesn't work.  Which is obvious when
> you consider that Perl has no way of knowing which Unicode encoding
> the binary data is in, so it's gonna consider it to be something like
> Latin-1 unless you tell it otherwise.  So you'll probably have to
> cast the binary string to whatever its actual encoding is (potentially
> lying about the binary parts, which we may or may not get away with,
> depending on who validates the string when), or maybe we just need
> to define rules like <utf16be_codepoint> and <utf8_grapheme> for use
> under the :u0 regime.
Of course the file must be opened in binary mode; otherwise line endings
etc. inside the binary data can get mangled, which is bad.
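
To make the bytes-versus-characters point concrete, here is a minimal sketch in Python (not Perl 6, whose rule syntax is still in flux). The sample word is an assumption chosen for illustration. It shows that the same text has different byte lengths in different encodings, and that raw UTF-8 bytes read as Latin-1 give one "character" per byte:

```python
# Hypothetical sample text; U+00EF is a precomposed i with diaeresis.
text = "na\u00efve"

utf8 = text.encode("utf-8")        # the non-ASCII letter takes 2 bytes
utf16 = text.encode("utf-16-be")   # every character here takes 2 bytes

# Same five characters, different byte counts per encoding.
byte_lengths = (len(utf8), len(utf16))

# Without knowing the encoding, a Latin-1 reading sees one char per byte.
as_latin1 = utf8.decode("latin-1")

# Telling the decoder the real encoding recovers the original text.
as_utf8 = utf8.decode("utf-8")

print(byte_lengths, len(as_latin1), len(as_utf8))
```

So a byte count taken from a "Len:" header only means something if the data has not been re-encoded between the wire and the match.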

So Perl/Parrot can't autodetect the encoding.
But maybe it should be possible to write something like
        [:utf16be_codepoint]? Len: $?len:=(\d+) \n
        $?data:=([:raw .]<$len>) \n
i.e. to say that the conversion to Unicode is optional?
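
For comparison, the length-prefixed format itself can be sketched in Python working purely on bytes, so no encoding is ever guessed for the payload. The function name parse_record and the sample record are hypothetical, and this is only an illustration of the idea, not how a Perl 6 rule would be implemented:

```python
import re

# The header is plain ASCII, so a bytes pattern is enough to read it.
HEADER = re.compile(rb"Len: (\d+)\n")

def parse_record(buf: bytes, pos: int = 0):
    """Return (payload, next_pos) for one 'Len: N' record starting at pos."""
    m = HEADER.match(buf, pos)
    if m is None:
        raise ValueError("missing 'Len:' header")
    length = int(m.group(1))
    start = m.end()
    end = start + length
    if end + 1 > len(buf):
        raise ValueError("record is truncated")
    if buf[end:end + 1] != b"\n":
        raise ValueError("payload is not followed by a newline")
    # The payload is taken by byte count only; it may contain newlines,
    # NUL bytes or invalid UTF-8 without confusing the parser.
    return buf[start:end], end + 1

# Hypothetical record: 5 payload bytes including a newline and a NUL.
data = b"Len: 5\n\xff\x00\n\xc3\xa9\nrest"
payload, next_pos = parse_record(data)
print(payload, data[next_pos:])
```

Only the header is matched as text; the data part is never interpreted as characters at all.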

> : Is anything known about the internals of pattern matching, in particular
> : whether the hypothetical variables will consume (double) space?
> : I'm asking because I imagine getting a tag like "Len: 200000000" and then
> : having problems with 256MB RAM. Matching shouldn't be a problem according
> : to apo 5 (see the chapter "RFC 093: Regex: Support for incremental
> : pattern matching"), but might I have trouble using the matched data?
>
> My understanding is that Parrot implements copy-on-write, so you should
> be okay there.
ok, thank you.
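
As a rough analogy only: Python has no copy-on-write strings, but its memoryview shows the related idea of a captured span that shares the original buffer instead of duplicating it. This is a different technique from whatever Parrot does internally, and the buffer contents are a made-up sample:

```python
# Hypothetical record held in a mutable buffer.
buf = bytearray(b"Len: 5\nHELLO\n")

# A memoryview slice refers to the same memory; no payload bytes are copied.
view = memoryview(buf)[7:12]

# An explicit copy, by contrast, duplicates the payload.
copy = bytes(view)

# Changing the buffer is visible through the view but not through the copy,
# which demonstrates that the view shares storage with the original.
buf[7] = ord("J")
print(bytes(view), copy)
```

With sharing like this, a 200 MB match would not need a second 200 MB just to hold the captured data.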

> Even the late ones?  :-)
Even those - this is the *only* answer I received.

Again:
> : Thank you for all answers!

> Larry
Phil
