On Mon, Jan 21, 2002 at 05:09:06PM +0000, Dave Mitchell wrote:
> Jarkko Hietaniemi <[EMAIL PROTECTED]> wrote:
> > > In the good ol'days, one could usefully use regexes on 8-bit binary data,
> > > eg
> > > 
> > > open G, 'myfile.gif' or die;
> > > read G, $buf, 8192 or die;
> > > if ($buf =~ /^GIF89a\x08\x02/) {
> > >     .....
> > > 
> > > where it was clear to everyone that we are checking whether the first few
> > > bytes of the file contain (0x47, 0x49, ..., 0x02)
> > > 
> > > Is this sort of thing now completely dead in the Brave New World of
> > 
> > Of course not, I do not remember forbidding \xHH.  The default of
> > data coming in from filehandles could still be opaque 8-bit bytes.
> 
> Good :-)
> 
> I'm not clear though, how binary data could get passed to parrot's
> regex engine, unless there's a BINARY_8 CEF in addition to
> UNICODE_CEF_UTF_8 etc in C<typedef enum {...} PARROT_CEF>

Yes, that's somewhat problematic.  Making up "a byte CEF" would be
Wrong, though, because there is, by definition, no CCS to map from, and
we would be dangerously close to conflating the CES into it, too...
ACR-CCS-CEF-CES (abstract character repertoire, coded character set,
character encoding form, character encoding scheme).  Read the character
model.  Understand the character model.  Embrace the character model.
Be the character model.  (And once you're it, read the relevant Unicode,
XML, and Web standards.)
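
For anyone who wants to see the layers rather than read about them,
here is a minimal sketch using the Encode module.  It takes one abstract
character, names its Unicode code point (the CCS step), and serializes
it three ways (the CEF/CES steps).  The helper hex_bytes and the choice
of U+00E9 are mine, purely for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Abstract character: LATIN SMALL LETTER E WITH ACUTE.
# The Unicode CCS assigns it the code point U+00E9.
my $char = "\x{E9}";

# Illustrative helper: render a byte string as hex, e.g. "C3 A9".
sub hex_bytes {
    my ($octets) = @_;
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $octets;
}

# The same code point, serialized under three encodings.
my %serialized;
for my $encoding (qw(UTF-8 UTF-16BE UTF-16LE)) {
    $serialized{$encoding} = hex_bytes(encode($encoding, $char));
}

for my $encoding (sort keys %serialized) {
    printf "U+%04X as %-8s => %s\n", ord($char), $encoding, $serialized{$encoding};
}
```

One code point, three different byte sequences: which is why a bare
byte value cannot be treated as "a character" without saying which
layers it passed through.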

To highlight the difference between opaque numbers and characters,
the above should really be:

        if ($buf =~ /^\x47\x49\x46\x38\x39\x61\x08\x02/) { ... }

I think what needs to be done is that \xHH must not be encoded as
literals (as it is now: 'A' and \x41 are identical in ASCII), but
instead as regex nodes of their own, storing the code points.  Then
the regex engine can try both the "right/new way" (the Unicode code
point) and the "wrong/legacy way" (the native code point).
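
A rough sketch of that idea, in plain Perl rather than anything
resembling the real regex engine.  Everything here is hypothetical: the
node constructors, node_matches, match_prefix, and the three-entry
native table (EBCDIC values for 'A', 'B', 'C') exist only to show a
code point node being tried both ways while a literal node is not.

```perl
use strict;
use warnings;

# Hypothetical node types: a 'char' node comes from a literal such as 'A';
# a 'codepoint' node comes from an escape such as \x41 and keeps the number.
sub char_node      { return { type => 'char',      value => $_[0] } }
sub codepoint_node { return { type => 'codepoint', value => $_[0] } }

# Assumed native-to-Unicode table for a non-ASCII platform
# (EBCDIC 0xC1..0xC3 are 'A'..'C'); a real one would cover all 256 values.
my %NATIVE_TO_UNICODE = (0xC1 => 0x41, 0xC2 => 0x42, 0xC3 => 0x43);

# Returns a true value if $node matches the single character $char.
# For code point nodes the value says which interpretation matched.
sub node_matches {
    my ($node, $char) = @_;
    return $node->{value} eq $char if $node->{type} eq 'char';

    my $number = $node->{value};
    return 'unicode' if ord($char) == $number;          # right/new way
    my $mapped = $NATIVE_TO_UNICODE{$number};
    return 'native' if defined $mapped && ord($char) == $mapped;  # legacy way
    return '';
}

# Match a sequence of nodes against the start of $subject.
sub match_prefix {
    my ($nodes, $subject) = @_;
    return 0 if length($subject) < @$nodes;
    for my $i (0 .. $#$nodes) {
        return 0 unless node_matches($nodes->[$i], substr($subject, $i, 1));
    }
    return 1;
}

# 'A' followed by \x42: literal node, then a code point node.
print "matched\n" if match_prefix([char_node('A'), codepoint_node(0x42)], 'ABC');
```

The cost is visible even in the toy: every code point node is
potentially ambiguous, so \xC1 matches both U+00C1 and 'A'.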

String literals have the same problem.  What does "foo\x41" mean?
(Here, unlike with the regular expressions, we can't "try both",
unless we integrate Damian's quantum state variables into the core :-)
We have various options: there might be a pragma to say which CCS
"naked codepoints" are to be understood in, or the default could be
grovelled out of environment settings (both these options could affect
the regex solution, too), and so forth.
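
To make the ambiguity concrete, here is a small sketch with the Encode
module.  The helper codepoint_in is made up for illustration and stands
in for whatever such a pragma or environment default would select; the
two CCS names are simply ones Encode ships with (Latin-1 and an EBCDIC
code page).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: interpret one "naked" \xHH value under a named CCS
# and return the character it denotes there.
sub codepoint_in {
    my ($ccs, $value) = @_;
    my $octet = chr($value);         # a single byte, 0x00..0xFF
    return decode($ccs, $octet);
}

# The same number, read under two different coded character sets.
for my $ccs (qw(iso-8859-1 cp37)) {
    printf "\\x41 under %-10s is U+%04X\n", $ccs, ord(codepoint_in($ccs, 0x41));
}

# Under the EBCDIC code page it is \xC1, not \x41, that means 'A'.
print "cp37 \\xC1 is 'A'\n" if codepoint_in('cp37', 0xC1) eq 'A';
```

So "foo\x41" is "fooA" only if somebody, somewhere, has decided which
CCS the 0x41 belongs to.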

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
