On Mon, Jan 21, 2002 at 05:09:06PM +0000, Dave Mitchell wrote:
> Jarkko Hietaniemi <[EMAIL PROTECTED]> wrote:
> > > In the good ol'days, one could usefully use regexes on 8-bit binary data,
> > > eg
> > >
> > >     open G, 'myfile.gif' or die;
> > >     read G, $buf, 8192 or die;
> > >     if ($buf =~ /^GIF89a\x08\x02/) {
> > >         .....
> > >
> > > where it was clear to everyone that we are checking whether the first few
> > > bytes of the file contain (0x47, 0x49, ..., 0x02)
> > >
> > > Is this sort of thing now completely dead in the Brave New World of
> >
> > Of course not, I do not remember forbidding \xHH.  The default of
> > data coming in from filehandles could still be opaque 8-bit bytes.
>
> Good :-)
>
> I'm not clear though, how binary data could get passed to parrot's
> regex engine, unless there's a BINARY_8 CEF in addition to
> UNICODE_CEF_UTF_8 etc in C<typedef enum {...} PARROT_CEF>

Yes, that's somewhat problematic.  Making up "a byte CEF" would be
Wrong, though, because there is, by definition, no CCS to map, and we
would be dangerously close to conflating in CES, too...
ACR-CCS-CEF-CES.  Read the character model.  Understand the character
model.  Embrace the character model.  Be the character model.  (And
once you're it, read the relevant Unicode, XML, and Web standards.)

To highlight the difference between opaque numbers and characters,
the above should really be:

    if ($buf =~ /\x47\x49\x46\x38\x39\x61\x08\x02/) { ... }

I think what needs to be done is that \xHH must not be encoded as
literals (as it is now, 'A' and \x41 are identical (in ASCII)), but
instead as regex nodes of their own, storing the code points.  Then
the regex engine can try both the "right/new way" (the Unicode code
point), and the "wrong/legacy way" (the native code point).

String literals have the same problem.  What does "foo\x41" mean?
(Here, unlike with the regular expressions, we can't "try both",
unless we integrate Damian's quantum state variables to the core :-)
We have various options: there might be a pragma to tell what CCS
"naked codepoints" are to be understood in, or the default could be
grovelled out of environment settings (both these options could
affect the regex solution, too), and so forth.

-- 
$jhi++; # http://www.iki.fi/jhi/
        # There is this special biologist word we use for 'stable'.
        # It is 'dead'. -- Jack Cohen
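
For reference, here is a minimal sketch of the "as it is now" behaviour
described above, assuming Perl 5 on an ASCII-based platform.  The buffer
is hypothetical sample data built with pack() rather than read from
myfile.gif, the variable names are illustrative, and the ^ anchor is
taken from the quoted original.  It only shows that the literal-character
form and the all-\xHH form are currently interchangeable; it says nothing
about how parrot's regex engine would or should treat them.

```perl
use strict;
use warnings;

# Hypothetical sample data: the eight header bytes from the quoted
# example, built from explicit byte values instead of a real file.
my $buf = pack('C*', 0x47, 0x49, 0x46, 0x38, 0x39, 0x61, 0x08, 0x02);

# Form 1: character literals mixed with \xHH escapes.
my $literal_form = $buf =~ /^GIF89a\x08\x02/ ? 1 : 0;

# Form 2: every byte spelled as \xHH, i.e. as an opaque number.
my $hex_form = $buf =~ /^\x47\x49\x46\x38\x39\x61\x08\x02/ ? 1 : 0;

# On an ASCII platform 'A' and "\x41" are the same one-character string.
my $same = ('A' eq "\x41") ? 1 : 0;

print "literal form matches: $literal_form\n";
print "hex form matches:     $hex_form\n";
print "'A' eq \"\\x41\":        $same\n";
```

On an ASCII-based platform all three values come out as 1.  On a
platform with a different native character set the literal form would
no longer name the same bytes, which is the distinction the all-\xHH
spelling is meant to make explicit.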