Re: question regarding rules and bytes vs characters

Larry Wall Mon, 12 Jul 2004 11:46:02 -0700

On Mon, Jul 12, 2004 at 07:42:02AM +0200, Ph. Marek wrote:
: Of course the file must be opened in binary mode - else the line-endings etc. 
: can be destroyed in the binary data, which is bad.
: 
: So Perl/Parrot can't autodetect the kind of encoding.
: But maybe it should be possible to do something like
:       [:utf16be_codepoint]? Len: $?len:=(\d+) \n
:         $?data:=([:raw .]<$len>) \n
: i.e. say that the conversion to Unicode is optional?


Yes, that's probably better than forcing an official encoding on something
that doesn't have a consistent encoding.  Though I don't believe you want
the square brackets there.  Something more like:

        :utf16be_codepoint Len: $?len:=(\d+) \n
            $?data:=(:byte . <$len>) \n

(since the :byte will scope to the capturing parens, and I imagine
$?len ends up being immediately typed as Unicode such that it can be
used as a number even in a :byte context.)

Or if you want the brackets for clarity:

        [:utf16be_codepoint
            Len: $?len:=(\d+) \n
            $?data:=([:byte .] <$len>) \n
        ]
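
For comparison, here is a rough sketch in Python of the hand-written parse
that rules like the ones above would replace. Python is used only because the
rule syntax is still a proposal and cannot be run. The function name
parse_record and the exact record layout (a "Len:" header and newline encoded
as UTF-16BE, optional whitespace before the digits, then that many raw bytes,
then another UTF-16BE newline) are assumptions made for illustration, not
anything Perl or Parrot defines.

```python
import re
from typing import Tuple

# Hypothetical helper for illustration; not part of any Perl 6 or Parrot API.
NEWLINE = "\n".encode("utf-16-be")  # two bytes: 0x00 0x0A


def parse_record(buf: bytes) -> Tuple[int, bytes]:
    """Parse a UTF-16BE 'Len: <digits>' header line, then <len> raw bytes."""
    # Scan two bytes at a time so the newline is only matched on a
    # UTF-16 code-unit boundary, never in the middle of a character.
    end = -1
    for i in range(0, len(buf) - 1, 2):
        if buf[i:i + 2] == NEWLINE:
            end = i
            break
    if end < 0:
        raise ValueError("no header terminator found")

    # Only the header is decoded; this plays the role of :utf16be_codepoint.
    header = buf[:end].decode("utf-16-be")
    match = re.fullmatch(r"Len:\s*(\d+)", header)
    if match is None:
        raise ValueError("malformed header: %r" % header)
    length = int(match.group(1))

    # The payload is sliced as bytes and never decoded; this plays the
    # role of the :byte scope inside the capturing parens.
    start = end + len(NEWLINE)
    data = buf[start:start + length]
    if len(data) != length:
        raise ValueError("truncated data")
    if buf[start + length:start + length + len(NEWLINE)] != NEWLINE:
        raise ValueError("missing newline after data")
    return length, data


# The payload deliberately contains 0x00 0x0A, which would look like a
# UTF-16BE newline if the data were scanned as text.
payload = bytes([0x00, 0x0A, 0xFF, 0x0D])
record = "Len: 4\n".encode("utf-16-be") + payload + NEWLINE
length, data = parse_record(record)
```

Note that the length decoded from the text part is used as a plain integer
to slice the undecoded part, which is the same hand-off $?len makes into
the :byte context.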

Probably :utf16be_codepoint wants to be written :utf16be:codepoint
or some such, since the encoding and the Unicode abstraction level
are (mostly) orthogonal.  Or maybe it's :code("utf16be").  Or even
better, maybe the encoding is an optional named parameter, as in
:code(:utf16be), where :code by itself defaults to :code(:utf8).  That
extends nicely to things like :graph(:utf32) and :lang("de",:scsu),
where :lang requires the language to be specified, but can default
the encoding to something reasonable.
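
To make the orthogonality concrete, here is a small Python sketch (again
only an illustration, not proposed syntax) that counts the same string at
three levels. The function names are made up for this example, the
encoding parameter defaulting to UTF-8 mirrors :code defaulting to
:code(:utf8), and the grapheme count is a crude approximation that merely
skips combining marks rather than implementing real Unicode segmentation.

```python
import unicodedata


def count_bytes(text: str, encoding: str = "utf-8") -> int:
    """Byte level: the answer depends on the encoding chosen."""
    return len(text.encode(encoding))


def count_codepoints(text: str) -> int:
    """Codepoint level: independent of any encoding."""
    return len(text)


def count_graphemes_approx(text: str) -> int:
    """Rough grapheme level: count code points that are not combining marks.

    This is an approximation for illustration only; it is not the full
    Unicode grapheme cluster algorithm.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# 'e' followed by COMBINING ACUTE ACCENT, then 'a'
sample = "e\u0301a"

byte_counts = {
    enc: count_bytes(sample, enc)
    for enc in ("utf-8", "utf-16-be", "utf-32-be")
}
codepoints = count_codepoints(sample)
graphemes = count_graphemes_approx(sample)
```

The byte counts change with the encoding while the codepoint and grapheme
counts do not, which is why it seems cleaner to pass the encoding as an
argument to the level than to fuse the two into one name.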

Hmm, maybe that means that language-dependent graphemes are called
"langs", which I suppose is short for "langemes".

I suppose that :byte could also take an argument to force a particular
old-style (single-byte) locale, if we choose to support them, and are
willing to take the consequences of Jarkko going postal.  :-)

Larry
