On Mon, Jul 12, 2004 at 07:42:02AM +0200, Ph. Marek wrote:
: Of course the file must be opened in binary mode - else the line-endings etc.
: can be destroyed in the binary data, which is bad.
:
: So Perl/Parrot can't autodetect the kind of encoding.
: But maybe it should be possible to do something like
:     [:utf16be_codepoint]? Len: $?len:=(\d+) \n
:     $?data:=([:raw .]<$len>) \n
: ie. say that the conversion to unicode is optional??

Yes, that's probably better than forcing an official encoding on
something that doesn't have a consistent encoding.  Though I don't
believe you want the square brackets there.  Something more like:

    :utf16be_codepoint
    Len: $?len:=(\d+) \n
    $?data:=(:byte . <$len>) \n

(since the :byte will scope to the capturing parens, and I imagine
$?len ends up being immediately typed as Unicode such that it can be
used as a number even in a :byte context.)  Or if you want the brackets
for clarity:

    [:utf16be_codepoint
        Len: $?len:=(\d+) \n
        $?data:=([:byte .] <$len>) \n
    ]

Probably :utf16be_codepoint wants to be written :utf16be:codepoint or
some such, since the encoding and the unicode abstraction level are
(mostly) orthogonal.  Or maybe it's :code("utf16be").  Or even better,
maybe the encoding is an optional named parameter, as in :code(:utf16be),
where :code by itself defaults to :code(:utf8).  That extends nicely to
things like :graph(:utf32) and :lang("de",:scsu), where :lang requires
the language to be specified, but can default the encoding to something
reasonable.

Hmm, maybe that means that language-dependent graphemes are called
"langs", which I suppose is short for "langemes".

I suppose that :byte could also take an argument to force a particular
old-style (single-byte) locale, if we choose to support them, and are
willing to take the consequences of Jarkko going postal.  :-)

Larry
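
For readers who want something runnable: the record layout that the
rules above describe (a UTF-16BE "Len: <digits>" header line followed
by that many raw bytes) can be sketched in Python.  This is only an
illustration, not Perl 6 and not part of the proposal.  The name
parse_record is hypothetical, and the sketch assumes that both newlines
are UTF-16BE code units, since :byte is scoped to the capturing parens
only.

```python
from typing import Tuple

# Assumption: everything outside the data capture is UTF-16BE text.
PREFIX = "Len: ".encode("utf-16-be")
NEWLINE = "\n".encode("utf-16-be")  # the two bytes 0x00 0x0A


def parse_record(buf: bytes) -> Tuple[int, bytes]:
    """Hypothetical helper: split one record into (length, raw data)."""
    if not buf.startswith(PREFIX):
        raise ValueError("missing 'Len: ' header")

    # Scan the header in 2-byte code units until the UTF-16BE newline.
    # U+000A can never be half of a surrogate pair, so this is safe.
    end = len(PREFIX)
    while True:
        unit = buf[end:end + 2]
        if len(unit) < 2:
            raise ValueError("unterminated header line")
        if unit == NEWLINE:
            break
        end += 2

    # The digits are decoded as text, like $?len in the rule.
    digits = buf[len(PREFIX):end].decode("utf-16-be")
    if not digits.isdecimal():
        raise ValueError("length is not a decimal number")
    length = int(digits)

    # The data itself is taken byte for byte, with no decoding at all.
    start = end + len(NEWLINE)
    data = buf[start:start + length]
    if len(data) != length:
        raise ValueError("record is shorter than its declared length")
    if buf[start + length:start + length + len(NEWLINE)] != NEWLINE:
        raise ValueError("missing newline after data")
    return length, data


# Usage: the payload holds bytes that are not valid text in any encoding.
payload = b"\x00\xff\r\n\x80"
record = "Len: 5\n".encode("utf-16-be") + payload + NEWLINE
assert parse_record(record) == (5, payload)
```

The length is decoded as text while the payload is never decoded, which
is the split between :utf16be_codepoint and :byte in the rules above.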