<[EMAIL PROTECTED]> writes:
>On Wed, Aug 27, 2003 at 06:04:48PM +0200, Guido Flohr wrote:
>> Hi,
>>
>> [EMAIL PROTECTED] wrote:
>> >I'm working with a byte oriented protocol, and need to extract byte n1
>> >through byte n2 from a string.
No problem (honest ;-)), at least in perl5.8.

A byte is a number in the range 0..255, and we can represent that as a
character with ordinal value < 256. So your sequence of bytes maps
exactly to a sequence of characters: you can put your bytes in a string
and then use substr() etc. on them, just as you always could in
traditional perl (and other languages).

Where the snags could creep in is if other parts of your application
are dealing with Characters in their Wider meaning. If that is the case
you must make sure they get "encoded" into a byte stream before your
protocol gets to see them. That is what the Encode module, MIME::Base64
etc. are for.

>>
>> I read this as "*character* n1 through *character* n2", right?
>
>Alas, no -- I'm interested in byte n1 through byte n2. This is because the
>protocol I am working with uses byte offsets. substr() works like a charm as
>long as 1 char = 1 byte, but in utf8 it breaks down.

No it doesn't. So long as you don't tell perl there are UTF-8 encoded
characters in there, it will not notice. IO still defaults to reading
bytes. You can tell perl that those bytes represent encoded characters
(either as UTF-8 or in your current locale's encoding) but you don't
have to.

>
>So given a string of utf8 data $x I want to be able to extract bytes 3 - 12 from
>it...not characters :(

So

  use Encode qw(encode);
  my $string = "Any \x{xxxx} etc.";       # \x{xxxx} stands for any wide character
  my $bytes  = encode("UTF-8", $string);  # output is bounded 0..255
  my $field  = substr($bytes,3,9);        # nine bytes starting at byte offset 3

(Now back in perl5.6 we had not got this thought through, and there was
all kinds of weird "use bytes" and "no utf8" confusion in the
descriptions.)

Note the above assumes that something is working with $string as
characters. Just messing with an all-bytes protocol is even simpler:
perl does not need to know that (some of) those bytes are UTF-8 for
characters. That is, you only need to get Encode involved when you need
to mix bytes-for-protocol with payload-is-characters.

>
>//Ed
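
To make both cases concrete, here is a rough sketch that should run on
perl5.8 or later. The wide character U+263A, the sample packet bytes and
all the variable names are picked purely for illustration; they are
assumptions, not part of anybody's real protocol.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Case 1: the application holds characters, the protocol wants bytes.
# U+263A is an arbitrary wide character chosen for this example.
my $string = "Any \x{263A} etc.";

# After encoding, every element of $bytes has an ordinal in 0..255,
# so substr() offsets on $bytes are byte offsets.
my $bytes = encode("UTF-8", $string);

my $char_len = length($string);   # 10: counts characters
my $byte_len = length($bytes);    # 12: U+263A takes three bytes in UTF-8

# Nine bytes starting at byte offset 3.
my $field = substr($bytes, 3, 9);

# Case 2: an all-bytes protocol. Perl is never told that bytes 2..4
# happen to be UTF-8, so substr() simply slices bytes.
# The packet contents are made up for this example.
my $raw     = "\x01\x02\xE2\x98\xBA\x03";
my $payload = substr($raw, 2, 3);

# Only when the payload must be treated as characters is Encode needed.
my $char = decode("UTF-8", $payload);

printf "chars=%d bytes=%d\n", $char_len, $byte_len;
printf "field=%s\n", join(" ", map { sprintf "%02X", ord } split //, $field);
printf "payload bytes=%d, decoded chars=%d, ord=U+%04X\n",
    length($payload), length($char), ord($char);
```

The point of the second half is that $raw never goes near Encode until
the very last step, and substr() behaves exactly as it would in
traditional perl.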