On Tue, 2009-02-03 at 11:03 -0600, John Goerzen wrote:

> Will there also be something to handle the UTF-16 BOM marker?  I'm not
> sure what the best API for that is, since it may or may not be present,
> but it should be considered -- and could perhaps help autodetect encoding.

I think someone else mentioned this already, but utf16 (as opposed to
utf16be/le) will use the BOM if it's present. I'm not quite sure what
happens when you switch encoding; presumably it'll accept and consider a
BOM at that point.

> > Thanks to suggestions from Duncan Coutts, it's possible to call
> > hSetEncoding even on buffered read Handles, and the right thing
> > happens.  So we can read from text streams that include multiple
> > encodings, such as an HTTP response or email message, without having
> > to turn buffering off (though there is a penalty for switching
> > encodings on a buffered Handle, as the IO system has to do some
> > re-decoding to figure out where it should start reading from again).
>
> Sounds useful, but is this the bit that causes the 30% performance hit?

No. You only pay that penalty if you switch encoding. The standard case
has no extra cost.

> > Performance is about 30% slower on "hGetContents >>= putStr" than
> > before.  I've profiled it, and about 25% of this is in doing the
> > actual encoding/decoding, the rest is accounted for by the fact that
> > we're shuffling around 32-bit chars rather than bytes in the Handle
> > buffer, so there's not much we can do to improve this.
>
> Does this mean that if we set the encoding to latin1, that we should see
> performance 5% worse than present?

No, I think that's 30% for latin1. The cost is not really the character
conversion but the copying from a byte buffer via iconv to a char
buffer.

> 30% slower is a big deal, especially since we're not all that speedy now.

Bear in mind that's talking about the [Char] interface, and nobody using
that is expecting great performance. We already have an API for getting
big chunks of bytes out of a Handle; with the new Handle we'll also want
something equivalent for a packed text representation. Hopefully we can
get something nice with the new text package.

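To make the two points about encodings concrete, here is a minimal sketch, not a definitive implementation. It assumes the hSetEncoding, utf16, utf8 and latin1 names exported from System.IO; the file names and the helper functions (writeBytes, readWithBom, readMixed) are hypothetical and exist only for this illustration. The first part reads a little-endian UTF-16 file through utf16 and relies on the BOM to select the byte order; the second switches the decoder on a buffered read Handle between a header line and a body line.

```haskell
import System.IO

-- Hypothetical scratch files, used only for this demonstration.
bomFile, mixedFile :: FilePath
bomFile   = "bom-demo.txt"
mixedFile = "mixed-demo.txt"

-- Write raw bytes: a binary-mode Handle emits one byte per Char.
writeBytes :: FilePath -> [Int] -> IO ()
writeBytes path bytes =
  withBinaryFile path WriteMode $ \h -> hPutStr h (map toEnum bytes)

-- "hi\n" as little-endian UTF-16, preceded by the BOM bytes FF FE.
-- utf16 (unlike utf16le/utf16be) inspects the BOM to pick the byte order.
readWithBom :: IO String
readWithBom = do
  writeBytes bomFile [0xFF, 0xFE, 0x68, 0x00, 0x69, 0x00, 0x0A, 0x00]
  withFile bomFile ReadMode $ \h -> do
    hSetEncoding h utf16
    hGetLine h

-- An ASCII header line followed by a UTF-8 body line.
-- '\233' is U+00E9 (e with acute accent), two bytes in UTF-8.
-- The Handle stays buffered while the decoder is switched.
readMixed :: IO (String, String)
readMixed = do
  withFile mixedFile WriteMode $ \h -> do
    hSetEncoding h utf8
    hPutStr h "charset=utf-8\nh\233llo\n"
  withFile mixedFile ReadMode $ \h -> do
    hSetEncoding h latin1
    header <- hGetLine h
    hSetEncoding h utf8   -- this switch is where the re-decoding cost is paid
    body <- hGetLine h
    return (header, body)

main :: IO ()
main = do
  readWithBom >>= print
  readMixed   >>= print
```

Only the second hSetEncoding call in readMixed should incur the switching penalty mentioned above; a Handle that keeps a single encoding never takes that path.
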
Duncan

_______________________________________________
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users