On Tue, Mar 23, 2010 at 08:51:16AM -0700, John Millikin wrote:
> On Tue, Mar 23, 2010 at 00:27, Johann Höchtl <johann.hoec...@gmail.com> wrote:
> > How are ByteStrings (Lazy, UTF8) and Data.Text meant to co-exist? When I
> > read bytestrings over a socket which happens to be UTF16-LE encoded and
> > identify a fitting function in Data.Text, I guess I have to transcode them
> > with Data.Text.Encoding to make the type System happy?
> >
>
> There's no such thing as a UTF8 or UTF16 bytestring -- a bytestring is
> just a more efficient encoding of [Word8], just as Text is a more
> efficient encoding of [Char]. If the file format you're parsing
> specifies that some series of bytes is text encoded as UTF16-LE, then
> you can use the Text decoders to convert to Text.
>
> Poor separation between bytes and characters has caused problems in
> many major languages (C, C++, PHP, Ruby, Python) -- lets not abandon
> the advantages of correctness to chase a few percentage points of
> performance.

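To make the quoted point concrete, here is a minimal sketch of the decoding step, assuming the bytes really are UTF-16LE. The names decodeWire and sample are hypothetical, made up for this illustration; decodeUtf16LE is the decoder from Data.Text.Encoding in the text package. As far as I know it raises an exception on malformed input, so treat this as a sketch rather than production code.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- | Hypothetical helper: interpret raw bytes as UTF-16LE text.
-- The ByteString itself carries no encoding; the caller asserts it here.
decodeWire :: B.ByteString -> T.Text
decodeWire = TE.decodeUtf16LE

-- | Hypothetical sample: the two characters 'h' and U+00E9,
-- each encoded as two bytes in UTF-16LE.
sample :: B.ByteString
sample = B.pack [0x68, 0x00, 0xE9, 0x00]
```

With these definitions, B.length sample counts bytes while T.length (decodeWire sample) counts characters, which is exactly the bytes-versus-characters distinction being discussed.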
I agree with the principle of correctness, but let's be honest: the gap
between ByteString on one side and String and Text on the other is (many)
orders of magnitude, not just a few percentage points…

I've been struggling with this problem too and it's not nice. Every time
one uses the system readFile & friends (anything that doesn't read via
ByteStrings), it is hell slow.

Test: read a file and compute its size in chars. The input text file is
~40MB in size and has one non-ASCII char. The test might seem stupid, but
it is a simple one. ghc 6.12.1.

  Data.ByteString.Lazy (bytestring readFile + length) - < 10 milliseconds,
    incorrect length (as expected).
  Data.ByteString.Lazy.UTF8 (system readFile + fromString + length) -
    11 seconds, correct length.
  Data.Text.Lazy (system readFile + pack + length) - 26s, correct length.
  String (system readFile + length) - ~1 second, correct length.

For the record:

  python2.6 (str type) - ~60ms, incorrect length.
  python3.1 (unicode) - ~310ms, correct length.

If anyone has a solution on how to work on fast text (unicode)
transformations (but not a 1:1 pipeline where fusion can work nicely), I'd
be glad to hear it.

iustin
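
One combination is not among the variants listed in the test: reading the file as bytes and decoding it to Text, instead of going through the system readFile and pack. Below is a minimal sketch of that variant, assuming the input file is UTF-8 encoded. The names charCount and countCharsInFile are hypothetical, and no timing is claimed for it here; it would have to be measured on the same 40MB input.

```haskell
import Data.Int (Int64)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- | Count characters (code points) in lazily read bytes.
-- Assumption: the bytes are valid UTF-8.
charCount :: BL.ByteString -> Int64
charCount = TL.length . TLE.decodeUtf8

-- | Hypothetical helper: read the file via ByteString, then decode,
-- so the String-based system readFile is never involved.
countCharsInFile :: FilePath -> IO Int64
countCharsInFile path = fmap charCount (BL.readFile path)
```

For a three-byte input such as 'h' followed by the two-byte UTF-8 encoding of U+00E9, BL.length reports 3 while charCount reports 2, so the result is the "correct length" in the sense used in the test.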