Iustin Pop <iu...@k1024.org> writes: > On Tue, Mar 23, 2010 at 03:31:33PM -0400, Nick Bowler wrote: > > > So that's a 30% reduction in throughput. I'd say that's a lot worse > > than a few percentage points, but certainly not orders of magnitude. > > Because you're possibly benchmarking the disk also. With a 400MB > file on tmpfs, lazy bytestring readfile + length takes on my machine > ~150ms, which is way faster than 8 seconds…
If you read the source code, length do not read the data, that's why it is so fast. It cannot be done for UTF-8 strings. >From Data.ByteString.Lazy: -- | /O(n\/c)/ 'length' returns the length of a ByteString as an -- | 'Int64' length :: ByteString -> Int64 length cs = foldlChunks (\n c -> n + fromIntegral (S.length c)) 0 cs {-# INLINE length #-} > > On the other hand, using Data.ByteString.Lazy.readFile and > > Data.ByteString.Lazy.UTF8.length, we get results of around 12000ms with > > approximately 5% of that time spent in GC, which is rather worse than > > the Prelude. Data.Text.Lazy.IO.readFile and Data.Text.Lazy.length are > > even worse, with results of around 25 *seconds* (!!) and 2% of that time > > spent in GC. > > > > GNU wc computes the correct answer as quickly as lazy bytestrings > > compute the wrong answer. With perl 5.8, slurping the entire file as > > UTF-8 computes the correct answer just as slowly as Prelude. In my > > first ever Python program (with python 2.6), I tried to read the entire > > file as a unicode string and it quickly crashes due to running out of > > memory (yikes!), so it earns a DNF. > > > > So, for computing the right answer with this simple test, it looks like > > the Prelude is the best option. We tie with Perl and lose only to GNU > > wc (which is written in C). Really, though, it would be nice to close > > that gap. > > Totally agreed :) texi...@flyeeeng:~$ ./wc-utf8 /dev/shm/haskell-utf8.txt Normal String + System.IO "60452700": 5.575169s Data.ByteString.Lazy "61965200": 0.088136s Data.ByteString.Lazy.UTF8 "60452700": 13.953714s Cheating a little bit "60452700": 9.307322s Data.Text.Lazy "60452700": 15.608354s texi...@flyeeeng:~$ time wc /dev/shm/haskell-utf8.txt 1329900 8065200 61965200 /dev/shm/haskell-utf8.txt real 0m9.303s user 0m9.089s sys 0m0.152s texi...@flyeeeng:~$ Hey, normal string way faster than GNU wc! Cheat sheet, using Data.ByteString.Lazy: myLength :: U.ByteString -> Int myLength b = loop 0 b where loop n xs = case readChar xs of Just m -> let n' = n+1 in n' `seq` loop n' (L.drop m xs) Nothing -> n readChar :: L.ByteString -> Maybe Int64 readChar bs = do (c,_) <- L.uncons bs return (choose (fromEnum c)) where choose :: Int -> Int64 choose c | c < 0xc0 = 1 | c < 0xe0 = 2 | c < 0xf0 = 3 | c < 0xf8 = 4 | otherwise = 1 inspired by Data.ByteString.Lazy.UTF8, same performances as GNU wc (it is cheating because it do not check the validity of the multibyte char). Using Debian testing, ghc 6.12.1 on Atom N270 @ 1.6GHz. The file is a repeated LaTeX UTF8 file of about 60MB. -- Guillaume Pinot http://www.irccyn.ec-nantes.fr/~pinot/ « Les grandes personnes ne comprennent jamais rien toutes seules, et c'est fatigant, pour les enfants, de toujours leur donner des explications... » -- Antoine de Saint-Exupéry, Le Petit Prince () ASCII ribbon campaign -- Against HTML e-mail /\ http://www.asciiribbon.org -- Against proprietary attachments _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe