Sounds bad. But it'll need someone with bytestring expertise to debug. Maybe there's a GHC problem underlying; or maybe it's shortcoming of bytestring.
Simon | -----Original Message----- | From: Glasgow-haskell-users [mailto:glasgow-haskell-users- | [email protected]] On Behalf Of Dominic Steinitz | Sent: 25 August 2016 10:11 | To: GHC users <[email protected]> | Subject: GHC Performance / Replacement for R? | | I am trying to use Haskell as a replacement for R but running into two | problems which I describe below. Are there any plans to address the | performance issues I have encountered? | | 1. I seem to have to jump through a lot of hoops just to be able to | select the data I am interested in. | | {-# LANGUAGE ScopedTypeVariables #-} | | {-# OPTIONS_GHC -Wall #-} | | import Data.Csv hiding ( decodeByName ) | import qualified Data.Vector as V | | import Data.ByteString ( ByteString ) | import qualified Data.ByteString.Char8 as B | | import qualified Pipes.Prelude as P | import qualified Pipes.ByteString as Bytes import Pipes import | qualified Pipes.Csv as Csv import System.IO | | import qualified Control.Foldl as L | | main :: IO () | main = withFile "examples/787338586_T_ONTIME.csv" ReadMode $ \h -> do | let csvs :: Producer (V.Vector ByteString) IO () | csvs = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat | uvectors :: Producer (V.Vector ByteString) IO () | uvectors = csvs >-> P.map (V.foldr V.cons V.empty) | vec_vec <- L.impurely P.foldM L.vector uvectors | print $ (vec_vec :: V.Vector (V.Vector ByteString)) V.! 17 | print $ V.length vec_vec | let rockspring = V.filter (\x -> x V.! 8 == B.pack "RKS") vec_vec | print $ V.length rockspring | | Here's the equivalent R: | | df <- read.csv("787338586_T_ONTIME.csv") | rockspring <- df[df$ORIGIN == "RKS",] | | 2. Now I think I could improve the above to make an environment that | is more similar to the one my colleagues are used to in R but more | problematical is the memory usage. | | * 112.5M file | * Just loading the source into ghci takes 142.7M | * > foo <- readFile "examples/787338586_T_ONTIME.csv" > length foo | takes me up to 4.75G. But we probably don't want to do this! | * Let's try again. | * > :set -XScopedTypeVariables | * > h <- openFile "examples/787338586_T_ONTIME.csv" ReadMode | * > let csvs :: Producer (V.Vector ByteString) IO () = Csv.decode | HasHeader (Bytes.fromHandle h) >-> P.concat | * > let uvectors :: Producer (V.Vector ByteString) IO () = csvs >-> | P.map (V.map id) >-> P.map (V.foldr V.cons V.empty) | * > vec_vec :: V.Vector (V.Vector ByteString) <- L.impurely P.foldM | L.vector uvectors | * Now I am up at 3.17G. In R I am under 221.3M. | * > V.length rockspring takes a long time to return 155 and now I am | at 3.5G!!! In R > rockspring <- df[df$ORIGIN == "RKS",] seems | instantaneous and now uses only 379.5M. | * > length(rockspring) 37 > length(df$ORIGIN) 471949 i.e. there are | 37 columns and 471,949 rows. | | Running this as an executable gives | | ~/Dropbox/Private/labels $ ./examples/BugReport +RTS -s ["2014-01- | 01","EV","20366","N904EV","2512","10747","1074702","30747", | "BRO","Brownsville, TX","Texas","11298","1129803","30194", | "DFW","Dallas/Fort Worth, TX","Texas","0720","0718", | "-2.00","8.00","0726","0837","7.00","0855","0844","-11.00","0.00", | "","0.00","482.00","","","","","",""] | 471949 | 155 | 14,179,764,240 bytes allocated in the heap | 3,378,342,072 bytes copied during GC | 786,333,512 bytes maximum residency (13 sample(s)) | 36,933,976 bytes maximum slop | 1434 MB total memory in use (0 MB lost due to | fragmentation) | | Tot time (elapsed) Avg pause | Max pause | Gen 0 26989 colls, 0 par 1.423s 1.483s 0.0001s | 0.0039s | Gen 1 13 colls, 0 par 1.005s 1.499s 0.1153s | 0.6730s | | INIT time 0.000s ( 0.003s elapsed) | MUT time 3.195s ( 3.193s elapsed) | GC time 2.428s ( 2.982s elapsed) | EXIT time 0.016s ( 0.138s elapsed) | Total time 5.642s ( 6.315s elapsed) | | %GC time 43.0% (47.2% elapsed) | | Alloc rate 4,437,740,019 bytes per MUT second | | Productivity 57.0% of total user, 50.9% of total elapsed | | _______________________________________________ | Glasgow-haskell-users mailing list | [email protected] | https://na01.safelinks.protection.outlook.com/?url=http%3a%2f%2fmail.h | askell.org%2fcgi-bin%2fmailman%2flistinfo%2fglasgow-haskell- | users&data=01%7c01%7csimonpj%40microsoft.com%7c5017a5fe26cb4df9c41d08d | 3ccc7b5bd%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=2Ku1Fr5QttHRoj5 | NSOJREZrt2Fsqhi63iJOpxmku68E%3d _______________________________________________ Glasgow-haskell-users mailing list [email protected] http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users
