GHC Performance / Replacement for R?

2016-08-25 Thread Dominic Steinitz
I am trying to use Haskell as a replacement for R but running into two
problems which I describe below. Are there any plans to address the
performance issues I have encountered?

 1. I seem to have to jump through a lot of hoops just to be able to
select the data I am interested in.

{-# LANGUAGE ScopedTypeVariables #-}

{-# OPTIONS_GHC -Wall #-}

import Data.Csv hiding ( decodeByName )
import qualified Data.Vector as V

import Data.ByteString ( ByteString )
import qualified Data.ByteString.Char8 as B

import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as Bytes
import Pipes
import qualified Pipes.Csv as Csv
import System.IO

import qualified Control.Foldl as L

main :: IO ()
main = withFile "examples/787338586_T_ONTIME.csv" ReadMode $ \h -> do
  let csvs :: Producer (V.Vector ByteString) IO ()
  csvs = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat
  uvectors :: Producer (V.Vector ByteString) IO ()
  uvectors = csvs  >-> P.map (V.foldr V.cons V.empty)
  vec_vec <- L.impurely P.foldM  L.vector uvectors
  print $ (vec_vec :: V.Vector (V.Vector ByteString)) V.! 17
  print $ V.length vec_vec
  let rockspring = V.filter (\x -> x V.! 8 == B.pack "RKS") vec_vec
  print $ V.length rockspring

Here's the equivalent R:

df <- read.csv("787338586_T_ONTIME.csv")
rockspring <- df[df$ORIGIN == "RKS",]

 2. Now I think I could improve the above to make an environment that
is more similar to the one my colleagues are used to in R but more
problematical is the memory usage.

 * 112.5M file
 * Just loading the source into ghci takes 142.7M
 * > foo <- readFile "examples/787338586_T_ONTIME.csv" > length foo
   takes me up to 4.75G. But we probably don't want to do this!
 * Let's try again.
 * > :set -XScopedTypeVariables
 * > h <- openFile "examples/787338586_T_ONTIME.csv" ReadMode
 * > let csvs :: Producer (V.Vector ByteString) IO () = Csv.decode HasHeader 
(Bytes.fromHandle h) >-> P.concat
 * > let uvectors :: Producer (V.Vector ByteString) IO () = csvs  >-> P.map 
(V.map id) >-> P.map (V.foldr V.cons V.empty)
 * > vec_vec :: V.Vector (V.Vector ByteString) <- L.impurely P.foldM  L.vector 
uvectors
 * Now I am up at 3.17G. In R I am under 221.3M.
 * > V.length rockspring takes a long time to return 155 and now I am
   at 3.5G!!! In R > rockspring <- df[df$ORIGIN == "RKS",] seems
   instantaneous and now uses only 379.5M.
 * > length(rockspring) 37 > length(df$ORIGIN) 471949 i.e. there are
   37 columns and 471,949 rows.

Running this as an executable gives

~/Dropbox/Private/labels $ ./examples/BugReport +RTS -s
["2014-01-01","EV","20366","N904EV","2512","10747","1074702","30747",
 "BRO","Brownsville, TX","Texas","11298","1129803","30194",
  "DFW","Dallas/Fort Worth, TX","Texas","0720","0718",
  "-2.00","8.00","0726","0837","7.00","0855","0844","-11.00","0.00",
  "","0.00","482.00","","","","","",""]
471949
155
  14,179,764,240 bytes allocated in the heap
   3,378,342,072 bytes copied during GC
 786,333,512 bytes maximum residency (13 sample(s))
  36,933,976 bytes maximum slop
1434 MB total memory in use (0 MB lost due to fragmentation)

 Tot time (elapsed)  Avg pause  Max pause
  Gen  0 26989 colls, 0 par1.423s   1.483s 0.0001s0.0039s
  Gen  113 colls, 0 par1.005s   1.499s 0.1153s0.6730s

  INITtime0.000s  (  0.003s elapsed)
  MUT time3.195s  (  3.193s elapsed)
  GC  time2.428s  (  2.982s elapsed)
  EXITtime0.016s  (  0.138s elapsed)
  Total   time5.642s  (  6.315s elapsed)

  %GC time  43.0%  (47.2% elapsed)

  Alloc rate4,437,740,019 bytes per MUT second

  Productivity  57.0% of total user, 50.9% of total elapsed

___
Glasgow-haskell-users mailing list
Glasgow-haskell-users@haskell.org
http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users


RE: GHC Performance / Replacement for R?

2016-08-25 Thread Simon Peyton Jones via Glasgow-haskell-users
Sounds bad.  But it'll need someone with bytestring expertise to debug.  Maybe 
there's a GHC problem underlying; or maybe it's shortcoming of bytestring.  

Simon

|  -Original Message-
|  From: Glasgow-haskell-users [mailto:glasgow-haskell-users-
|  boun...@haskell.org] On Behalf Of Dominic Steinitz
|  Sent: 25 August 2016 10:11
|  To: GHC users 
|  Subject: GHC Performance / Replacement for R?
|  
|  I am trying to use Haskell as a replacement for R but running into two
|  problems which I describe below. Are there any plans to address the
|  performance issues I have encountered?
|  
|   1. I seem to have to jump through a lot of hoops just to be able to
|  select the data I am interested in.
|  
|  {-# LANGUAGE ScopedTypeVariables #-}
|  
|  {-# OPTIONS_GHC -Wall #-}
|  
|  import Data.Csv hiding ( decodeByName )
|  import qualified Data.Vector as V
|  
|  import Data.ByteString ( ByteString )
|  import qualified Data.ByteString.Char8 as B
|  
|  import qualified Pipes.Prelude as P
|  import qualified Pipes.ByteString as Bytes import Pipes import
|  qualified Pipes.Csv as Csv import System.IO
|  
|  import qualified Control.Foldl as L
|  
|  main :: IO ()
|  main = withFile "examples/787338586_T_ONTIME.csv" ReadMode $ \h -> do
|let csvs :: Producer (V.Vector ByteString) IO ()
|csvs = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat
|uvectors :: Producer (V.Vector ByteString) IO ()
|uvectors = csvs  >-> P.map (V.foldr V.cons V.empty)
|vec_vec <- L.impurely P.foldM  L.vector uvectors
|print $ (vec_vec :: V.Vector (V.Vector ByteString)) V.! 17
|print $ V.length vec_vec
|let rockspring = V.filter (\x -> x V.! 8 == B.pack "RKS") vec_vec
|print $ V.length rockspring
|  
|  Here's the equivalent R:
|  
|  df <- read.csv("787338586_T_ONTIME.csv")
|  rockspring <- df[df$ORIGIN == "RKS",]
|  
|   2. Now I think I could improve the above to make an environment that
|  is more similar to the one my colleagues are used to in R but more
|  problematical is the memory usage.
|  
|   * 112.5M file
|   * Just loading the source into ghci takes 142.7M
|   * > foo <- readFile "examples/787338586_T_ONTIME.csv" > length foo
| takes me up to 4.75G. But we probably don't want to do this!
|   * Let's try again.
|   * > :set -XScopedTypeVariables
|   * > h <- openFile "examples/787338586_T_ONTIME.csv" ReadMode
|   * > let csvs :: Producer (V.Vector ByteString) IO () = Csv.decode
|  HasHeader (Bytes.fromHandle h) >-> P.concat
|   * > let uvectors :: Producer (V.Vector ByteString) IO () = csvs  >->
|  P.map (V.map id) >-> P.map (V.foldr V.cons V.empty)
|   * > vec_vec :: V.Vector (V.Vector ByteString) <- L.impurely P.foldM
|  L.vector uvectors
|   * Now I am up at 3.17G. In R I am under 221.3M.
|   * > V.length rockspring takes a long time to return 155 and now I am
| at 3.5G!!! In R > rockspring <- df[df$ORIGIN == "RKS",] seems
| instantaneous and now uses only 379.5M.
|   * > length(rockspring) 37 > length(df$ORIGIN) 471949 i.e. there are
| 37 columns and 471,949 rows.
|  
|  Running this as an executable gives
|  
|  ~/Dropbox/Private/labels $ ./examples/BugReport +RTS -s ["2014-01-
|  01","EV","20366","N904EV","2512","10747","1074702","30747",
|   "BRO","Brownsville, TX","Texas","11298","1129803","30194",
|"DFW","Dallas/Fort Worth, TX","Texas","0720","0718",
|"-2.00","8.00","0726","0837","7.00","0855","0844","-11.00","0.00",
|"","0.00","482.00","","","","","",""]
|  471949
|  155
|14,179,764,240 bytes allocated in the heap
| 3,378,342,072 bytes copied during GC
|   786,333,512 bytes maximum residency (13 sample(s))
|36,933,976 bytes maximum slop
|  1434 MB total memory in use (0 MB lost due to
|  fragmentation)
|  
|   Tot time (elapsed)  Avg pause
|  Max pause
|Gen  0 26989 colls, 0 par1.423s   1.483s 0.0001s
|  0.0039s
|Gen  113 colls, 0 par1.005s   1.499s 0.1153s
|  0.6730s
|  
|INITtime0.000s  (  0.003s elapsed)
|MUT time3.195s  (  3.193s elapsed)
|GC  time2.428s  (  2.982s elapsed)
|EXITtime0.016s  (  0.138s elapsed)
|Total   time5.642s  (  6.315s elapsed)
|  
|%GC time  43.0%  (47.2% elapsed)
|  
|Alloc rate4,437,740,019 bytes per MUT second
|  
|Productivity  57.0% of total user, 50.9% of total elapsed
|  
|  ___
|  Glasgow-haskell-users mailing list
|  Glasgow-haskell-users@haskell.org
|  https://na01.safelinks.protection.outlook.com/?url=http%3a%2f%2fmail.h
|  askell.org%2fcgi-bin%2fmailman%2flistinfo%2fglasgow-haskell-
|  users&data=01%7c01%7csimonpj%40microsoft.com%7c5017a5fe26cb4df9c41d08d
|  3ccc7b5bd%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=2Ku1Fr5QttHRoj5
|  NSOJREZrt2Fsqhi63iJOpxmku68E%3d
___
Glasgow-haskell-users mailing list
Gl