jeff p wrote:

> Hello,
>
>> Yet, I'm a bit astonished. I thought that when compiling with -O2,
>> cosmetic changes should become negligible. Perhaps the strict foldl'
>> has an effect?
>>
> Perhaps... but I doubt that is the main reason. At the moment I have
> no idea why there is such a discrepancy between the heap usages...
>
> A big part of why the solutions you crafted work so efficiently is
> that they take advantage of the fact that the rows will be written out
> exactly as they are read in. I wanted to see if a more general code
> could maintain the same efficiency. Here is some code to read in a
> file, write out a file, and do selections -- the idea is that CSV files
> are internally represented and manipulated as [[ByteString]].
>
> readCSV file = do
>     v <- B.readFile file
>     return $ map (B.split ',') $ B.lines v
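For anyone who wants to try this at home, here is a self-contained version of the reading step. The pure parseCSV helper, the main wrapper, and the "test.csv" file name are my additions, not from the thread; note that splitting on every comma means there is no support for quoted fields.

```haskell
import qualified Data.ByteString.Char8 as B

-- Pure part of readCSV: split the file contents into lines, then
-- split each line on commas. (No quoting/escaping support.)
parseCSV :: B.ByteString -> [[B.ByteString]]
parseCSV = map (B.split ',') . B.lines

-- Read a CSV file into the [[ByteString]] representation.
readCSV :: FilePath -> IO [[B.ByteString]]
readCSV file = do
    v <- B.readFile file
    return (parseCSV v)

main :: IO ()
main = do
    B.writeFile "test.csv" (B.pack "a,b\n1,2\n")
    tbl <- readCSV "test.csv"
    print (map (map B.unpack) tbl)  -- [["a","b"],["1","2"]]
```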
Good, writeCSV writes out every row immediately after it gets it. I
eliminated (++ [nl]) in the hope of reducing the constant factor
slightly. Using difference lists for that is nicer, but here you go:

> writeCSV file tbl = do
>     h <- openFile file WriteMode
>     mapM_ (writeRow h) tbl
>     hClose h
>     where
>     comma = B.singleton ','
>     nl    = B.singleton '\n'
>     writeRow h row =
>         mapM_ (B.hPut h) (intersperse comma row) >> B.hPut h nl

Concerning select, one myFilter can be fused away, and there is the
"transpose trick" for filtering out the columns: the columns get
filtered once and for all, so (map (`elem` targs)) only needs to be
computed once. I don't know why the MonadReader was necessary, so I
removed it.

> select targs test (cols : rows) = cols : filterCols (filterRows rows)
>     where
>     filterRows = filter (test cols)
>     myFilter   = map snd . filter fst
>     filterCols = transpose . myFilter . zip colflags . transpose
>     colflags   = map (`elem` targs) cols

Concerning col, one should share the index i across different rows. The
compiler is unlikely to do a full laziness transformation on its own,
as this bears the danger of introducing space leaks (out of the coder's
control, that is).

> col x cols = \row -> row !! i
>     where
>     Just i = lookup (B.pack x) $ zip cols [0..]

A possible test is then something like

> test cols = (== B.pack "test") . col "COL" cols

> This code runs reasonably fast -- around 13 seconds to read in a 120MB
> file (~750000 rows), select half the columns of around 22000 rows
> randomly distributed throughout the input table, and write a new CSV
> file. It takes around 90 seconds to just remove some columns from
> every row in the table and write a new file. So the slow part of the
> program is probably the writeCSV function. Do you think these times
> can be improved upon?

I hope so... though the 13 seconds are disproportionately high (only
22000 rows to be written) compared to the 90 seconds (750000 rows to be
written).
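To make the select machinery concrete, here is a self-contained sketch with a tiny in-memory table. The sample data and main are mine, not from the thread, and I also run the header row through filterCols so that the output header matches the selected columns (a small departure from the snippet above, which leaves the header untouched).

```haskell
import qualified Data.ByteString.Char8 as B
import Data.List (transpose)

-- Look up a named column once per table; the returned closure shares
-- the index i across all rows.
col :: String -> [B.ByteString] -> [B.ByteString] -> B.ByteString
col x cols = \row -> row !! i
    where
    Just i = lookup (B.pack x) (zip cols [0..])

-- Keep the columns named in targs and the rows satisfying test.
-- The "transpose trick": flip the table so columns become rows,
-- filter those with the precomputed colflags, and flip back.
select :: [B.ByteString]
       -> ([B.ByteString] -> [B.ByteString] -> Bool)
       -> [[B.ByteString]] -> [[B.ByteString]]
select targs test (cols : rows) = filterCols (cols : filterRows rows)
    where
    filterRows = filter (test cols)
    myFilter   = map snd . filter fst
    filterCols = transpose . myFilter . zip colflags . transpose
    colflags   = map (`elem` targs) cols

main :: IO ()
main = do
    let tbl     = map (map B.pack) [["a","b"],["1","x"],["2","y"]]
        test cs = (== B.pack "x") . col "b" cs
    -- keep column "a", and only the rows whose "b" field is "x"
    print (map (map B.unpack) (select [B.pack "a"] test tbl))
    -- [["a"],["1"]]
```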
Regards,
apfelmus

_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe