I need to remove newlines from csv files (within columns, not at the end of entire lines). This is prior to importing into a database and was being done at my workplace by a java class for quite a while until the files processed got bigger and it proved to be too slow. (The files are up to ~250MB at the moment) It was rewritten in PL/SQL, to run after the import, which was an improvement, but it still has our creaky db server thrashing away. (You may have lots of helpful suggestions in mind, but we can't clean the data at source and AFAIK we can't do it incrementally because there is no timestamp or anything on the last change to a row from the legacy db.)
We don't need a general solution - if a line ends with a delimiter we can be sure it's the end of the entire line because that's the way the csv files are generated. I had a quick go with ByteString (with no attempt at robustness etc) and although I haven't compared it properly it seems faster than what we have now. But you can easily make it faster, surely! Hints for improvement please (e.g. can I unbox anything, make anything strict, or is that handled by ByteString, is there a more efficient library function to replace the fold...?). module Main where import System.Environment (getArgs) import qualified Data.ByteString.Char8 as B --remove newlines in the middle of 'columns' clean :: Char -> [B.ByteString] -> [B.ByteString] clean d = foldr (\x ys -> if B.null x || B.last x == d then x:ys else (B.append x $ head ys):(tail ys)) [] main = do args <- getArgs if length args < 2 then putStrLn "Usage: crunchFile INFILE OUTFILE [DELIM]" else do bs <- B.readFile (args!!0) let d = if length args == 3 then head (args!!2) else '"' B.writeFile (args!!1) $ (B.unlines . clean d . B.lines) bs Thanks, Jim -- View this message in context: http://www.nabble.com/Sneaking-haskell-in-the-workplace----cleaning-csv-files-tf3928931.html#a11142895 Sent from the Haskell - Haskell-Cafe mailing list archive at Nabble.com. _______________________________________________ Haskell-Cafe mailing list Haskell-Cafe@haskell.org http://www.haskell.org/mailman/listinfo/haskell-cafe