Hi Felipe,

Thank you for the very detailed explanation and help. Regarding the first point, for this particular use case it's fine if the user-specified file size is exceeded by the length of a partial line. It's a compact CSV file, so if the user breaks a big file into 100 MB chunks, each chunk would only ever be about 100 MB plus up to 80 bytes, which is fine for the user.
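For illustration only, here is a minimal sketch of the "requested size plus the rest of the current line" behaviour described above. It is not the enumerator-based program under discussion: it assumes plain lazy ByteStrings from the bytestring package, and the names splitChunk and splitChunks are hypothetical.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Int (Int64)

-- Hypothetical helper: take n bytes, then keep going up to and including
-- the next newline, so a chunk never ends in the middle of a line.
splitChunk :: Int64 -> L.ByteString -> (L.ByteString, L.ByteString)
splitChunk n bs =
  let (body, rest)        = L.splitAt n bs
      (lineTail, rest')   = L.break (== '\n') rest
      (newline, leftover) = L.splitAt 1 rest'  -- empty at end of input
  in  (L.concat [body, lineTail, newline], leftover)

-- Hypothetical helper: repeat splitChunk until the input is exhausted.
splitChunks :: Int64 -> L.ByteString -> [L.ByteString]
splitChunks n bs
  | L.null bs = []
  | otherwise = let (chunk, rest) = splitChunk n bs
                in  chunk : splitChunks n rest
```

Under these assumptions each chunk is at least n bytes (except possibly the last) and overshoots n by at most one line. If the cut at n happens to fall exactly on a line boundary, this simple version still appends the whole next line.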
I'm intrigued by the idea of making the bulk copy function with EB.isolate and EB.iterHandle, but I couldn't find a way to fit these into the larger context of writing to multiple file handles. I'll keep working on it and see if I can address the concerns you brought up.

Thanks again!
Eric

On Fri, Jul 22, 2011 at 6:00 PM, Felipe Almeida Lessa <felipe.le...@gmail.com> wrote:
> There is one problem with your algorithm. If the user asks for 4 GiB,
> then the program will create files with *at least* 4 GiB. So the user
> would need to ask for less, maybe 3.9 GiB. Even so there's some
> danger, because there could be a 0.11 GiB line in the file.
>
> Now, the biggest problem: your code won't run in constant memory.
> 'EB.take' does not lazily return a lazy ByteString. It strictly
> returns a lazy ByteString [1]. The lazy ByteString is used to avoid
> copying data (as it is basically the same as a linked list of strict
> bytestrings). So if the user asked for 4 GiB files, this program
> would need at least 4 GiB of memory, probably more due to overheads.
>
> If you want to use lazy lazy ByteStrings (lazy ByteStrings with lazy
> I/O, as opposed to lazy ByteStrings with strict I/O), the enumerator
> package doesn't really buy you anything. You should just use the
> bytestring package's lazy I/O functions.
>
> If you want the guarantee of no leaks that enumerator gives, then you
> have to use another way of constructing your program. One safe way of
> doing it is something like:
>
>   takeNextLine :: Monad m
>                => E.Iteratee B.ByteString m (Maybe L.ByteString)
>   takeNextLine = ...
>
>   go :: MonadIO m => Handle -> Int64
>      -> E.Iteratee B.ByteString m (Maybe L.ByteString)
>   go h n = do
>     mline <- takeNextLine
>     case mline of
>       Nothing -> return Nothing
>       Just line
>         -- liftIO (Control.Monad.IO.Class) runs the write in the iteratee
>         | L.length line <= n -> liftIO (L.hPut h line)
>                                   >> go h (n - L.length line)
>         | otherwise          -> return mline
>
> So 'go h n' is the iteratee that saves at most 'n' bytes in handle 'h'
> and returns the leftover data. The driver code needs to check its
> results. Case 'Nothing', then the program finishes. Case 'Just
> line', save line on a new file and call 'go h2 (n - L.length line)'.
> It isn't efficient because lines could be small, resulting in many
> small hPuts (bad). But it is correct and will never use more than 'n'
> bytes (great). You could also have some compromise where the user
> says that he'll never have lines longer than 'x' bytes (say, 1 MiB).
> Then you call a bulk copy function for 'n - x' bytes, and then call
> 'go h x'. I think you can make the bulk copy function with EB.isolate
> and EB.iterHandle.
>
> Cheers, =)
>
> [1] http://hackage.haskell.org/packages/archive/enumerator/0.4.13.1/doc/html/src/Data-Enumerator-Binary.html#take
>
> --
> Felipe.
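
As a rough illustration of the line-by-line scheme in the quoted message, the logic of 'go' and its driver can be modelled as pure functions over a list of lines, with no enumerator and no I/O. This is only a sketch of the bookkeeping, assuming each line carries its trailing newline; fill and packLines are hypothetical names, and each inner list stands for the contents of one output file.

```haskell
import qualified Data.ByteString.Lazy.Char8 as L
import Data.Int (Int64)

-- Hypothetical pure stand-in for 'go': accept lines while they fit in
-- the remaining budget n; return the accepted lines and the leftovers.
fill :: Int64 -> [L.ByteString] -> ([L.ByteString], [L.ByteString])
fill _ [] = ([], [])
fill n (l:ls)
  | L.length l <= n = let (taken, rest) = fill (n - L.length l) ls
                      in  (l : taken, rest)
  | otherwise       = ([], l : ls)

-- Hypothetical pure stand-in for the driver: the first pending line
-- always opens a new file, then 'fill' runs with the reduced budget.
packLines :: Int64 -> [L.ByteString] -> [[L.ByteString]]
packLines _ [] = []
packLines n (l:ls) =
  let (taken, rest) = fill (n - L.length l) ls
  in  (l : taken) : packLines n rest
```

In this model a file stays within n bytes as long as no single line is longer than n; a longer line ends up alone in its own file. The real program would replace the lists with handles and writes, which is where the cost of many small hPuts mentioned above comes in.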
_______________________________________________
Haskell-Cafe mailing list
Haskell-Cafe@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe