Re: Space usage of lazy seqs
On Dec 2, 9:59 pm, Johann Hibschman joha...@gmail.com wrote:
> On Dec 2, 9:09 pm, David Brown cloj...@davidb.org wrote:
> > You can tune the max with -Xmx1G for example, to limit it to one GB.
>
> That's a good idea; then I'll know for sure if it's keeping a handle to the entire file.

Ok, that's a relief. First of all, -Xmx1G isn't legal, at least for java 1.6; I had to specify -Xmx1024m. Second, once I did that, the memory use of the obvious parallel version, (reduce + (pmap ...)), remained within reason. Clojure is good, everything is happy, fuzzy bunnies and kittens frolic with abandon.

So, all of this is a lot of hot air over nothing. Thanks for pointing me in the right direction.

Cheers,
Johann

--
You received this message because you are subscribed to the Google Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: Space usage of lazy seqs
On Thu, Dec 3, 2009 at 9:13 AM, Johann Hibschman joha...@gmail.com wrote:
> On Dec 2, 9:59 pm, Johann Hibschman joha...@gmail.com wrote:
> > On Dec 2, 9:09 pm, David Brown cloj...@davidb.org wrote:
> > > You can tune the max with -Xmx1G for example, to limit it to one GB.
> >
> > That's a good idea; then I'll know for sure if it's keeping a handle to the entire file.
>
> Ok, that's a relief. First of all, -Xmx1G isn't legal, at least for java 1.6; I had to specify -Xmx1024m. Second, once I did that, the memory use of the obvious parallel version, (reduce + (pmap ...)), remained within reason. Clojure is good, everything is happy, fuzzy bunnies and kittens frolic with abandon.
>
> So, all of this is a lot of hot air over nothing. Thanks for pointing me in the right direction.
>
> Cheers,
> Johann

Another magic command-line flag you could try playing with is -XX:+DoEscapeAnalysis.

It's hard to say whether it will help you or not, because Sun has been switching its default between on and off, seemingly at random, across the recent JVM updates. If it is not on by default on the JVM you're running, though, I've found it makes a pretty hefty difference for a lot of Clojure code; it's quite good at reducing the number of useless allocations Clojure needs to do.

--Aaron
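For anyone trying the flags above, here is a minimal sketch (not from the thread) of checking from the REPL which arguments the JVM was actually started with. The helper names jvm-flags and flag-set? are made up for illustration; the calls underneath are the standard java.lang.management API. Note that it only reports flags passed explicitly on the command line, so it cannot tell you whether escape analysis happens to be on by default in your JVM build.

```clojure
(import '(java.lang.management ManagementFactory))

(defn jvm-flags
  "Returns the JVM's startup arguments as a vector of strings.
  (Hypothetical helper name, not part of Clojure.)"
  []
  (vec (.getInputArguments (ManagementFactory/getRuntimeMXBean))))

(defn flag-set?
  "True if exactly this flag string was passed when the JVM was launched."
  [flag]
  (boolean (some #(= flag %) (jvm-flags))))

;; Show everything that was passed, then check the two flags from the thread.
(println (jvm-flags))
(println "Escape analysis requested:" (flag-set? "-XX:+DoEscapeAnalysis"))
(println "1024m heap cap requested: " (flag-set? "-Xmx1024m"))
```

If both lines print false, the flags never reached the JVM, which is worth ruling out before comparing timings.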
Re: Space usage of lazy seqs
On Wed, Dec 02, 2009 at 08:18:33PM -0800, Dave M wrote:
> On Dec 2, 9:09 pm, David Brown cloj...@davidb.org wrote:
> > ...
> > If you're running JDK 6, you can run the virtualvm, or jconsole to get a better handle on the memory usage, and even dig into what it might used for.
>
> Google does not return useful references to a tool called virtualvm; perhaps you mean VisualVM (jvisualvm)?

Yes, that is indeed what I meant to type.

David
Space usage of lazy seqs
I don't understand Clojure's space requirements when processing lazy sequences. Are there some rules of thumb that I could use to better predict what will use a lot of space?

I have a 5.5 GB pipe-delimited data file, containing mostly floats (14 M rows of 40 cols). I'd like to stream over that file, processing columns as I go, without holding the whole thing in RAM. As a first test, I'm trying to just split each row and count the total number of fields.

Why does reduce seem to load in the whole file, yet test-split-4 not? Why does the if-let in test-split-3 vs test-split-3b make such a difference? And finally, is there any way I can parallelize this to use multiple cores without slurping in the whole file?

If it matters, I'm using a snapshot of 1.1.0-alpha; the jar included with incanter.

Here's the code:

;; duck-streams comes from clojure.contrib.
(require '[clojure.contrib.duck-streams :as duck-streams])

(def afile "/path/to/big/file")

;; Count the lines in the file.
;; 12.8 s, light memory use (0.8 GB).
(defn test-count []
  (with-open [rdr (duck-streams/reader afile)]
    (count (line-seq rdr))))

;; Split and count.
;; 183.2 s, heavy memory use (8.6 GB).
(defn test-split []
  (with-open [rdr (duck-streams/reader afile)]
    (reduce + (map #(count (.split %1 "\\|"))
                   (line-seq rdr)))))

;; 190.8 s, heavy memory use (8.8 GB).
(defn test-split-2 []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [counts (seq (map #(count (.split %1 "\\|"))
                            (line-seq rdr)))
           cnt 0]
      (if counts
        (recur (next counts) (+ cnt (first counts)))
        cnt))))

;; Use rest instead, if-let (following http://clojure.org/lazy).
;; 166.1 s, light memory use (1.4 GB).
(defn test-split-3 []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [counts (map #(count (.split %1 "\\|"))
                       (line-seq rdr))
           cnt 0]
      (if-let [s (seq counts)]
        (recur (rest s) (+ cnt (first s)))
        cnt))))

;; Try without the if-let.
;; 211.6 s, heavy memory use (8.7 GB). Surprise!
(defn test-split-3b []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [counts (map #(count (.split %1 "\\|"))
                       (line-seq rdr))
           cnt 0]
      (if (seq counts)
        (recur (rest counts) (+ cnt (first counts)))
        cnt))))

;; 160 s, light memory use (1.5 GB).
(defn test-split-4 []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [lines (line-seq rdr)
           cnt 0]
      (if lines
        (recur (next lines)
               (+ cnt (count (.split (first lines) "\\|"))))
        cnt))))

;; Parallel split and count.
;; Based on test-split-3, but using pmap.
;; 95.1 s, heavy memory use (8.7 GB).
(defn test-psplit-1 []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [counts (pmap #(count (.split %1 "\\|"))
                        (line-seq rdr))
           cnt 0]
      (if-let [s (seq counts)]
        (recur (rest s) (+ cnt (first s)))
        cnt))))
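On the question of parallelizing without slurping the whole file: the following is only a sketch, not code from the thread. It assumes a current Clojure release and uses clojure.java.io/reader in place of duck-streams; the function names and the chunk-size parameter are invented for illustration. The idea is to hand pmap whole chunks of lines rather than single lines, so each task does a useful amount of work, while pmap itself only runs a bounded number of tasks ahead of the consumer.

```clojure
(require '[clojure.java.io :as io])

(defn count-fields
  "Number of pipe-delimited fields in one line."
  [^String line]
  (count (.split line "\\|")))

(defn total-fields
  "Sequential version: sums field counts as lines are read."
  [path]
  (with-open [rdr (io/reader path)]
    (reduce + (map count-fields (line-seq rdr)))))

(defn total-fields-parallel
  "Parallel version: each pmap task sums one chunk of lines.
  chunk-size is a hypothetical tuning knob, not something from the thread."
  [path chunk-size]
  (with-open [rdr (io/reader path)]
    (reduce +
            (pmap #(reduce + (map count-fields %))
                  (partition-all chunk-size (line-seq rdr))))))

;; Tiny demo on a temporary file with 3 rows of 4 fields each.
(let [f (java.io.File/createTempFile "fields" ".txt")]
  (spit f "1.0|2.0|3.0|4.0\n5.0|6.0|7.0|8.0\n9.0|1.5|2.5|3.5\n")
  (println (total-fields f) (total-fields-parallel f 2)) ; prints 12 12
  (.delete f))
```

Both reductions finish inside with-open, so the reader is still open while the lazy seq is consumed. How much heap either version touches on a real 5.5 GB file still depends on the JVM's heap settings, so it is worth running with an explicit -Xmx cap to see.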
Re: Space usage of lazy seqs
On Dec 2, 10:50 am, Johann Hibschman joha...@gmail.com wrote:
> I don't understand Clojure's space requirements when processing lazy sequences. Are there some rules-of-thumb that I could use to better predict what will use a lot of space?
> ...

After looking over the code, I'm inclined to not trust those numbers. You're generating a lot of intermediate String instances, which is where the memory is likely going. My guess is the wildly varying memory numbers are due to the GC kicking in (note the times are all in the same ballpark).
Re: Space usage of lazy seqs
On Dec 2, 10:50 am, Johann Hibschman joha...@gmail.com wrote:
> I don't understand Clojure's space requirements when processing lazy sequences. Are there some rules-of-thumb that I could use to better predict what will use a lot of space?
> ...

After reading the code, I'm inclined to not trust those numbers. Note that the time metrics for test-split* are all in the same ballpark, and each version creates the same number of superfluous, intermediate String instances, but the memory numbers you list are wildly different.

How are you collecting these numbers? Have you controlled for the GC kicking in?
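One rough way to control for the GC when collecting numbers like these is to request a collection before reading the heap counters. The sketch below is not from the thread; used-heap-mb and measure are hypothetical helper names built on the standard java.lang.Runtime API. System/gc is only a hint to the JVM, and this reports what is still reachable after the call returns, not the peak during the run, so treat the figures as approximate. For peak usage, a monitoring tool such as jconsole is the better instrument.

```clojure
(defn used-heap-mb
  "Requests a GC, then returns used heap (total minus free) in megabytes.
  System/gc is only a request, so the result is approximate."
  []
  (System/gc)
  (let [rt (Runtime/getRuntime)]
    (quot (- (.totalMemory rt) (.freeMemory rt))
          (* 1024 1024))))

(defn measure
  "Calls the zero-argument function f and returns its result together
  with the change in used heap, measured after a requested GC each time."
  [f]
  (let [before (used-heap-mb)
        result (f)
        after  (used-heap-mb)]
    {:result result
     :retained-mb (- after before)}))

;; Example: a reduction that should retain essentially nothing afterwards.
(println (measure #(reduce + (range 1000000))))
```

Comparing :retained-mb across the test-split variants would show whether any of them leaves the lines reachable, as opposed to the process simply having grown its heap with garbage.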
Re: Space usage of lazy seqs
On Wed, Dec 02, 2009 at 02:01:36PM -0800, Johann Hibschman wrote:
> There is a qualitative difference between the runs, though. I can run test-split-3 five times in a row, all with similar times, without having the java process size get bigger than 0.6 GB. When I run any of the others, the size quickly balloons up to something more like 8.5 GB.

How much memory do you have on your machine? A recent Sun JVM on a machine with a bunch of memory will consider it to be a server machine. It will set the heap max to 1/4 of total physical memory (which suggests you might have 16GB of RAM).

You can tune the max with -Xmx1G for example, to limit it to one GB.

The actual interaction with the GC can be hard to predict, and Sun's GC seems to like to sometimes use as much memory as it has been given.

If you're running JDK 6, you can run the virtualvm, or jconsole to get a better handle on the memory usage, and even dig into what it might be used for.

David
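To see what heap ceiling the JVM actually picked, with or without an -Xmx setting, the limits can be read from inside the running process. This is a small sketch rather than anything posted in the thread; heap-mb is a made-up helper name wrapping the standard java.lang.Runtime methods.

```clojure
(defn heap-mb
  "Returns the JVM's heap limits in whole megabytes:
  :max   the ceiling the heap may grow to (what -Xmx controls),
  :total the heap currently reserved from the OS,
  :free  the unused part of :total."
  []
  (let [rt (Runtime/getRuntime)
        mb #(quot % (* 1024 1024))]
    {:max   (mb (.maxMemory rt))
     :total (mb (.totalMemory rt))
     :free  (mb (.freeMemory rt))}))

(println (heap-mb))
```

If :max comes back as several gigabytes on a large machine, a process that balloons to a similar size may just be the collector using the room it was given, which is the behaviour described above.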
Re: Space usage of lazy seqs
On Dec 2, 9:09 pm, David Brown cloj...@davidb.org wrote:
> ...
> If you're running JDK 6, you can run the virtualvm, or jconsole to get a better handle on the memory usage, and even dig into what it might used for.

Google does not return useful references to a tool called virtualvm; perhaps you mean VisualVM (jvisualvm)?

-Dave
Re: Space usage of lazy seqs
On Dec 2, 9:09 pm, David Brown cloj...@davidb.org wrote:
> How much memory do you have on your machine. A recent Sun JVM on a machine with a bunch of memory will consider it to be a server machine. It will set the heap max to 1/4 of total physical memory (which suggests you might have 16GB of RAM).

I have 96 GB, so I'm not in danger of running out. I just want to understand if I'm using the sequence functions properly, so that I can run a few instances of this, plus some R, etc.

> You can tune the max with -Xmx1G for example, to limit it to one GB.

That's a good idea; then I'll know for sure if it's keeping a handle to the entire file.

> If you're running JDK 6, you can run the virtualvm, or jconsole to get a better handle on the memory usage, and even dig into what it might used for.

Ah, I'd forgotten about jconsole. Well, I'll muddle around and see what I can figure out.

Thanks,
Johann