Re: Space usage of lazy seqs

2009-12-03 Thread Johann Hibschman
On Dec 2, 9:59 pm, Johann Hibschman joha...@gmail.com wrote:
 On Dec 2, 9:09 pm, David Brown cloj...@davidb.org wrote:

  You can tune the max with -Xmx1G for example, to limit it to one GB.

 That's a good idea; then I'll know for sure if it's keeping a handle
 to the entire file.

Ok, that's a relief.

First of all, -Xmx1G isn't legal, at least for java 1.6; I had to
specify -Xmx1024m. Second, once I did that, the memory use of the
obvious parallel version, (reduce + (pmap ...)), remained within
reason. Clojure is good, everything is happy, fuzzy bunnies and
kittens frolic with abandon.
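
For anyone who wants to try the shape of that parallel version without a large file, here is a minimal sketch. The names count-fields, sample-lines and total-fields are made up for illustration and do not come from the original code; the real run fed pmap from (line-seq rdr) rather than from a vector.

```clojure
;; Hypothetical helper: number of pipe-delimited fields in one line.
(defn count-fields [^String line]
  (count (.split line "\\|")))

;; Small in-memory stand-in for (line-seq rdr).
(def sample-lines ["1.5|2.5|3.5|4.5" "5.5|6.5" "7.5|8.5|9.5"])

;; pmap runs count-fields on several threads and stays only a limited
;; distance ahead of the consumer, so reduce can sum results as they arrive.
(defn total-fields [lines]
  (reduce + (pmap count-fields lines)))

;; The parallel and serial versions should agree.
(println (= (total-fields sample-lines)
            (reduce + (map count-fields sample-lines))))
```

When this is run as a script rather than at a REPL, finishing with (shutdown-agents) lets the JVM exit promptly, because pmap uses the same thread pool as future.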

So, all of this is a lot of hot air over nothing. Thanks for pointing
me in the right direction.

Cheers,
Johann

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


Re: Space usage of lazy seqs

2009-12-03 Thread Aaron Cohen
On Thu, Dec 3, 2009 at 9:13 AM, Johann Hibschman joha...@gmail.com wrote:
 On Dec 2, 9:59 pm, Johann Hibschman joha...@gmail.com wrote:
 On Dec 2, 9:09 pm, David Brown cloj...@davidb.org wrote:

  You can tune the max with -Xmx1G for example, to limit it to one GB.

 That's a good idea; then I'll know for sure if it's keeping a handle
 to the entire file.

 Ok, that's a relief.

 First of all, -Xmx1G isn't legal, at least for java 1.6; I had to
 specify -Xmx1024m. Second, once I did that, the memory use of the
 obvious parallel version, (reduce + (pmap ...)), remained within
 reason. Clojure is good, everything is happy, fuzzy bunnies and
 kittens frolic with abandon.

 So, all of this is a lot of hot air over nothing. Thanks for pointing
 me in the right direction.

 Cheers,
 Johann


Another magic command line flag you could try playing with is
-XX:+DoEscapeAnalysis

It's hard to say whether it will help you or not, because Sun has been
switching its default between on and off, seemingly at random, across
the recent JVM updates. If it is not on by default on the JVM you're
running, though, I've found it to make a pretty hefty difference to a
lot of Clojure code; it's quite good at reducing the number of
useless allocations Clojure needs to do.
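
One way to see whether the flag was actually passed to the running JVM is to inspect its startup arguments from the REPL. This is only a sketch: jvm-args and flag-passed? are illustrative names, and the check detects an explicitly supplied flag, not the JVM's built-in default.

```clojure
(import 'java.lang.management.ManagementFactory)

;; Arguments the running JVM was started with, e.g. "-Xmx1024m".
(defn jvm-args []
  (vec (.getInputArguments (ManagementFactory/getRuntimeMXBean))))

;; True if the given flag was passed explicitly on the command line.
;; This says nothing about what the JVM does when the flag is absent.
(defn flag-passed? [flag]
  (boolean (some #(= flag %) (jvm-args))))

(println (flag-passed? "-XX:+DoEscapeAnalysis"))
```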

--Aaron



Re: Space usage of lazy seqs

2009-12-03 Thread David Brown
On Wed, Dec 02, 2009 at 08:18:33PM -0800, Dave M wrote:

On Dec 2, 9:09 pm, David Brown cloj...@davidb.org wrote:
...
 If you're running JDK 6, you can run the virtualvm, or jconsole to get
 a better handle on the memory usage, and even dig into what it might
 be used for.

Google does not return useful references to a tool called virtualvm;
perhaps you mean VisualVM (jvisualvm)?

Yes, that is indeed what I meant to type.

David



Space usage of lazy seqs

2009-12-02 Thread Johann Hibschman
I don't understand Clojure's space requirements when processing lazy
sequences. Are there some rules-of-thumb that I could use to better
predict what will use a lot of space?

I have a 5.5 GB pipe-delimited data file, containing mostly floats (14
M rows of 40 cols). I'd like to stream over that file, processing
columns as I go, without holding the whole thing in RAM. As a first
test, I'm trying to just split each row and count the total number of
fields.

Why does reduce seem to load in the whole file, yet test-split-4 not?
Why does the if-let in test-split-3 vs test-split-3b make such a
difference? And finally, is there any way I can parallelize this to
use multiple cores without slurping in the whole file?

If it matters, I'm using a snapshot of 1.1.0-alpha; the jar included
with incanter.

Here's the code:

;; Assumed alias for clojure.contrib.duck-streams; the require form
;; was not shown in the original post.
(require '[clojure.contrib.duck-streams :as duck-streams])

(def afile "/path/to/big/file")

;; Count the lines in the file.
;; 12.8 s, light memory use (0.8 GB).
(defn test-count []
  (with-open [rdr (duck-streams/reader afile)]
    (count (line-seq rdr))))

;; Split and count.
;; 183.2 s, heavy memory use (8.6 GB).
(defn test-split []
  (with-open [rdr (duck-streams/reader afile)]
    (reduce + (map #(count (.split %1 "\\|")) (line-seq rdr)))))

;; 190.8 s, heavy memory use (8.8 GB).
(defn test-split-2 []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [counts (seq (map #(count (.split %1 "\\|")) (line-seq rdr)))
           cnt 0]
      (if counts
        (recur (next counts) (+ cnt (first counts)))
        cnt))))

;; Use rest instead, if-let (following http://clojure.org/lazy.)
;; 166.1 s, light memory use (1.4 GB)
(defn test-split-3 []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [counts (map #(count (.split %1 "\\|")) (line-seq rdr))
           cnt 0]
      (if-let [s (seq counts)]
        (recur (rest s) (+ cnt (first s)))
        cnt))))

;; Try without the if-let.
;; 211.6 s, heavy memory use (8.7 GB). Surprise!
(defn test-split-3b []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [counts (map #(count (.split %1 "\\|")) (line-seq rdr))
           cnt 0]
      (if (seq counts)
        (recur (rest counts) (+ cnt (first counts)))
        cnt))))

;; 160 s, light memory use. (1.5 GB)
(defn test-split-4 []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [lines (line-seq rdr)
           cnt 0]
      (if lines
        (recur (next lines)
               (+ cnt (count (.split (first lines) "\\|"))))
        cnt))))

;; Parallel split and count.
;; Based on test-split-3, but using pmap.
;; 95.1 s, heavy memory use (8.7 GB)
(defn test-psplit-1 []
  (with-open [rdr (duck-streams/reader afile)]
    (loop [counts (pmap #(count (.split %1 "\\|")) (line-seq rdr))
           cnt 0]
      (if-let [s (seq counts)]
        (recur (rest s) (+ cnt (first s)))
        cnt))))
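
To try the counting logic without a multi-gigabyte file, the same reduce form can be pointed at an in-memory reader. This is a minimal sketch: count-fields-in and the sample text are invented for illustration, and java.io.StringReader stands in for duck-streams/reader.

```clojure
(import '(java.io BufferedReader StringReader))

;; Hypothetical helper: total number of pipe-delimited fields readable
;; from rdr. line-seq needs a BufferedReader; reduce consumes the lines
;; one at a time.
(defn count-fields-in [^BufferedReader rdr]
  (reduce + (map #(count (.split ^String % "\\|")) (line-seq rdr))))

;; Three lines with 3, 2 and 1 fields respectively.
(with-open [rdr (BufferedReader. (StringReader. "a|b|c\nd|e\nf\n"))]
  (println (count-fields-in rdr)))
```

Taking the reader as an argument keeps the with-open at the call site, so the same function can be run against a file or a test string.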



Re: Space usage of lazy seqs

2009-12-02 Thread ataggart


On Dec 2, 10:50 am, Johann Hibschman joha...@gmail.com wrote:
 I don't understand Clojure's space requirements when processing lazy
 sequences. Are there some rules-of-thumb that I could use to better
 predict what will use a lot of space?

After looking over the code, I'm inclined to not trust those numbers.
You're generating a lot of intermediate String instances, which is
where the memory is likely going.  My guess is the wildly varying
memory numbers are due to the GC kicking in (note the times are all in
the same ballpark).



Re: Space usage of lazy seqs

2009-12-02 Thread ataggart


On Dec 2, 10:50 am, Johann Hibschman joha...@gmail.com wrote:
 I don't understand Clojure's space requirements when processing lazy
 sequences. Are there some rules-of-thumb that I could use to better
 predict what will use a lot of space?

After reading the code, I'm inclined to not trust those numbers.  Note
that the time metrics for test-split* are all in the same ballpark,
creating the same number of superfluous, intermediate String
instances, but the memory numbers you list are wildly different.  How
are you collecting these numbers?  Have you controlled for the GC
kicking in?
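
One rough way to control for this from the REPL is to request a collection before and after a run and compare the heap still in use. This is a sketch under assumptions: used-heap-after-gc and retained-mb are illustrative names, and System/gc is only a hint to the JVM, so the figures are approximate.

```clojure
;; Heap bytes currently in use, measured after requesting a collection.
(defn used-heap-after-gc []
  (System/gc)
  (let [rt (Runtime/getRuntime)]
    (- (.totalMemory rt) (.freeMemory rt))))

;; Hypothetical wrapper: run f, then report how much heap is still in
;; use afterwards, relative to before, in megabytes.
(defn retained-mb [f]
  (let [before (used-heap-after-gc)
        result (f)
        after  (used-heap-after-gc)]
    {:result result
     :retained-mb (/ (- after before) 1048576.0)}))

(println (retained-mb #(reduce + (range 1000000))))
```

This reports what is still live after the run, not the peak during it; a peak figure needs a monitoring tool such as jconsole.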



Re: Space usage of lazy seqs

2009-12-02 Thread David Brown
On Wed, Dec 02, 2009 at 02:01:36PM -0800, Johann Hibschman wrote:

There is a qualitative difference between the runs, though. I can run
test-split-3 five times in a row, all with similar times, without
having the java process size get bigger than 0.6 GB. When I run any of
the others, the size quickly balloons up to something more like 8.5
GB.

How much memory do you have on your machine?  A recent Sun JVM on a
machine with a bunch of memory will consider it to be a server
machine.  It will set the heap max to 1/4 of total physical memory
(which suggests you might have 16GB of RAM).

You can tune the max with -Xmx1G for example, to limit it to one GB.
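
To confirm which ceiling is actually in effect, the running JVM can be asked from the REPL. A minimal sketch; max-heap-mb is an illustrative name.

```clojure
;; Maximum heap the JVM will attempt to use, in megabytes.
(defn max-heap-mb []
  (quot (.maxMemory (Runtime/getRuntime)) (* 1024 1024)))

(println "max heap (MB):" (max-heap-mb))
```

The reported value can come out somewhat below the -Xmx setting, depending on the collector in use.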

The actual interaction with the GC can be hard to predict, and Sun's
GC seems to like to sometimes use as much memory as it has been given.

If you're running JDK 6, you can run the virtualvm, or jconsole to get
a better handle on the memory usage, and even dig into what it might
be used for.

David



Re: Space usage of lazy seqs

2009-12-02 Thread Dave M


On Dec 2, 9:09 pm, David Brown cloj...@davidb.org wrote:
...
 If you're running JDK 6, you can run the virtualvm, or jconsole to get
 a better handle on the memory usage, and even dig into what it might
 be used for.

Google does not return useful references to a tool called virtualvm;
perhaps you mean VisualVM (jvisualvm)?

-Dave



Re: Space usage of lazy seqs

2009-12-02 Thread Johann Hibschman
On Dec 2, 9:09 pm, David Brown cloj...@davidb.org wrote:
 How much memory do you have on your machine?  A recent Sun JVM on a
 machine with a bunch of memory will consider it to be a server
 machine.  It will set the heap max to 1/4 of total physical memory
 (which suggests you might have 16GB of RAM).

I have 96 GB, so I'm not in danger of running out. I just want to
understand if I'm using the sequence functions properly, so that I can
run a few instances of this, plus some R, etc.

 You can tune the max with -Xmx1G for example, to limit it to one GB.

That's a good idea; then I'll know for sure if it's keeping a handle
to the entire file.

 If you're running JDK 6, you can run the virtualvm, or jconsole to get
 a better handle on the memory usage, and even dig into what it might
 be used for.

Ah, I'd forgotten about jconsole. Well, I'll muddle around and see
what I can figure out.

Thanks,
Johann
