Constructing Nested Trees and Memory Consumption

2010-11-03 Thread Paul Ingles
Hi,

I've been playing around with breaking apart a list of postal codes to
be stored in a tree with leaf nodes containing information about that
area. I have some code that works with medium-ish size inputs but
fails with a GC Overhead error with larger input sets (1.5m rows) and
would really appreciate anyone being able to point me in the right
direction.

The full code is up as a gist here: https://gist.github.com/661278

My input file contains something like:

SW1A 1,10
SW1A 2,20
...

Which are then mapped to 2 trees:

{S {W {1 {A {1 10}
{S {W {1 {A {2 20}

I then want to continually fold those trees into a master tree. For
the 2 maps above the merged tree would be:

{S {W {1 {A {1 10 2 20}

I'm sure I'm missing all kinds of awesome core/contrib functions to
make it more concise and would appreciate anyone pointing out
alternatives also.

The main problem is that it fails when my input data gets sufficiently
large. On my MacBook Pro it falls over with an input set of 1.5m
records (although a lot of these would be branches from the first few
levels). It reports GC Overhead limit exceeded, although also ran out
of heap size before I upped that.

I assume this is because during the tree reduction it's still
retaining references to nodes eventually causing it to build
continually larger structures?

I've included the reduce function (and how that gets called to produce
results) inline:

(defn merge-tree
  [tree other]
  (if (not (map? other))
tree
(merge-with (fn [x y] (merge-tree x y))
tree other)))

(def results (reduce merge-tree
 {}
 (map record-to-tree
  postcodes-from-file)))

All help much appreciated,
Paul

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


Re: Constructing Nested Trees and Memory Consumption

2010-11-03 Thread Ken Wesson
On Wed, Nov 3, 2010 at 12:22 PM, Paul Ingles p...@forward.co.uk wrote:

 (defn merge-tree
  [tree other]
  (if (not (map? other))
tree
(merge-with (fn [x y] (merge-tree x y))
tree other)))


You can get rid of the anonymous function here and just do (merge-with
merge-tree tree other).

That shouldn't be the cause of your problem, though.

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: Constructing Nested Trees and Memory Consumption

2010-11-03 Thread Paul Mooser
I could be missing something, but since you say you're running into
problems as your data gets large, I think you're possibly running into
2 things:

1. You're reading all the data into memory, and keeping it there, in
the form of the lines from the file.
2. The way you're defining postcodes-from-file results in holding on
to the head of the sequence, meaning parts of the sequence that aren't
being used anymore can't be garbage collected. You probably want to
make this into a function rather than a def, so that it gets created
when you invoke the function, but not bound anywhere that causes it to
be a gc root.

Does that make sense?

On Nov 3, 9:22 am, Paul Ingles p...@forward.co.uk wrote:
 Hi,

 I've been playing around with breaking apart a list of postal codes to
 be stored in a tree with leaf nodes containing information about that
 area. I have some code that works with medium-ish size inputs but
 fails with a GC Overhead error with larger input sets (1.5m rows) and
 would really appreciate anyone being able to point me in the right
 direction.

 The full code is up as a gist here:https://gist.github.com/661278

 My input file contains something like:

 SW1A 1,10
 SW1A 2,20
 ...

 Which are then mapped to 2 trees:

 {S {W {1 {A {1 10}
 {S {W {1 {A {2 20}

 I then want to continually fold those trees into a master tree. For
 the 2 maps above the merged tree would be:

 {S {W {1 {A {1 10 2 20}

 I'm sure I'm missing all kinds of awesome core/contrib functions to
 make it more concise and would appreciate anyone pointing out
 alternatives also.

 The main problem is that it fails when my input data gets sufficiently
 large. On my MacBook Pro it falls over with an input set of 1.5m
 records (although a lot of these would be branches from the first few
 levels). It reports GC Overhead limit exceeded, although also ran out
 of heap size before I upped that.

 I assume this is because during the tree reduction it's still
 retaining references to nodes eventually causing it to build
 continually larger structures?

 I've included the reduce function (and how that gets called to produce
 results) inline:

 (defn merge-tree
   [tree other]
   (if (not (map? other))
     tree
     (merge-with (fn [x y] (merge-tree x y))
                 tree other)))

 (def results (reduce merge-tree
                      {}
                      (map record-to-tree
                           postcodes-from-file)))

 All help much appreciated,
 Paul

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


Re: Constructing Nested Trees and Memory Consumption

2010-11-03 Thread Leif Walsh
Why not just sort the text file and then build the merged trees
directly, without the numerous intermediate trees?

On Wed, Nov 3, 2010 at 12:22 PM, Paul Ingles p...@forward.co.uk wrote:
 Hi,

 I've been playing around with breaking apart a list of postal codes to
 be stored in a tree with leaf nodes containing information about that
 area. I have some code that works with medium-ish size inputs but
 fails with a GC Overhead error with larger input sets (1.5m rows) and
 would really appreciate anyone being able to point me in the right
 direction.

 The full code is up as a gist here: https://gist.github.com/661278

 My input file contains something like:

 SW1A 1,10
 SW1A 2,20
 ...

 Which are then mapped to 2 trees:

 {S {W {1 {A {1 10}
 {S {W {1 {A {2 20}

 I then want to continually fold those trees into a master tree. For
 the 2 maps above the merged tree would be:

 {S {W {1 {A {1 10 2 20}

 I'm sure I'm missing all kinds of awesome core/contrib functions to
 make it more concise and would appreciate anyone pointing out
 alternatives also.

 The main problem is that it fails when my input data gets sufficiently
 large. On my MacBook Pro it falls over with an input set of 1.5m
 records (although a lot of these would be branches from the first few
 levels). It reports GC Overhead limit exceeded, although also ran out
 of heap size before I upped that.

 I assume this is because during the tree reduction it's still
 retaining references to nodes eventually causing it to build
 continually larger structures?

 I've included the reduce function (and how that gets called to produce
 results) inline:

 (defn merge-tree
  [tree other]
  (if (not (map? other))
    tree
    (merge-with (fn [x y] (merge-tree x y))
                tree other)))

 (def results (reduce merge-tree
                     {}
                     (map record-to-tree
                          postcodes-from-file)))

 All help much appreciated,
 Paul

 --
 You received this message because you are subscribed to the Google
 Groups Clojure group.
 To post to this group, send email to clojure@googlegroups.com
 Note that posts from new members are moderated - please be patient with your 
 first post.
 To unsubscribe from this group, send email to
 clojure+unsubscr...@googlegroups.com
 For more options, visit this group at
 http://groups.google.com/group/clojure?hl=en



-- 
Cheers,
Leif

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


Re: Constructing Nested Trees and Memory Consumption

2010-11-03 Thread Andy Fingerhut
In addition to this, note that every java.lang.String, i.e. every A  
in your example data structures below, requires storage for one  
java.lang.Object (8 bytes on 32-bit JVMs, 16 bytes on 64-bit JVMs)  
plus an array of java char's, where I think that is equal to the size  
of a java.lang.Object plus 2 bytes for each char.


So A requires 8+8+2=18 bytes on a 32-bit JVM (perhaps more with  
padding of these objects for memory alignment purposes), or 16+16+2=34  
bytes on a 64-bit JVM (again, perhaps more with padding).  That is  
before taking into account the memory required for Clojure sets.


I'd recommend changing your code to use Java chars instead of single- 
char Java strings.  That plus the change Paul suggests below should  
reduce memory consumption significantly.


Andy

On Nov 3, 2010, at 5:26 PM, Paul Mooser wrote:


I could be missing something, but since you say you're running into
problems as your data gets large, I think you're possibly running into
2 things:

1. You're reading all the data into memory, and keeping it there, in
the form of the lines from the file.
2. The way you're defining postcodes-from-file results in holding on
to the head of the sequence, meaning parts of the sequence that aren't
being used anymore can't be garbage collected. You probably want to
make this into a function rather than a def, so that it gets created
when you invoke the function, but not bound anywhere that causes it to
be a gc root.

Does that make sense?

On Nov 3, 9:22 am, Paul Ingles p...@forward.co.uk wrote:

Hi,

I've been playing around with breaking apart a list of postal codes  
to

be stored in a tree with leaf nodes containing information about that
area. I have some code that works with medium-ish size inputs but
fails with a GC Overhead error with larger input sets (1.5m rows) and
would really appreciate anyone being able to point me in the right
direction.

The full code is up as a gist here:https://gist.github.com/661278

My input file contains something like:

SW1A 1,10
SW1A 2,20
...

Which are then mapped to 2 trees:

{S {W {1 {A {1 10}
{S {W {1 {A {2 20}

I then want to continually fold those trees into a master tree. For
the 2 maps above the merged tree would be:

{S {W {1 {A {1 10 2 20}

I'm sure I'm missing all kinds of awesome core/contrib functions to
make it more concise and would appreciate anyone pointing out
alternatives also.

The main problem is that it fails when my input data gets  
sufficiently

large. On my MacBook Pro it falls over with an input set of 1.5m
records (although a lot of these would be branches from the first few
levels). It reports GC Overhead limit exceeded, although also ran out
of heap size before I upped that.

I assume this is because during the tree reduction it's still
retaining references to nodes eventually causing it to build
continually larger structures?

I've included the reduce function (and how that gets called to  
produce

results) inline:

(defn merge-tree
  [tree other]
  (if (not (map? other))
tree
(merge-with (fn [x y] (merge-tree x y))
tree other)))

(def results (reduce merge-tree
 {}
 (map record-to-tree
  postcodes-from-file)))

All help much appreciated,
Paul


--
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient  
with your first post.

To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


--
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


Re: Constructing Nested Trees and Memory Consumption

2010-11-03 Thread Michael Gardner
On Nov 3, 2010, at 8:31 PM, Andy Fingerhut wrote:

 I'd recommend changing your code to use Java chars instead of single-char 
 Java strings.  That plus the change Paul suggests below should reduce memory 
 consumption significantly.

Another option is to .intern() the strings. You'll still create lots of garbage 
while reading the input file, though.

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


Re: Constructing Nested Trees and Memory Consumption

2010-11-03 Thread Justin Kramer
Check out assoc-in, get-in, and update-in. They make working with
nested maps a breeze. Here's a rewrite of your code:

(ns user
  (:require [clojure.string :as str]
[clojure.java.io :as io]))

(def postcode-trie
  (with-open [r (io/reader /path/to/data.csv)]
(reduce
 (fn [trie line]
   (let [[postcode region] (.split line ,)
 postcode (str/replace postcode   )]
 (assoc-in trie postcode region)))
 {}
 (line-seq r

(get-in postcode-trie SW1A2)
;; = 20

This stores keys of the tree (trie) as characters instead of strings,
which lets you use get-in easily.

Using line-seq might help mitigate unnecessary memory usage, though as
Andy mentions, Java objects just carry a lot of baggage.

Justin

On Nov 3, 12:22 pm, Paul Ingles p...@forward.co.uk wrote:
 Hi,

 I've been playing around with breaking apart a list of postal codes to
 be stored in a tree with leaf nodes containing information about that
 area. I have some code that works with medium-ish size inputs but
 fails with a GC Overhead error with larger input sets (1.5m rows) and
 would really appreciate anyone being able to point me in the right
 direction.

 The full code is up as a gist here:https://gist.github.com/661278

 My input file contains something like:

 SW1A 1,10
 SW1A 2,20
 ...

 Which are then mapped to 2 trees:

 {S {W {1 {A {1 10}
 {S {W {1 {A {2 20}

 I then want to continually fold those trees into a master tree. For
 the 2 maps above the merged tree would be:

 {S {W {1 {A {1 10 2 20}

 I'm sure I'm missing all kinds of awesome core/contrib functions to
 make it more concise and would appreciate anyone pointing out
 alternatives also.

 The main problem is that it fails when my input data gets sufficiently
 large. On my MacBook Pro it falls over with an input set of 1.5m
 records (although a lot of these would be branches from the first few
 levels). It reports GC Overhead limit exceeded, although also ran out
 of heap size before I upped that.

 I assume this is because during the tree reduction it's still
 retaining references to nodes eventually causing it to build
 continually larger structures?

 I've included the reduce function (and how that gets called to produce
 results) inline:

 (defn merge-tree
   [tree other]
   (if (not (map? other))
     tree
     (merge-with (fn [x y] (merge-tree x y))
                 tree other)))

 (def results (reduce merge-tree
                      {}
                      (map record-to-tree
                           postcodes-from-file)))

 All help much appreciated,
 Paul

-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


Re: Constructing Nested Trees and Memory Consumption

2010-11-03 Thread Andy Fingerhut
One more suggestion:  Try simply creating one giant map that maps  
complete postal code strings directly to the associated data values,  
without any tree data structure explicitly created in your code.  The  
code will be a lot simpler, and the underlying data structure is  
likely to be competitive memory-wise and access-time-wise to the data  
structure you are currently using.  You can also use a transient map  
[1] as long as you are content to limit yourself to a single thread  
doing the reading and map construction -- it will be faster than a non- 
transient map.  I suspect it might generate somewhat less garbage  
needing collection while you are building it, too.


[1] see http://clojure.org/transients

Andy

On Nov 3, 2010, at 6:31 PM, Andy Fingerhut wrote:

In addition to this, note that every java.lang.String, i.e. every  
A in your example data structures below, requires storage for one  
java.lang.Object (8 bytes on 32-bit JVMs, 16 bytes on 64-bit JVMs)  
plus an array of java char's, where I think that is equal to the  
size of a java.lang.Object plus 2 bytes for each char.


So A requires 8+8+2=18 bytes on a 32-bit JVM (perhaps more with  
padding of these objects for memory alignment purposes), or  
16+16+2=34 bytes on a 64-bit JVM (again, perhaps more with  
padding).  That is before taking into account the memory required  
for Clojure sets.


I'd recommend changing your code to use Java chars instead of single- 
char Java strings.  That plus the change Paul suggests below should  
reduce memory consumption significantly.


Andy

On Nov 3, 2010, at 5:26 PM, Paul Mooser wrote:


I could be missing something, but since you say you're running into
problems as your data gets large, I think you're possibly running  
into

2 things:

1. You're reading all the data into memory, and keeping it there, in
the form of the lines from the file.
2. The way you're defining postcodes-from-file results in holding on
to the head of the sequence, meaning parts of the sequence that  
aren't

being used anymore can't be garbage collected. You probably want to
make this into a function rather than a def, so that it gets created
when you invoke the function, but not bound anywhere that causes it  
to

be a gc root.

Does that make sense?

On Nov 3, 9:22 am, Paul Ingles p...@forward.co.uk wrote:

Hi,

I've been playing around with breaking apart a list of postal  
codes to
be stored in a tree with leaf nodes containing information about  
that

area. I have some code that works with medium-ish size inputs but
fails with a GC Overhead error with larger input sets (1.5m rows)  
and

would really appreciate anyone being able to point me in the right
direction.

The full code is up as a gist here:https://gist.github.com/661278

My input file contains something like:

SW1A 1,10
SW1A 2,20
...

Which are then mapped to 2 trees:

{S {W {1 {A {1 10}
{S {W {1 {A {2 20}

I then want to continually fold those trees into a master tree. For
the 2 maps above the merged tree would be:

{S {W {1 {A {1 10 2 20}

I'm sure I'm missing all kinds of awesome core/contrib functions to
make it more concise and would appreciate anyone pointing out
alternatives also.

The main problem is that it fails when my input data gets  
sufficiently

large. On my MacBook Pro it falls over with an input set of 1.5m
records (although a lot of these would be branches from the first  
few
levels). It reports GC Overhead limit exceeded, although also ran  
out

of heap size before I upped that.

I assume this is because during the tree reduction it's still
retaining references to nodes eventually causing it to build
continually larger structures?

I've included the reduce function (and how that gets called to  
produce

results) inline:

(defn merge-tree
 [tree other]
 (if (not (map? other))
   tree
   (merge-with (fn [x y] (merge-tree x y))
   tree other)))

(def results (reduce merge-tree
{}
(map record-to-tree
 postcodes-from-file)))

All help much appreciated,
Paul


--
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient  
with your first post.

To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en




--
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en