One more suggestion: Try simply creating one giant map that maps complete postal code strings directly to the associated data values, without any tree data structure explicitly created in your code. The code will be a lot simpler, and the underlying data structure is likely to be competitive memory-wise and access-time-wise to the data structure you are currently using. You can also use a transient map [1] as long as you are content to limit yourself to a single thread doing the reading and map construction -- it will be faster than a non- transient map. I suspect it might generate somewhat less garbage needing collection while you are building it, too.

[1] see http://clojure.org/transients

Andy

On Nov 3, 2010, at 6:31 PM, Andy Fingerhut wrote:

In addition to this, note that every java.lang.String, i.e. every "A" in your example data structures below, requires storage for one java.lang.Object (8 bytes on 32-bit JVMs, 16 bytes on 64-bit JVMs) plus an array of java char's, where I think that is equal to the size of a java.lang.Object plus 2 bytes for each char.

So "A" requires 8+8+2=18 bytes on a 32-bit JVM (perhaps more with padding of these objects for memory alignment purposes), or 16+16+2=34 bytes on a 64-bit JVM (again, perhaps more with padding). That is before taking into account the memory required for Clojure sets.

I'd recommend changing your code to use Java chars instead of single- char Java strings. That plus the change Paul suggests below should reduce memory consumption significantly.

Andy

On Nov 3, 2010, at 5:26 PM, Paul Mooser wrote:

I could be missing something, but since you say you're running into
problems as your data gets large, I think you're possibly running into
2 things:

1. You're reading all the data into memory, and keeping it there, in
the form of the lines from the file.
2. The way you're defining postcodes-from-file results in holding on
to the head of the sequence, meaning parts of the sequence that aren't
being used anymore can't be garbage collected. You probably want to
make this into a function rather than a def, so that it gets created
when you invoke the function, but not bound anywhere that causes it to
be a gc root.

Does that make sense?

On Nov 3, 9:22 am, Paul Ingles <p...@forward.co.uk> wrote:
Hi,

I've been playing around with breaking apart a list of postal codes to be stored in a tree with leaf nodes containing information about that
area. I have some code that works with medium-ish size inputs but
fails with a GC Overhead error with larger input sets (1.5m rows) and
would really appreciate anyone being able to point me in the right
direction.

The full code is up as a gist here:https://gist.github.com/661278

My input file contains something like:

SW1A 1,10
SW1A 2,20
...

Which are then mapped to 2 trees:

{"S" {"W" {"1" {"A" {"1" 10}}}}}
{"S" {"W" {"1" {"A" {"2" 20}}}}}

I then want to continually fold those trees into a master tree. For
the 2 maps above the merged tree would be:

{"S" {"W" {"1" {"A" {"1" 10 "2" 20}}}}}

I'm sure I'm missing all kinds of awesome core/contrib functions to
make it more concise and would appreciate anyone pointing out
alternatives also.

The main problem is that it fails when my input data gets sufficiently
large. On my MacBook Pro it falls over with an input set of 1.5m
records (although a lot of these would be branches from the first few levels). It reports GC Overhead limit exceeded, although also ran out
of heap size before I upped that.

I assume this is because during the tree reduction it's still
retaining references to nodes eventually causing it to build
continually larger structures?

I've included the reduce function (and how that gets called to produce
results) inline:

(defn merge-tree
 [tree other]
 (if (not (map? other))
   tree
   (merge-with (fn [x y] (merge-tree x y))
               tree other)))

(def results (reduce merge-tree
                    {}
                    (map record-to-tree
                         postcodes-from-file)))

All help much appreciated,
Paul

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Reply via email to