Re: Constructing Nested Trees and Memory Consumption

Andy Fingerhut Wed, 03 Nov 2010 21:17:22 -0700

One more suggestion: Try simply creating one giant map that mapscomplete postal code strings directly to the associated data values,without any tree data structure explicitly created in your code. Thecode will be a lot simpler, and the underlying data structure islikely to be competitive memory-wise and access-time-wise to the datastructure you are currently using. You can also use a transient map[1] as long as you are content to limit yourself to a single threaddoing the reading and map construction -- it will be faster than a non-transient map. I suspect it might generate somewhat less garbageneeding collection while you are building it, too.


[1] see http://clojure.org/transients


Andy

On Nov 3, 2010, at 6:31 PM, Andy Fingerhut wrote:

In addition to this, note that every java.lang.String, i.e. every"A" in your example data structures below, requires storage for onejava.lang.Object (8 bytes on 32-bit JVMs, 16 bytes on 64-bit JVMs)plus an array of java char's, where I think that is equal to thesize of a java.lang.Object plus 2 bytes for each char.
So "A" requires 8+8+2=18 bytes on a 32-bit JVM (perhaps more withpadding of these objects for memory alignment purposes), or16+16+2=34 bytes on a 64-bit JVM (again, perhaps more withpadding). That is before taking into account the memory requiredfor Clojure sets.
I'd recommend changing your code to use Java chars instead of single-char Java strings. That plus the change Paul suggests below shouldreduce memory consumption significantly.
Andy

On Nov 3, 2010, at 5:26 PM, Paul Mooser wrote:
I could be missing something, but since you say you're running into
problems as your data gets large, I think you're possibly runninginto
2 things:

1. You're reading all the data into memory, and keeping it there, in
the form of the lines from the file.
2. The way you're defining postcodes-from-file results in holding on
to the head of the sequence, meaning parts of the sequence thataren't
being used anymore can't be garbage collected. You probably want to
make this into a function rather than a def, so that it gets created
when you invoke the function, but not bound anywhere that causes itto
be a gc root.

Does that make sense?

On Nov 3, 9:22 am, Paul Ingles <p...@forward.co.uk> wrote:
Hi,
I've been playing around with breaking apart a list of postalcodes tobe stored in a tree with leaf nodes containing information aboutthat
area. I have some code that works with medium-ish size inputs but
fails with a GC Overhead error with larger input sets (1.5m rows)and
would really appreciate anyone being able to point me in the right
direction.

The full code is up as a gist here:https://gist.github.com/661278

My input file contains something like:

SW1A 1,10
SW1A 2,20
...

Which are then mapped to 2 trees:

{"S" {"W" {"1" {"A" {"1" 10}}}}}
{"S" {"W" {"1" {"A" {"2" 20}}}}}

I then want to continually fold those trees into a master tree. For
the 2 maps above the merged tree would be:

{"S" {"W" {"1" {"A" {"1" 10 "2" 20}}}}}

I'm sure I'm missing all kinds of awesome core/contrib functions to
make it more concise and would appreciate anyone pointing out
alternatives also.
The main problem is that it fails when my input data getssufficiently
large. On my MacBook Pro it falls over with an input set of 1.5m
records (although a lot of these would be branches from the firstfewlevels). It reports GC Overhead limit exceeded, although also ranout
of heap size before I upped that.

I assume this is because during the tree reduction it's still
retaining references to nodes eventually causing it to build
continually larger structures?
I've included the reduce function (and how that gets called toproduce
results) inline:

(defn merge-tree
 [tree other]
 (if (not (map? other))
   tree
   (merge-with (fn [x y] (merge-tree x y))
               tree other)))

(def results (reduce merge-tree
                    {}
                    (map record-to-tree
                         postcodes-from-file)))

All help much appreciated,
Paul
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patientwith your first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en


--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en

Re: Constructing Nested Trees and Memory Consumption

Reply via email to