Constructing Nested Trees and Memory Consumption
Hi, I've been playing around with breaking apart a list of postal codes to be stored in a tree with leaf nodes containing information about that area. I have some code that works with medium-ish size inputs but fails with a GC Overhead error with larger input sets (1.5m rows) and would really appreciate anyone being able to point me in the right direction. The full code is up as a gist here: https://gist.github.com/661278 My input file contains something like: SW1A 1,10 SW1A 2,20 ... Which are then mapped to 2 trees: {S {W {1 {A {1 10} {S {W {1 {A {2 20} I then want to continually fold those trees into a master tree. For the 2 maps above the merged tree would be: {S {W {1 {A {1 10 2 20} I'm sure I'm missing all kinds of awesome core/contrib functions to make it more concise and would appreciate anyone pointing out alternatives also. The main problem is that it fails when my input data gets sufficiently large. On my MacBook Pro it falls over with an input set of 1.5m records (although a lot of these would be branches from the first few levels). It reports GC Overhead limit exceeded, although also ran out of heap size before I upped that. I assume this is because during the tree reduction it's still retaining references to nodes eventually causing it to build continually larger structures? I've included the reduce function (and how that gets called to produce results) inline: (defn merge-tree [tree other] (if (not (map? other)) tree (merge-with (fn [x y] (merge-tree x y)) tree other))) (def results (reduce merge-tree {} (map record-to-tree postcodes-from-file))) All help much appreciated, Paul -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: Constructing Nested Trees and Memory Consumption
On Wed, Nov 3, 2010 at 12:22 PM, Paul Ingles p...@forward.co.uk wrote: (defn merge-tree [tree other] (if (not (map? other)) tree (merge-with (fn [x y] (merge-tree x y)) tree other))) You can get rid of the anonymous function here and just do (merge-with merge-tree tree other). That shouldn't be the cause of your problem, though. -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: Constructing Nested Trees and Memory Consumption
I could be missing something, but since you say you're running into problems as your data gets large, I think you're possibly running into 2 things: 1. You're reading all the data into memory, and keeping it there, in the form of the lines from the file. 2. The way you're defining postcodes-from-file results in holding on to the head of the sequence, meaning parts of the sequence that aren't being used anymore can't be garbage collected. You probably want to make this into a function rather than a def, so that it gets created when you invoke the function, but not bound anywhere that causes it to be a gc root. Does that make sense? On Nov 3, 9:22 am, Paul Ingles p...@forward.co.uk wrote: Hi, I've been playing around with breaking apart a list of postal codes to be stored in a tree with leaf nodes containing information about that area. I have some code that works with medium-ish size inputs but fails with a GC Overhead error with larger input sets (1.5m rows) and would really appreciate anyone being able to point me in the right direction. The full code is up as a gist here:https://gist.github.com/661278 My input file contains something like: SW1A 1,10 SW1A 2,20 ... Which are then mapped to 2 trees: {S {W {1 {A {1 10} {S {W {1 {A {2 20} I then want to continually fold those trees into a master tree. For the 2 maps above the merged tree would be: {S {W {1 {A {1 10 2 20} I'm sure I'm missing all kinds of awesome core/contrib functions to make it more concise and would appreciate anyone pointing out alternatives also. The main problem is that it fails when my input data gets sufficiently large. On my MacBook Pro it falls over with an input set of 1.5m records (although a lot of these would be branches from the first few levels). It reports GC Overhead limit exceeded, although also ran out of heap size before I upped that. I assume this is because during the tree reduction it's still retaining references to nodes eventually causing it to build continually larger structures? I've included the reduce function (and how that gets called to produce results) inline: (defn merge-tree [tree other] (if (not (map? other)) tree (merge-with (fn [x y] (merge-tree x y)) tree other))) (def results (reduce merge-tree {} (map record-to-tree postcodes-from-file))) All help much appreciated, Paul -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: Constructing Nested Trees and Memory Consumption
Why not just sort the text file and then build the merged trees directly, without the numerous intermediate trees? On Wed, Nov 3, 2010 at 12:22 PM, Paul Ingles p...@forward.co.uk wrote: Hi, I've been playing around with breaking apart a list of postal codes to be stored in a tree with leaf nodes containing information about that area. I have some code that works with medium-ish size inputs but fails with a GC Overhead error with larger input sets (1.5m rows) and would really appreciate anyone being able to point me in the right direction. The full code is up as a gist here: https://gist.github.com/661278 My input file contains something like: SW1A 1,10 SW1A 2,20 ... Which are then mapped to 2 trees: {S {W {1 {A {1 10} {S {W {1 {A {2 20} I then want to continually fold those trees into a master tree. For the 2 maps above the merged tree would be: {S {W {1 {A {1 10 2 20} I'm sure I'm missing all kinds of awesome core/contrib functions to make it more concise and would appreciate anyone pointing out alternatives also. The main problem is that it fails when my input data gets sufficiently large. On my MacBook Pro it falls over with an input set of 1.5m records (although a lot of these would be branches from the first few levels). It reports GC Overhead limit exceeded, although also ran out of heap size before I upped that. I assume this is because during the tree reduction it's still retaining references to nodes eventually causing it to build continually larger structures? I've included the reduce function (and how that gets called to produce results) inline: (defn merge-tree [tree other] (if (not (map? other)) tree (merge-with (fn [x y] (merge-tree x y)) tree other))) (def results (reduce merge-tree {} (map record-to-tree postcodes-from-file))) All help much appreciated, Paul -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en -- Cheers, Leif -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: Constructing Nested Trees and Memory Consumption
In addition to this, note that every java.lang.String, i.e. every A in your example data structures below, requires storage for one java.lang.Object (8 bytes on 32-bit JVMs, 16 bytes on 64-bit JVMs) plus an array of java char's, where I think that is equal to the size of a java.lang.Object plus 2 bytes for each char. So A requires 8+8+2=18 bytes on a 32-bit JVM (perhaps more with padding of these objects for memory alignment purposes), or 16+16+2=34 bytes on a 64-bit JVM (again, perhaps more with padding). That is before taking into account the memory required for Clojure sets. I'd recommend changing your code to use Java chars instead of single- char Java strings. That plus the change Paul suggests below should reduce memory consumption significantly. Andy On Nov 3, 2010, at 5:26 PM, Paul Mooser wrote: I could be missing something, but since you say you're running into problems as your data gets large, I think you're possibly running into 2 things: 1. You're reading all the data into memory, and keeping it there, in the form of the lines from the file. 2. The way you're defining postcodes-from-file results in holding on to the head of the sequence, meaning parts of the sequence that aren't being used anymore can't be garbage collected. You probably want to make this into a function rather than a def, so that it gets created when you invoke the function, but not bound anywhere that causes it to be a gc root. Does that make sense? On Nov 3, 9:22 am, Paul Ingles p...@forward.co.uk wrote: Hi, I've been playing around with breaking apart a list of postal codes to be stored in a tree with leaf nodes containing information about that area. I have some code that works with medium-ish size inputs but fails with a GC Overhead error with larger input sets (1.5m rows) and would really appreciate anyone being able to point me in the right direction. The full code is up as a gist here:https://gist.github.com/661278 My input file contains something like: SW1A 1,10 SW1A 2,20 ... Which are then mapped to 2 trees: {S {W {1 {A {1 10} {S {W {1 {A {2 20} I then want to continually fold those trees into a master tree. For the 2 maps above the merged tree would be: {S {W {1 {A {1 10 2 20} I'm sure I'm missing all kinds of awesome core/contrib functions to make it more concise and would appreciate anyone pointing out alternatives also. The main problem is that it fails when my input data gets sufficiently large. On my MacBook Pro it falls over with an input set of 1.5m records (although a lot of these would be branches from the first few levels). It reports GC Overhead limit exceeded, although also ran out of heap size before I upped that. I assume this is because during the tree reduction it's still retaining references to nodes eventually causing it to build continually larger structures? I've included the reduce function (and how that gets called to produce results) inline: (defn merge-tree [tree other] (if (not (map? other)) tree (merge-with (fn [x y] (merge-tree x y)) tree other))) (def results (reduce merge-tree {} (map record-to-tree postcodes-from-file))) All help much appreciated, Paul -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: Constructing Nested Trees and Memory Consumption
On Nov 3, 2010, at 8:31 PM, Andy Fingerhut wrote: I'd recommend changing your code to use Java chars instead of single-char Java strings. That plus the change Paul suggests below should reduce memory consumption significantly. Another option is to .intern() the strings. You'll still create lots of garbage while reading the input file, though. -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: Constructing Nested Trees and Memory Consumption
Check out assoc-in, get-in, and update-in. They make working with nested maps a breeze. Here's a rewrite of your code: (ns user (:require [clojure.string :as str] [clojure.java.io :as io])) (def postcode-trie (with-open [r (io/reader /path/to/data.csv)] (reduce (fn [trie line] (let [[postcode region] (.split line ,) postcode (str/replace postcode )] (assoc-in trie postcode region))) {} (line-seq r (get-in postcode-trie SW1A2) ;; = 20 This stores keys of the tree (trie) as characters instead of strings, which lets you use get-in easily. Using line-seq might help mitigate unnecessary memory usage, though as Andy mentions, Java objects just carry a lot of baggage. Justin On Nov 3, 12:22 pm, Paul Ingles p...@forward.co.uk wrote: Hi, I've been playing around with breaking apart a list of postal codes to be stored in a tree with leaf nodes containing information about that area. I have some code that works with medium-ish size inputs but fails with a GC Overhead error with larger input sets (1.5m rows) and would really appreciate anyone being able to point me in the right direction. The full code is up as a gist here:https://gist.github.com/661278 My input file contains something like: SW1A 1,10 SW1A 2,20 ... Which are then mapped to 2 trees: {S {W {1 {A {1 10} {S {W {1 {A {2 20} I then want to continually fold those trees into a master tree. For the 2 maps above the merged tree would be: {S {W {1 {A {1 10 2 20} I'm sure I'm missing all kinds of awesome core/contrib functions to make it more concise and would appreciate anyone pointing out alternatives also. The main problem is that it fails when my input data gets sufficiently large. On my MacBook Pro it falls over with an input set of 1.5m records (although a lot of these would be branches from the first few levels). It reports GC Overhead limit exceeded, although also ran out of heap size before I upped that. I assume this is because during the tree reduction it's still retaining references to nodes eventually causing it to build continually larger structures? I've included the reduce function (and how that gets called to produce results) inline: (defn merge-tree [tree other] (if (not (map? other)) tree (merge-with (fn [x y] (merge-tree x y)) tree other))) (def results (reduce merge-tree {} (map record-to-tree postcodes-from-file))) All help much appreciated, Paul -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en
Re: Constructing Nested Trees and Memory Consumption
One more suggestion: Try simply creating one giant map that maps complete postal code strings directly to the associated data values, without any tree data structure explicitly created in your code. The code will be a lot simpler, and the underlying data structure is likely to be competitive memory-wise and access-time-wise to the data structure you are currently using. You can also use a transient map [1] as long as you are content to limit yourself to a single thread doing the reading and map construction -- it will be faster than a non- transient map. I suspect it might generate somewhat less garbage needing collection while you are building it, too. [1] see http://clojure.org/transients Andy On Nov 3, 2010, at 6:31 PM, Andy Fingerhut wrote: In addition to this, note that every java.lang.String, i.e. every A in your example data structures below, requires storage for one java.lang.Object (8 bytes on 32-bit JVMs, 16 bytes on 64-bit JVMs) plus an array of java char's, where I think that is equal to the size of a java.lang.Object plus 2 bytes for each char. So A requires 8+8+2=18 bytes on a 32-bit JVM (perhaps more with padding of these objects for memory alignment purposes), or 16+16+2=34 bytes on a 64-bit JVM (again, perhaps more with padding). That is before taking into account the memory required for Clojure sets. I'd recommend changing your code to use Java chars instead of single- char Java strings. That plus the change Paul suggests below should reduce memory consumption significantly. Andy On Nov 3, 2010, at 5:26 PM, Paul Mooser wrote: I could be missing something, but since you say you're running into problems as your data gets large, I think you're possibly running into 2 things: 1. You're reading all the data into memory, and keeping it there, in the form of the lines from the file. 2. The way you're defining postcodes-from-file results in holding on to the head of the sequence, meaning parts of the sequence that aren't being used anymore can't be garbage collected. You probably want to make this into a function rather than a def, so that it gets created when you invoke the function, but not bound anywhere that causes it to be a gc root. Does that make sense? On Nov 3, 9:22 am, Paul Ingles p...@forward.co.uk wrote: Hi, I've been playing around with breaking apart a list of postal codes to be stored in a tree with leaf nodes containing information about that area. I have some code that works with medium-ish size inputs but fails with a GC Overhead error with larger input sets (1.5m rows) and would really appreciate anyone being able to point me in the right direction. The full code is up as a gist here:https://gist.github.com/661278 My input file contains something like: SW1A 1,10 SW1A 2,20 ... Which are then mapped to 2 trees: {S {W {1 {A {1 10} {S {W {1 {A {2 20} I then want to continually fold those trees into a master tree. For the 2 maps above the merged tree would be: {S {W {1 {A {1 10 2 20} I'm sure I'm missing all kinds of awesome core/contrib functions to make it more concise and would appreciate anyone pointing out alternatives also. The main problem is that it fails when my input data gets sufficiently large. On my MacBook Pro it falls over with an input set of 1.5m records (although a lot of these would be branches from the first few levels). It reports GC Overhead limit exceeded, although also ran out of heap size before I upped that. I assume this is because during the tree reduction it's still retaining references to nodes eventually causing it to build continually larger structures? I've included the reduce function (and how that gets called to produce results) inline: (defn merge-tree [tree other] (if (not (map? other)) tree (merge-with (fn [x y] (merge-tree x y)) tree other))) (def results (reduce merge-tree {} (map record-to-tree postcodes-from-file))) All help much appreciated, Paul -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en -- You received this message because you are subscribed to the Google Groups Clojure group. To post to this group, send email to clojure@googlegroups.com Note that posts from new members are moderated - please be patient with your first post. To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/clojure?hl=en