I wrote some sample code to process the English Wikipedia dump file (roughly 40 GB), using nothing but core Clojure and a bzip library.
I'll put it on GitHub to show you.

I hope it helps.

Plinio Balduino

> On 26/10/2014, at 23:51, Alan Busby <thebu...@gmail.com> wrote:
>
>> On Mon, Oct 27, 2014 at 7:10 AM, Brian Craft <craft.br...@gmail.com> wrote:
>> I found iota, which looks like a good solution for the read portion of the
>> problem. However I also need to process the data in the file. If I start
>> with an iota/vec and need to sort it, something like
>>
>> (sort (iota/vec "foo"))
>
> Short disclaimer: I'm the one to blame for Iota.
>
> In this situation I've found it easier just to use GNU sort, or an external
> tool, and then use Iota with the pre-sorted file. Or generate an index with a
> smaller subset of the data.
>
> Examples:
> $ cat data.tsv | sort -k2 > sorted_data.tsv
> or
> (def data (iota/numbered-vec "data.tsv"))
> (def index (->> data
>                 (map (fn [line]
>                        (let [[linenum key & _]
>                              (clojure.string/split line #"\t" -1)]
>                          [key linenum])))
>                 (into (sorted-map))))
>
> While Iota does use mmap under the hood for reading/caching, it can't reduce
> memory consumption if all of the data is converted to strings. For most
> "Clojurey" operations though, strings are preferred. So the trick is to not
> realize the entire data set in memory and to treat it like a stream instead.
>
> I'd love to add a mechanism to Iota to solve this problem, but I haven't come
> upon a good solution yet.
> I'm all ears though!
>
> Hope this helps,
> TheBusby

--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
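
The advice in the quoted reply, to keep only a small index in memory and treat the file itself as a stream, can be sketched with core Clojure alone. This is not code from the thread and it does not use Iota: the function name `build-index` is hypothetical, and the sketch assumes a tab-separated file whose first field is the key.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Hypothetical helper, not part of Iota or of the code discussed above.
;; Reads the file one line at a time and keeps only [key line-number]
;; pairs, so the full text of the file is never held in memory at once.
(defn build-index
  "Returns a sorted map from the first tab-separated field of each line
  to that line's zero-based line number. If a key occurs more than once,
  the last occurrence wins."
  [path]
  (with-open [rdr (io/reader path)]
    ;; `into` with a transducer is eager, so every line is consumed
    ;; before `with-open` closes the reader.
    (into (sorted-map)
          (map-indexed (fn [i line]
                         [(first (str/split line #"\t" -1)) i]))
          (line-seq rdr))))
```

Calling `(build-index "data.tsv")` would give a map that can be walked in key order, with the line numbers used to fetch full records afterwards, for example from an `iota/vec` over the same file. Memory use then grows with the number of distinct keys rather than with the size of the file.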