Re: processing large text files

Alan Busby Sun, 26 Oct 2014 18:51:31 -0700

On Mon, Oct 27, 2014 at 7:10 AM, Brian Craft <craft.br...@gmail.com> wrote:


> I found iota, which looks like a good solution for the read portion of the
> problem. However I also need to process the data in the file. If I start
> with an iota/vec and need to sort it, something like
>
> (sort (iota/vec "foo"))
>

Short disclaimer: I'm the one to blame for Iota.

In this situation I've found it easier to sort the file first with GNU sort
or another external tool, and then use Iota on the pre-sorted file.
Alternatively, generate an index from a smaller subset of the data, such as
each line's key and line number.

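Examples:
$ sort -k2 data.tsv > sorted_data.tsv
or
(require '[iota] '[clojure.string :as str])

(def data (iota/numbered-vec "data.tsv"))
;; numbered-vec prepends the line number and a tab to each line, so the
;; first field is the line number and the second is the key.
(def index (->> data
                (map (fn [line]
                       (let [[linenum key & _] (str/split line #"\t" -1)]
                         [key linenum])))
                (into (sorted-map))))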
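
To show how such an index might then be used, here is a minimal sketch that
walks the lines in key order. It uses a plain Clojure vector as a stand-in
for the Iota vector so it runs without the file, and the names build-index
and lines-in-key-order are made up for illustration; they are not part of
Iota.

```clojure
(require '[clojure.string :as str])

;; Stand-in for an Iota vector: any indexed collection of line strings.
(def lines ["b\tpear" "a\tplum" "c\tapple"])

(defn build-index
  "Map the tab-separated column key-col of each line to that line's
  position. Duplicate keys overwrite each other, so this assumes keys
  are unique."
  [lines key-col]
  (into (sorted-map)
        (map-indexed (fn [i line]
                       [(nth (str/split line #"\t" -1) key-col) i])
                     lines)))

(defn lines-in-key-order
  "Lazily look up each line by position, in sorted key order."
  [lines index]
  (map (fn [[_ i]] (nth lines i)) index))

;; Only the keys and positions are held in the sorted map; each full
;; line is fetched when the result is consumed.
(def sorted-lines (lines-in-key-order lines (build-index lines 0)))
```

Only the small index needs to fit in memory this way; the lines themselves
are read back one at a time.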

While Iota does use mmap under the hood for reading/caching, it can't
reduce memory consumption if all of the data is converted to strings at
once. Most "Clojurey" operations prefer strings, though. So the trick is
to avoid realizing the entire data set in memory and to treat it like a
stream instead.
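
As a rough illustration of the streaming idea, the sketch below reduces over
a file one line at a time, so only the running result stays in memory. It
uses line-seq from clojure.core rather than Iota, and the function name
count-by-key is hypothetical.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn count-by-key
  "Count occurrences of the tab-separated column key-col, reading the
  lines of source lazily. source is anything io/reader accepts, such as
  a file path or a java.io.Reader."
  [source key-col]
  (with-open [rdr (io/reader source)]
    ;; reduce consumes the lazy seq without holding on to its head, so
    ;; each line can be garbage collected once it has been processed.
    (reduce (fn [counts line]
              (let [k (nth (str/split line #"\t" -1) key-col)]
                (update counts k (fnil inc 0))))
            {}
            (line-seq rdr))))

;; Usage with an in-memory reader; a file path such as "data.tsv" would
;; work the same way.
(def example-counts
  (count-by-key (java.io.StringReader. "1\ta\n2\tb\n3\ta\n") 1))
```

The reduction has to finish inside with-open, because the reader is closed
as soon as that form returns.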

I'd love to add a mechanism to Iota to solve this problem, but I haven't
come upon a good solution yet.
I'm all ears though!

Hope this helps,
   TheBusby

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en