I wrote some sample code to process the English Wikipedia dump file (about 40 GB), 
using nothing but core Clojure and a bzip library. 

I'll put it on GitHub to show you. I hope it helps. 

Plinio Balduino
11 982 611 487

> On 26/10/2014, at 23:51, Alan Busby <thebu...@gmail.com> wrote:
> 
>> On Mon, Oct 27, 2014 at 7:10 AM, Brian Craft <craft.br...@gmail.com> wrote:
>> I found iota, which looks like a good solution for the read portion of the 
>> problem. However I also need to process the data in the file. If I start 
>> with an iota/vec and need to sort it, something like
>> 
>> (sort (iota/vec "foo"))
> 
> 
> Short disclaimer: I'm the one to blame for Iota.
> 
> In this situation I've found it easier just to use GNU sort, or an external 
> tool, and then use Iota with the pre-sorted file. Or generate an index with a 
> smaller subset of the data.
> 
> Examples:
> $ cat data.tsv | sort -k2 > sorted_data.tsv
> or
> (def data (iota/numbered-vec "data.tsv"))
> (def index (->> data
>                 (map (fn [line]
>                        (let [[linenum key & _] (clojure.string/split line #"\t" -1)]
>                          [key linenum])))
>                 (into (sorted-map))))
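
For anyone who wants to try the index idea without Iota, here is a minimal sketch using only core Clojure. It is not Iota's API: `build-index` is a hypothetical helper, and it assumes a plain TSV file whose key is the first tab-separated field (there is no prepended line number, unlike the `numbered-vec` output above).

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn build-index
  "Returns a sorted map of key -> zero-based line number.
  Assumption: the key is the first tab-separated field of each line.
  If a key occurs more than once, the last occurrence wins."
  [path]
  (with-open [rdr (io/reader path)]
    (->> (line-seq rdr)
         (map-indexed (fn [linenum line]
                        [(first (str/split line #"\t" -1)) linenum]))
         ;; `into` is eager, so the file is fully consumed before the reader closes
         (into (sorted-map)))))
```

Only the keys and line numbers stay in memory, so this should only help when the keys are much smaller than the full lines.
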
> 
> While Iota does use mmap under the hood for reading/caching, it can't reduce 
> memory consumption if all of the data is converted to strings. For most 
> "Clojurey" operations, though, strings are preferred. So the trick is to avoid 
> realizing the entire data set in memory and to treat it like a stream 
> instead.
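
A minimal sketch of that streaming approach, using only core Clojure. `count-by-key` is a hypothetical helper, and the file layout (key in the first tab-separated field) is an assumption for illustration. Each line becomes garbage as soon as it has been folded into the result, so memory use should grow with the number of distinct keys rather than with the file size.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn count-by-key
  "Counts how many lines share each key, reading one line at a time.
  Assumption: the key is the first tab-separated field of each line."
  [path]
  (with-open [rdr (io/reader path)]
    ;; reduce consumes the lazy line-seq without holding on to its head
    (reduce (fn [counts line]
              (let [k (first (str/split line #"\t" -1))]
                (assoc counts k (inc (get counts k 0)))))
            {}
            (line-seq rdr))))
```

The same shape works for sums, min/max, or anything else that can be expressed as a fold over lines.
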
> 
> I'd love to add a mechanism to Iota to solve this problem, but I haven't come 
> upon a good solution yet. 
> I'm all ears though!
> 
> Hope this helps,
>    TheBusby
> 
> -- 
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with your 
> first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> --- 
> You received this message because you are subscribed to the Google Groups 
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

