Hi Jarrod

I have had success using the clojure-csv [1] library to process large files 
lazily (as opposed to reading the whole file with slurp).

[1] - clojure-csv - https://github.com/davidsantiago/clojure-csv

Here is a copy of my source code (disclaimer - this is my first Clojure program 
- so some things might not be idiomatic).

This code comfortably handles a 250MB file of 315K rows (each row has 100 
columns / fields). Memory usage stays flat as the file grows, because the file 
is read lazily and each line is parsed and processed one at a time.

See the code below:

(ns scripts.core
  (:require [clojure.java.io :as io]
            [clojure-csv.core :as csv])
  (:gen-class))

(defn parse-row
  "Parses one tab-delimited line into a vector of fields."
  [row]
  (first (csv/parse-csv row :delimiter \tab)))

(defn parse-file
  "Reads the file lazily, one line at a time, and returns the number of
  lines processed. The count is threaded through reduce instead of being
  kept in a global var that is re-def'd on every line."
  [filename]
  (with-open [file (io/reader filename)]
    (reduce (fn [line-count line]
              (let [record (parse-row line)]
                (println record)) ;; replace println record with your own logic
              (inc line-count))
            0
            (line-seq file))))

(defn process-file [filename]
  (println (parse-file filename)))

(defn -main [& args]
  (process-file (first args)))
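
To tie this back to your de-duplication problem, here is a rough sketch of the 
kind of logic that could replace the println. It is not based on your refheap 
code: the unique-urls function, the url-column argument, and the use of 
Clojure's built-in hash as the url-hash are all my assumptions. To keep it 
self-contained it splits on tabs with clojure.string rather than clojure-csv, 
so it ignores quoted fields.

```clojure
(ns scripts.dedupe
  (:require [clojure.java.io :as io]
            [clojure.string :as str]))

(defn unique-urls
  "Returns a vector of [url url-hash] pairs, one per distinct URL, in
  first-seen order. url-column is the zero-based index of the URL field
  (hypothetical; adjust to your export). hash is a stand-in for whatever
  hashing you actually need."
  [filename url-column]
  (with-open [rdr (io/reader filename)]
    (->> (line-seq rdr)                                    ; lazy seq of lines
         (map #(nth (str/split % #"\t") url-column nil))   ; pull out the URL field
         (remove nil?)                                     ; skip short rows
         distinct                                          ; lazy, remembers seen URLs in a set
         (mapv (fn [url] [url (hash url)])))))             ; mapv realizes before the reader closes
```

The mapv at the end matters: a lazy result returned out of with-open would be 
read after the file has already been closed.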

Feel free to ask questions if you need more info.

Kind regards

Rudi

On 21/01/2014, at 5:55 PM, Jarrod Swart <jcsw...@gmail.com> wrote:

> I'm processing a large csv with Clojure, honestly not even that big (~18k 
> rows, 11mb).  I have a list of exported data from a client and I am 
> de-duplicating URLs within the list.  My final output is a series of vectors: 
> [url url-hash].
> 
> The odd thing is how slow it seems to be going.  I have tried implementing 
> this as a reduce, and finally I thought to speed things up I might try a 
> "with-open and a loop-recur".  It doesn't seem to have done much in my case.  
> I know I am doing something wrong I'm just not sure what yet.  The best I can 
> do is about 4 seconds, which may only seem slow because I implemented it in 
> python first and it takes a half second to finish.  Still this is one of the 
> smaller files I will likely deal with so I'm worried that as the files grow 
> it may get too slow.
> 
> The code is here on ref-heap for easy viewing: https://www.refheap.com/26098
> 
> Any advice is appreciated.
> 
> -- 
> -- 
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with your 
> first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> --- 
> You received this message because you are subscribed to the Google Groups 
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/groups/opt_out.
