If ordering isn't important, I'd just dump them all into a set instead of
manually checking whether or not you've already seen each URL.
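A minimal sketch of that idea (the sample URLs are made up for illustration):

```clojure
;; Pouring values into a set de-duplicates them in one step,
;; with no need to track what has already been seen.
(def urls ["http://a.com" "http://b.com" "http://a.com"])

(def unique-urls (into #{} urls))
;; => #{"http://a.com" "http://b.com"}
```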
On Sunday, January 26, 2014 10:46:46 PM UTC-8, danneu wrote:
I use line-seq, split, and destructuring to parse large CSVs.
Here's how I'd approach what I think you're trying to do:
(with-open [rdr (io/reader (io/resource csv) :encoding "UTF-16")]
  (let [extract-url-hash (fn [line]
                           (let [[_ _ _ url _] (str/split line #",")]
                             [url (hash url)]))]
    (into #{} (map extract-url-hash (line-seq rdr)))))
Hi Jarrod
I have had success with the clojure-csv [1] library and processing large files
in a lazy way (as opposed to using slurp).
[1] - clojure-csv - https://github.com/davidsantiago/clojure-csv
Here is a copy of my source code (disclaimer - this is my first Clojure program,
so some things may not be idiomatic).
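The lazy clojure-csv approach can be sketched roughly like this (the function name,
file path handling, and row-processing callback are my own illustration, not the
attached source):

```clojure
(ns example.csv
  (:require [clojure.java.io :as io]
            [clojure-csv.core :as csv]))

;; parse-csv on a Reader yields a lazy sequence of row vectors,
;; so the whole file never has to fit in memory (unlike slurp).
(defn process-rows [path f]
  (with-open [rdr (io/reader path)]
    ;; doall forces the work while the reader is still open
    (doall (map f (csv/parse-csv rdr)))))
```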
On Monday, January 20, 2014 11:55:00 PM UTC-7, Jarrod Swart wrote:
I'm processing a large csv with Clojure, honestly not even that big (~18k
rows, 11mb). I have a list of exported data from a client and I am
de-duplicating URLs within the list. My final output is a series of
vectors: [url url-hash].
On 21/01/14 13:11, Chris Perkins wrote:
This part: (some #{hashed} already-seen) is doing a linear lookup in
`already-seen`. Try (contains? already-seen hashed) instead.
+1 to that as it will become faster...
I would also add the following, which is not so much about performance:
(drop 1 (line-seq f))
On Jan 21, 2014, at 07:11 , Chris Perkins chrisperkin...@gmail.com wrote:
This part: (some #{hashed} already-seen) is doing a linear lookup in
`already-seen`. Try (contains? already-seen hashed) instead.
Or just (already-seen hashed), given that OP's not trying to store nil hashes.
To OP:
Chris,
Thanks, this was in fact it. I had read that sets have near O(1) lookup,
but apparently I was not achieving this with (some). Thank you,
the execution time is about 25x faster now!
Jarrod
On Tuesday, January 21, 2014 8:11:09 AM UTC-5, Chris Perkins wrote:
Jim,
Thanks for the idioms, I appreciate it!
And thanks everyone for the help!
On Tuesday, January 21, 2014 8:43:40 AM UTC-5, Jim foo.bar wrote:
On 21/01/14 13:11, Chris Perkins wrote:
This part: (some #{hashed} already-seen) is doing a linear lookup in
`already-seen`. Try (contains? already-seen hashed) instead.
I'm processing a large csv with Clojure, honestly not even that big (~18k
rows, 11mb). I have a list of exported data from a client and I am
de-duplicating URLs within the list. My final output is a series of
vectors: [url url-hash].
The odd thing is how slow it seems to be going. I have
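The de-duplication step described above could be sketched like this (the
function name and use of Clojure's built-in hash are my own assumptions, not
the OP's actual code):

```clojure
;; Reduce over the URLs, keeping a set of hashes already seen and
;; accumulating one [url url-hash] vector per distinct URL.
(defn dedupe-urls [urls]
  (-> (reduce (fn [[seen out] url]
                (let [h (hash url)]
                  (if (contains? seen h)   ;; O(1) membership test
                    [seen out]
                    [(conj seen h) (conj out [url h])])))
              [#{} []]
              urls)
      second))
```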