Re: [Large File Processing] What am I doing wrong?

2014-01-27 Thread Curtis Gagliardi
If ordering isn't important, I'd just dump them all into a set instead of manually checking whether or not you already put the URL into a set.
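Curtis's suggestion can be sketched like this; the `parse-url` helper and the assumption that the URL is the first column are illustrative, not from the thread:

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: assumes the URL is the first comma-separated field.
(defn parse-url [line]
  (first (str/split line #",")))

;; Pour every URL into a set; the set discards duplicates for you,
;; so there is no manual "have I seen this?" check at all.
(defn unique-urls [lines]
  (into #{} (map parse-url) lines))

(unique-urls ["http://a.com,1" "http://b.com,2" "http://a.com,3"])
;; => #{"http://a.com" "http://b.com"}
```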

Re: [Large File Processing] What am I doing wrong?

2014-01-26 Thread danneu
I use line-seq, split, and destructuring to parse large CSVs. Here's how I'd approach what I think you're trying to do: (with-open [rdr (io/reader (io/resource csv) :encoding "UTF-16")] (let [extract-url-hash (fn [line] (let [[_ _ _ url _] (str/split
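The snippet above is cut off by the digest; a minimal reconstruction of the same line-seq/split/destructuring approach might look like the following. The column layout (URL in the fourth field), the delimiter, and the file path are assumptions:

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Assumes the URL sits in the fourth comma-separated column.
(defn extract-url [line]
  (let [[_ _ _ url _] (str/split line #",")]
    url))

;; Stream the file line by line and collect unique URLs into a set.
;; `into` is eager, so the work finishes before with-open closes the reader.
(defn dedupe-urls [path]
  (with-open [rdr (io/reader path :encoding "UTF-16")]
    (->> (line-seq rdr)
         (map extract-url)
         (into #{}))))
```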

Re: [Large File Processing] What am I doing wrong?

2014-01-21 Thread Rudi Engelbrecht
Hi Jarrod I have had success with the clojure-csv [1] library and processing large files in a lazy way (as opposed to using slurp). [1] - clojure-csv - https://github.com/davidsantiago/clojure-csv Here is a copy of my source code (disclaimer - this is my first Clojure program - so some things
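A hedged sketch of the lazy pattern Rudi describes, using the clojure-csv library he links: `parse-csv` from `clojure-csv.core` accepts a Reader and yields a lazy sequence of row vectors, so the file is never slurped whole. The `row-fn` parameter is a placeholder for whatever per-row work you need:

```clojure
(require '[clojure.java.io :as io]
         '[clojure-csv.core :refer [parse-csv]])

;; Process a CSV lazily: parse-csv on a Reader parses rows on demand.
;; doall forces realization while the reader is still open.
(defn process-csv [path row-fn]
  (with-open [rdr (io/reader path)]
    (doall (map row-fn (parse-csv rdr)))))
```

Note the doall: the lazy sequence must be fully realized before with-open closes the reader, or you'll hit a "stream closed" error later.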

Re: [Large File Processing] What am I doing wrong?

2014-01-21 Thread Chris Perkins
On Monday, January 20, 2014 11:55:00 PM UTC-7, Jarrod Swart wrote: I'm processing a large csv with Clojure, honestly not even that big (~18k rows, 11mb). I have a list of exported data from a client and I am de-duplicating URLs within the list. My final output is a series of vectors:

Re: [Large File Processing] What am I doing wrong?

2014-01-21 Thread Jim - FooBar();
On 21/01/14 13:11, Chris Perkins wrote: This part: (some #{hashed} already-seen) is doing a linear lookup in `already-seen`. Try (contains? already-seen hashed) instead. +1 to that, as it will become faster... I would also add the following, not so related to performance: (drop 1 (line-seq f))
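Chris's point, side by side; `already-seen` here is a small stand-in set:

```clojure
(def already-seen #{"abc" "def"})

;; Linear: treats #{hashed} as a predicate and scans already-seen
;; element by element until something matches.
(some #{"def"} already-seen)       ;; => "def"

;; Constant-time: a direct hash lookup in the set.
(contains? already-seen "def")     ;; => true
```

With ~18k rows the linear scan turns the whole dedupe into O(n²), which is consistent with the 25x speedup Jarrod reports elsewhere in the thread.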

Re: [Large File Processing] What am I doing wrong?

2014-01-21 Thread Michael Gardner
On Jan 21, 2014, at 07:11 , Chris Perkins chrisperkin...@gmail.com wrote: This part: (some #{hashed} already-seen) is doing a linear lookup in `already-seen`. Try (contains? already-seen hashed) instead. Or just (already-seen hashed), given that OP's not trying to store nil hashes. To OP:
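Michael's shorthand works because a Clojure set is itself a function of its elements, returning the member (truthy) or nil; it only misbehaves if nil or false is actually stored in the set, which is why he qualifies it:

```clojure
(def already-seen #{"abc" "def"})

;; A set called as a function returns the member or nil.
(already-seen "abc")  ;; => "abc"
(already-seen "zzz")  ;; => nil
```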

Re: [Large File Processing] What am I doing wrong?

2014-01-21 Thread Jarrod Swart
Chris, Thanks, this was in fact it. I had read that sets have near O(1) lookup, but apparently I was not achieving this with (some). Thank you, the execution time is about 25x faster now! Jarrod

Re: [Large File Processing] What am I doing wrong?

2014-01-21 Thread Jarrod Swart
Jim, Thanks for the idioms, I appreciate it! And thanks everyone for the help!

[Large File Processing] What am I doing wrong?

2014-01-20 Thread Jarrod Swart
I'm processing a large CSV with Clojure; honestly it's not even that big (~18k rows, 11 MB). I have a list of exported data from a client and I am de-duplicating URLs within the list. My final output is a series of vectors: [url url-hash]. The odd thing is how slow it seems to be going. I have
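A small sketch of the stated goal, one [url url-hash] vector per unique URL. The choice of MD5 for url-hash is an assumption for illustration; the thread never names the hash:

```clojure
(import '[java.security MessageDigest])

;; Hex-encoded MD5 of a string (assumed hash; swap in whatever you use).
(defn md5 [^String s]
  (let [d (.digest (MessageDigest/getInstance "MD5") (.getBytes s "UTF-8"))]
    (apply str (map #(format "%02x" (bit-and % 0xff)) d))))

;; Produce one [url hash] vector per unique URL.
(defn url-hash-pairs [urls]
  (mapv (fn [u] [u (md5 u)]) (distinct urls)))

(url-hash-pairs ["http://a.com" "http://b.com" "http://a.com"])
```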