Re: Working with big datasets, merging two ordered lists by key

2014-03-14 Thread Frank Behrens
I am still working on the solution (see gist) and want to share my current thoughts. The problem is to process a join of two big datasets (from different sources). Right now I am quite confident, as I am breaking the problem into smaller parts, and I am starting

Re: Working with big datasets, merging two ordered lists by key

2014-03-10 Thread Leif
Re. Tim's points below: *i)* The seqs have to be ordered, or one of them has to be loaded fully into memory; I don't think there's any way around that. *ii)* Frank's solution does *not* require the seqs to be the same length, and it gives you the complete 'diff' of the seqs (aka outer join), w
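
Leif's two points (ordered inputs, and a full outer join over sequences of unequal length) can be illustrated with a lazy merge. This is only a sketch under assumptions, not code from the thread: the name `outer-join-sorted` and the `keyfn` parameter are hypothetical, and both inputs are assumed to be sorted ascending by key with no duplicate keys.

```clojure
(defn outer-join-sorted
  "Lazily pairs up two sequences sorted ascending by (keyfn item).
  Emits [x y] on a key match, [x nil] or [nil y] otherwise.
  Assumes unique keys within each input (illustrative sketch)."
  [keyfn xs ys]
  (lazy-seq
    (let [xs (seq xs), ys (seq ys)]
      (cond
        (and (nil? xs) (nil? ys)) nil
        ;; one side exhausted: emit the remainder of the other side
        (nil? xs) (cons [nil (first ys)] (outer-join-sorted keyfn nil (rest ys)))
        (nil? ys) (cons [(first xs) nil] (outer-join-sorted keyfn (rest xs) nil))
        :else
        (let [x (first xs), y (first ys)
              c (compare (keyfn x) (keyfn y))]
          (cond
            ;; advance only the side holding the smaller key
            (neg? c) (cons [x nil] (outer-join-sorted keyfn (rest xs) ys))
            (pos? c) (cons [nil y] (outer-join-sorted keyfn xs (rest ys)))
            :else    (cons [x y] (outer-join-sorted keyfn (rest xs) (rest ys)))))))))

(outer-join-sorted first [[1 :a] [2 :b] [4 :d]] [[2 :x] [3 :y]])
;; => ([[1 :a] nil] [[2 :b] [2 :x]] [nil [3 :y]] [[4 :d] nil])
```

Because the result is a lazy sequence, neither input has to be held in memory as a whole, provided the caller does not retain the head of the result.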

Re: Working with big datasets, merging two ordered lists by key

2014-03-10 Thread Timothy Washington
Hey Frank, Right. So I tried this loop / recur, and it runs, giving a result of *([4 nil] [3 3] [2 nil] [1 1])*. But I'm not sure how that's going to help you (although not discounting the possibility). You can simultaneously iterate through pairs of lists, to compare values. However you cannot g

Re: Working with big datasets, merging two ordered lists by key

2014-03-10 Thread Frank Behrens
Hey, just to share: I came up with this code, which seems quite OK to me. It feels like I already understand something, do I? Have a nice day, Frank (loop [a '(1 2 3 4) b '(1 3) out ()] (cond (and (empty? a)(empty? b)) out (empty? a) (recur a (rest b) (conj out [nil
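
The preview of Frank's code is cut off, so the following is not his code but a minimal sketch of the same loop / recur idea, assuming both inputs are sorted ascending and contain no duplicates. The function name `merge-sorted` is hypothetical. It accumulates into a vector so that the output keeps the input order; accumulating with `conj` onto a list, as the visible part of Frank's code does, would presumably give the pairs in reverse.

```clojure
(defn merge-sorted
  "Walks two ascending sequences in one pass and pairs up equal values.
  A value present on one side only is paired with nil (illustrative sketch)."
  [xs ys]
  (loop [a xs, b ys, out []]
    (cond
      (and (empty? a) (empty? b)) out
      ;; one side exhausted: drain the other
      (empty? a) (recur a (rest b) (conj out [nil (first b)]))
      (empty? b) (recur (rest a) b (conj out [(first a) nil]))
      :else
      (let [x (first a), y (first b), c (compare x y)]
        (cond
          (zero? c) (recur (rest a) (rest b) (conj out [x y]))
          ;; advance only the side holding the smaller value
          (neg? c)  (recur (rest a) b (conj out [x nil]))
          :else     (recur a (rest b) (conj out [nil y])))))))

(merge-sorted '(1 2 3 4) '(1 3))
;; => [[1 1] [2 nil] [3 3] [4 nil]]
```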

Re: Working with big datasets, merging two ordered lists by key

2014-03-10 Thread Frank Behrens
Thanks for your suggestions. A for loop has to do 100,000 * 300,000 comparisons. Storing the database table in a 300,000-element hash would be a memory penalty I want to avoid. I'm quite sure that an essential part of the solution is a function to iterate through both lists at once, spitting out p
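
To put the numbers above side by side: a nested loop compares every row of one dataset with every row of the other, while a single pass over two sorted inputs advances at least one side per step. The arithmetic below is a rough sketch; the var names are hypothetical, and the 100,000 figure is assumed to be the size of the non-database input.

```clojure
(def other-rows    100000) ; assumed size of the second dataset
(def database-rows 300000) ; size of the database table, as stated above

;; Nested loop: every row is compared against every row.
(def nested-loop-comparisons (* other-rows database-rows))
;; => 30000000000

;; Single pass over two sorted inputs: each step consumes at least one row,
;; so the number of steps is bounded by the sum of the sizes.
(def merge-steps-upper-bound (+ other-rows database-rows))
;; => 400000
```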

Re: Working with big datasets, merging two ordered lists by key

2014-03-09 Thread Timothy Washington
Hmm, the *for* comprehension yields a lazy sequence of results. So the penalty should only occur when one starts to use / evaluate the result. Using maps is a good idea. But I think you'll have to use another algorithm (not *for*) to get the random access you seek. Frank could try a *clojure.set/i

Re: Working with big datasets, merging two ordered lists by key

2014-03-09 Thread Moritz Ulrich
I think it would be more efficient to read one of the inputs into a map for random access instead of iterating it every time. On Sun, Mar 9, 2014 at 4:48 PM, Timothy Washington wrote: > Hey Frank, > > Try opening up a repl, and running this for comprehension. > > (def user_textfile [[:id1 {:name
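
Moritz's suggestion might look roughly like the sketch below, which reuses the sample data from Timothy's quoted message (names written as strings here, the hyphenated var names and `db-by-id` being hypothetical). One input is loaded into a map once, so each lookup is a hash access rather than a scan; the price is holding that input in memory.

```clojure
(def user-textfile [[:id1 {:name "Frank"}] [:id3 {:name "Tim"}]])
(def user-database [[:id1 {:age 38}] [:id2 {:age 27}]
                    [:id3 {:age 18}] [:id4 {:age 60}]])

;; Build the lookup table once: id -> attributes.
(def db-by-id (into {} user-database))

;; One map lookup per text-file row instead of one scan per row.
(def joined
  (for [[id attrs] user-textfile
        :let [row (get db-by-id id)]
        :when row]
    [id (merge attrs row)]))

joined
;; => ([:id1 {:name "Frank", :age 38}] [:id3 {:name "Tim", :age 18}])
```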

Re: Working with big datasets, merging two ordered lists by key

2014-03-09 Thread Timothy Washington
Hey Frank, Try opening up a repl, and running this *for* comprehension. (def user_textfile [[:id1 {:name "Frank"}] [:id3 {:name "Tim"}]]) (def user_database [[:id1 {:age 38}] [:id2 {:age 27}] [:id3 {:age 18}] [:id4 {:age 60}]]) (for [i user_textfile j user_database :when (= (firs
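
The preview cuts the comprehension off, so the rest of Timothy's code is not known. One plausible way to complete such a nested *for* join, offered only as a sketch, is shown below: the body that merges the two attribute maps and the var name `joined-nested` are assumptions, and the names are written as strings.

```clojure
(def user_textfile [[:id1 {:name "Frank"}] [:id3 {:name "Tim"}]])
(def user_database [[:id1 {:age 38}] [:id2 {:age 27}]
                    [:id3 {:age 18}] [:id4 {:age 60}]])

;; Nested iteration: every text-file row is tested against every database row,
;; and only the pairs with equal ids are kept.
(def joined-nested
  (for [i user_textfile
        j user_database
        :when (= (first i) (first j))]
    [(first i) (merge (second i) (second j))]))

joined-nested
;; => ([:id1 {:name "Frank", :age 38}] [:id3 {:name "Tim", :age 18}])
```

This is an inner join: ids present in only one input (`:id2` and `:id4` here) do not appear in the result.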

Working with big datasets, merging two ordered lists by key

2014-03-09 Thread Frank Behrens
Hi, I'm investigating whether Clojure can be used to solve the challenges and problems we have at my day job better than Ruby or PowerShell. A very common use case is validating data from different systems against some criteria. I believe Clojure can be our silver bullet, but before that, it seems