Re: Working with big datasets, merging two ordered lists by key

Frank Behrens Mon, 10 Mar 2014 01:53:24 -0700

Hey, just to share: I came up with this code, which seems quite OK to me.
It feels like I already understand something, or do I?
Have a nice day, Frank


(loop
  [a   '(1 2 3 4)
   b   '(1 3)
   out []]   ; a vector, so conj appends and the pairs stay in input order
  (cond
    (and (empty? a) (empty? b)) out
    (empty? a) (recur a (rest b) (conj out [nil (first b)]))
    (empty? b) (recur (rest a) b (conj out [(first a) nil]))
    :else
    (let [fa  (first a)
          fb  (first b)
          cmp (compare fa fb)]
      (cond
        (zero? cmp) (recur (rest a) (rest b) (conj out [fa fb]))   ; same key in both
        (neg? cmp)  (recur (rest a) b        (conj out [fa nil]))  ; fa only in a
        :else       (recur a        (rest b) (conj out [nil fb]))))))  ; fb only in b
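
The same lockstep walk can also be written as a reusable lazy function, so
that the pairs are produced on demand instead of being collected into one
result. This is only a sketch under assumptions: the name merge-sorted-lists
is taken from the thread, but the optional keyfn argument, the dash-named
vars and the name joined are made up here for illustration, and both inputs
are assumed to be already sorted by the same key.

```clojure
(defn merge-sorted-lists
  "Walks two sequences that are sorted by (keyfn item) in lockstep and
  lazily returns pairs [x y]. For an item that has no partner with the
  same key, the other side of the pair is nil."
  ([a b] (merge-sorted-lists identity a b))
  ([keyfn a b]
   (lazy-seq
     (let [a (seq a)   ; seq returns nil for an empty collection
           b (seq b)]
       (cond
         (and (nil? a) (nil? b)) nil
         (nil? a) (cons [nil (first b)] (merge-sorted-lists keyfn a (rest b)))
         (nil? b) (cons [(first a) nil] (merge-sorted-lists keyfn (rest a) b))
         :else
         (let [fa  (first a)
               fb  (first b)
               cmp (compare (keyfn fa) (keyfn fb))]
           (cond
             (zero? cmp) (cons [fa fb]  (merge-sorted-lists keyfn (rest a) (rest b)))
             (neg? cmp)  (cons [fa nil] (merge-sorted-lists keyfn (rest a) b))
             :else       (cons [nil fb] (merge-sorted-lists keyfn a (rest b))))))))))

(merge-sorted-lists '(1 2 3) '(2 4))
;; => ([1 nil] [2 2] [3 nil] [nil 4])

;; Hypothetical sample data in the [id user-map] shape used in the thread.
(def user-textfile [[:id1 {:name "Frank"}] [:id3 {:name "Tim"}]])
(def user-database [[:id1 {:age 38}] [:id2 {:age 27}] [:id3 {:age 18}] [:id4 {:age 60}]])

;; Keep only the ids present on both sides and merge their maps.
(def joined
  (for [[[id m1] [_ m2]] (merge-sorted-lists first user-database user-textfile)
        :when (and m1 m2)]
    [id (merge m1 m2)]))
;; => ([:id1 {:age 38, :name "Frank"}] [:id3 {:age 18, :name "Tim"}])
```

Each element is visited once, so the walk costs roughly the sum of the two
lengths rather than their product, and neither list has to be loaded into a
hash first. Very long runs of recursive lazy calls are fine here because
lazy-seq defers each step until it is consumed.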


On Monday, 10 March 2014 09:26:14 UTC+1, Frank Behrens wrote:
>
> Thanks for your suggestions.
> A for loop has to do 100.000 * 300.000 comparisons.
> Storing the database table in a 300.000-element hash would be a memory
> penalty I want to avoid.
>
> I'm quite sure that an essential part of the solution is a function that
> iterates through both lists at once,
> emitting pairs of values according to compare.
>
> (merge-sortedlists 
>   '(1 2 3)
>   '(   2    4))
> => ([1 nil] [2 2] [3 nil] [nil 4])
>
> Seems quite doable.
> I'll try to implement it now.
>
> Frank
>
>
> On Monday, 10 March 2014 01:23:57 UTC+1, frye wrote:
>>
>> Hmm, the *for* comprehension yields a lazy sequence of results. So the 
>> penalty should only occur when one starts to use / evaluate the result. 
>> Using maps is a good idea. But I think you'll have to use another algorithm 
>> (not *for*) to get the random access you seek. 
>>
>> Frank could try *clojure.set/intersection* to find the common keys between
>> the lists, then *order* and *map* / *merge* the 2 lists.
>>
>> Beyond that, I can't see a scenario where some iteration won't have to 
>> search the space for matching keys (which I think 
>> *clojure.set/intersection* does). A fair point all the same. 
>>
>>
>> Tim Washington 
>> Interruptsoftware.com <http://interruptsoftware.com> 
>>
>>
>> On Sun, Mar 9, 2014 at 12:13 PM, Moritz Ulrich <mor...@tarn-vedra.de> wrote:
>>
>>> I think it would be more efficient to read one of the inputs into a
>>> map for random access instead of iterating it every time.
>>>
>>> On Sun, Mar 9, 2014 at 4:48 PM, Timothy Washington <twas...@gmail.com> wrote:
>>> > Hey Frank,
>>> >
>>> > Try opening up a repl, and running this for comprehension.
>>> >
>>> > (def user_textfile [[:id1 {:name "Frank"}] [:id3 {:name "Tim"}]])
>>> > (def user_database [[:id1 {:age 38}] [:id2 {:age 27}] [:id3 {:age 18}]
>>> >                     [:id4 {:age 60}]])
>>> >
>>> > (for [i user_textfile
>>> >       j user_database
>>> >       :when (= (first i) (first j))]
>>> >   {(first i) (merge (second i) (second j))})
>>> >
>>> > ;; result from repl:
>>> > ({:id1 {:age 38, :name "Frank"}} {:id3 {:age 18, :name "Tim"}})
>>> >
>>> >
>>> >
>>> > Hth
>>> >
>>> > Tim Washington
>>> > Interruptsoftware.com
>>> >
>>> >
>>> > On Sun, Mar 9, 2014 at 5:33 AM, Frank Behrens <fbeh...@gmail.com> wrote:
>>> >>
>>> >> Hi,
>>> >>
>>> >> I'm investigating whether Clojure can be used to solve the challenges and
>>> >> problems we have at my day job better than Ruby or PowerShell. A very
>>> >> common use case is validating data from different systems against some
>>> >> criteria. I believe Clojure can be our silver bullet, but before that, it
>>> >> seems I have to wrap my head around it.
>>> >>
>>> >> So I am starting at the first level with the challenge of validating some
>>> >> data from the user database against our Active Directory.
>>> >>
>>> >> I already have all the parts to make it work: make a hash by user_id
>>> >> from the database table, export a text file from AD with each line
>>> >> representing a user, parse it, merge in the information from the
>>> >> user_table_hash, and voila.
>>> >>
>>> >> I have not finished implementing this, so I don't know if this naive
>>> >> approach will work with 400.000 records in the user database and 100.000
>>> >> in the text file.
>>> >> But I am already thinking about how I could implement this in a more
>>> >> memory-efficient way.
>>> >>
>>> >> So my simple question:
>>> >>
>>> >> I have user_textfile (100.000 records), which can be parsed into an
>>> >> unordered list of user-maps.
>>> >> I have user_table in the database (400.000 records), which I can query
>>> >> with ordering and which gives me an ordered list of user-maps.
>>> >>
>>> >> So I would first order the user_textfile and then conj the user_table
>>> >> ordered list into it while doing the database query.
>>> >> Is that approach right? How would I then merge the two ordered lists
>>> >> like in the example below?
>>> >>
>>> >> (def user_textfile
>>> >>   [[:id1 {:name "Frank"}]
>>> >>    [:id3 {:name "Tim"}]])
>>> >>
>>> >> (def user_database
>>> >>   [[:id1 {:age 38}]
>>> >>    [:id2 {:age 27}]
>>> >>    [:id3 {:age 18}]
>>> >>    [:id4 {:age 60}]])
>>> >>
>>> >> (merge-sorted-lists user_database user_textfile)
>>> >> =>
>>> >>   ([:id1 {:name "Frank" :age 38}]
>>> >>    [:id3 {:name "Tim"   :age 18}])
>>> >>
>>> >> Any feedback is appreciated.
>>> >> Have a nice day,
>>> >> Frank
>>> >>
>>> >> --
>>> >> You received this message because you are subscribed to the Google
>>> >> Groups "Clojure" group.
>>> >> To post to this group, send email to clo...@googlegroups.com
>>> >> Note that posts from new members are moderated - please be patient 
>>> with
>>> >> your first post.
>>> >> To unsubscribe from this group, send email to
>>> >> clojure+u...@googlegroups.com
>>> >> For more options, visit this group at
>>> >> http://groups.google.com/group/clojure?hl=en
>>> >> ---
>>> >> You received this message because you are subscribed to the Google 
>>> Groups
>>> >> "Clojure" group.
>>> >> To unsubscribe from this group and stop receiving emails from it, 
>>> send an
>>> >> email to clojure+u...@googlegroups.com.
>>> >> For more options, visit https://groups.google.com/d/optout.
>>> >
>>> >
>>>
>>>

