Re: Working with big datasets, merging two ordered lists by key

2014-03-14 Thread Frank Behrens

I am still working on the solution (see gist) and want to share my current
thoughts.

The problem is to process a join over two big datasets (from different
sources).
Right now I am quite confident, as I break the problem into smaller parts,
and I am starting to see how easy this is in Clojure.
1) I have to bring both datasets (lists) into a nice form: [id {attributes}]
might be a good fit.
2) Because they are sorted and the id is unique (right now), my merge-sorted
function can pull the records from the lists, compare them with a function
(defaults to identity in the case above) and pair them up.
3) From the resulting list of pairs I can filter the records I am
interested in, and
4) do my processing over them.
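
Steps 1, 3 and 4 could be sketched roughly like this. The record shape (a map with an :id key), the ->normal-form helper and the sample-pairs data are assumptions made up for illustration; sample-pairs stands in for whatever the merge function from step 2 returns.

```clojure
;; Step 1: bring a record (assumed to be a map with an :id key)
;; into the normal form [id {attributes}].
(defn ->normal-form [record]
  [(:id record) (dissoc record :id)])

;; Hypothetical output of step 2: one [left right] pair per id,
;; with nil where an id is missing on one side.
(def sample-pairs
  [[[:id1 {:name "Frank"}] [:id1 {:age 38}]]
   [nil                    [:id2 {:age 27}]]
   [[:id3 {:name "Tim"}]   [:id3 {:age 18}]]])

;; Step 3: keep only the pairs of interest, here ids present on both sides.
(def matched-pairs
  (filter (fn [[l r]] (and l r)) sample-pairs))

;; Step 4: process each remaining pair, here by merging the attributes.
(def processed
  (map (fn [[[id attrs-l] [_ attrs-r]]] [id (merge attrs-l attrs-r)])
       matched-pairs))
```

Both filter and map return lazy seqs, so steps 3 and 4 only do work as the result is consumed.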

This approach seems simple and flexible to me, and it would be very useful
for different problems we have at our big enterprise.

I am close to putting the parts together, and will then see how this fits
in memory and whether it solves my current problem.

But in my newbie Clojure dreams, I could imagine getting this done in a lazy
fashion.

Can my ClojureCLR database query, sorted textfile, merging, filtering and
processing all be lazy?
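
For the merging part, a lazy variant might look like the sketch below. It is not the code from the gist; the name merge-sorted is borrowed from this message, while the argument order and the optional keyfn are assumptions. It expects both inputs to be sorted ascending by key, with unique keys.

```clojure
;; Lazily pair up two seqs that are each sorted ascending by (keyfn item).
;; Emits [a b] when the keys match, [a nil] or [nil b] otherwise.
(defn merge-sorted
  ([as bs] (merge-sorted identity as bs))
  ([keyfn as bs]
   (lazy-seq
     (let [as (seq as)
           bs (seq bs)]
       (cond
         (and (nil? as) (nil? bs)) nil
         (nil? as) (map (fn [b] [nil b]) bs)   ; only the right side is left
         (nil? bs) (map (fn [a] [a nil]) as)   ; only the left side is left
         :else
         (let [a (first as)
               b (first bs)
               c (compare (keyfn a) (keyfn b))]
           (cond
             (zero? c) (cons [a b] (merge-sorted keyfn (rest as) (rest bs)))
             (neg? c)  (cons [a nil] (merge-sorted keyfn (rest as) bs))
             :else     (cons [nil b] (merge-sorted keyfn as (rest bs))))))))))

(merge-sorted [1 2 3] [2 4])
;; => ([1 nil] [2 2] [3 nil] [nil 4])

;; Works on infinite inputs because nothing is computed until asked for.
(take 2 (merge-sorted (range) (iterate #(+ % 2) 0)))
;; => ([0 0] [1 nil])
```

Since filter and map are lazy as well, the stages after the merge can stay lazy. Whether the database query and the file reading are lazy depends on how those sources are read, and any reader or connection has to stay open until the result has been consumed.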


  

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Re: Working with big datasets, merging two ordered lists by key

2014-03-10 Thread Leif

Re. Tim's points below:

*i)* The seqs have to be ordered, or one of them has to be loaded fully 
into memory; I don't think there's any way around that.

*ii)* Frank's solution does *not* require the seqs to be the same length, 
and it gives you the complete 'diff' of the seqs (aka outer join), which 
could be handy.  The one snag I see is that it is eager, not lazy, so it's 
going to put the answer completely in memory.  So unless you are projecting 
out a small subset of the fields from each record, you will probably end up 
using as much memory as the other solutions.  I wrote a lazy version using 
'iterate', but I'm not sure it doesn't keep both entire seqs in memory, too.

My two cents:

1. If you have enough memory, go with Moritz' suggestion to read the 
smaller seq into a map.  Then you can do a simple for comprehension and 
arrange it so that the second, larger seq will never be completely in 
memory.
2. Another possible solution is to load the textfile into a temp table in 
your database.  Then the solution is one simple SQL query, backed by 
hyper-optimized code designed to deal with this exact problem.
3. You may want to try the naive approach: 400k records sounds like it 
could very well fit into memory, as long as each record doesn't have a huge 
amount of data.
4. A library that has tools to deal with big files: 
https://github.com/kyleburton/clj-etl-utils
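
Point 1 could look roughly like the sketch below. The sample data follows the [id {attributes}] shape used elsewhere in this thread; the var names are made up for illustration, and in a real run the larger dataset would come from a lazy source instead of an in-memory vector.

```clojure
;; Sample data in the [id {attributes}] shape.
(def small-dataset [[:id1 {:name "Frank"}] [:id3 {:name "Tim"}]])
(def large-dataset [[:id1 {:age 38}] [:id2 {:age 27}]
                    [:id3 {:age 18}] [:id4 {:age 60}]])

;; Read the smaller dataset fully into a map: id -> attributes.
(def small-by-id (into {} small-dataset))

;; Walk the larger dataset once; `for` is lazy, so each record is looked up
;; only as the result is consumed.
(def joined
  (for [[id large-attrs] large-dataset
        :let [small-attrs (get small-by-id id)]
        :when small-attrs]
    [id (merge small-attrs large-attrs)]))
;; joined => ([:id1 {:name "Frank", :age 38}] [:id3 {:name "Tim", :age 18}])
```

Each lookup is a hash-map access, so the larger dataset is traversed only once.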

--Leif


Re: Working with big datasets, merging two ordered lists by key

2014-03-10 Thread Timothy Washington

Hey Frank,

Right. So I tried this loop / recur, and it runs, giving a result of *([4
nil] [3 3] [2 nil] [1 1])*. But I'm not sure how that's going to help you
(although not discounting the possibility).

You can simultaneously iterate through pairs of lists, to compare values.
However you cannot guarantee that those lists will be *i)* ordered, and
*ii)* the same length. Both those conditions are required for your
algorithm to work. Plus, what you suggest still means that you'll have to
scan through the entire space of both results. So we're not going to avoid
that.
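
The ordering precondition could at least be checked before merging. A minimal sketch; strictly-sorted-by? is a hypothetical helper, not something from the code discussed here.

```clojure
;; True when the keys of coll are strictly ascending according to `compare`,
;; i.e. coll is sorted by key and contains no duplicate keys.
(defn strictly-sorted-by? [keyfn coll]
  (every? (fn [[a b]] (neg? (compare (keyfn a) (keyfn b))))
          (partition 2 1 coll)))

(strictly-sorted-by? first [[:id1 {:age 38}] [:id2 {:age 27}]])
;; => true
(strictly-sorted-by? identity [1 3 2])
;; => false
```

An unordered input can be ordered with sort-by, but that has to realize the whole collection in memory first.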

Based on your requirements, I still see my original *for* comprehension as
the most straightforward way to solve the problem. My second suggested
algorithm could also work. But I could be wrong and am always learning too.
So trying different solutions is a good habit to keep.

Hth

Tim Washington
Interruptsoftware.com 


Re: Working with big datasets, merging two ordered lists by key

2014-03-10 Thread Frank Behrens

Hey, just to share: I came up with this code, which seems quite OK to me.
It feels like I already understand something, do I?
Have a nice day, Frank

;; Walk two sorted lists in step and pair up equal values.
;; Note: conj on a list prepends, so the pairs come out in reverse order.
(loop
  [a   '(1 2 3 4)
   b   '(1 3)
   out ()]
  (cond
    (and (empty? a) (empty? b)) out
    (empty? a) (recur a (rest b) (conj out [nil (first b)]))
    (empty? b) (recur (rest a) b (conj out [(first a) nil]))
    :else
    (let [fa  (first a)
          fb  (first b)
          cmp (compare fa fb)]
      (cond
        (= 0 cmp) (recur (rest a) (rest b) (conj out [fa fb]))
        (> 0 cmp) (recur (rest a) b (conj out [fa nil]))
        :else     (recur a (rest b) (conj out [nil fb]))))))
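
The pairs come out in reverse order because conj adds to the front of a list. A minimal illustration (the var names are made up for the example):

```clojure
;; conj adds where it is cheapest for the collection type.
(def conj-on-list   (conj () 1 2 3))   ; list: new items go to the front
(def conj-on-vector (conj [] 1 2 3))   ; vector: new items go to the end

conj-on-list    ;; => (3 2 1)
conj-on-vector  ;; => [1 2 3]
```

Starting the loop with out bound to [] instead of (), or calling reverse on the final result, would be two ways to get the pairs in ascending order.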



Re: Working with big datasets, merging two ordered lists by key

2014-03-10 Thread Frank Behrens

Thanks for your suggestions. 
A for loop has to do 100,000 * 300,000 compares.
Storing the database table in a 300,000-element hash would be a memory
penalty I want to avoid.
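
For a sense of scale, the arithmetic behind that concern, using the sizes quoted in this message (the var names are made up for the example):

```clojure
;; Sizes as quoted in this message.
(def textfile-size 100000)
(def database-size 300000)

;; A nested `for` over both inputs compares every record with every record.
(def nested-compares (* textfile-size database-size))
;; => 30000000000

;; A merge of two sorted inputs advances at least one input per step,
;; so it needs at most this many steps.
(def merge-steps-max (+ textfile-size database-size))
;; => 400000
```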

I'm quite sure that an essential part of the solution is a function to
iterate through both lists at once,
spitting out pairs of values according to compare:

(merge-sortedlists
  '(1 2 3)
  '(  2   4))
=> ([1 nil] [2 2] [3 nil] [nil 4])

Seems quite doable.
I will try to implement it now.

Frank



Re: Working with big datasets, merging two ordered lists by key

2014-03-09 Thread Timothy Washington

Hmm, the *for* comprehension yields a lazy sequence of results. So the
penalty should only occur when one starts to use / evaluate the result.
Using maps is a good idea. But I think you'll have to use another algorithm
(not *for*) to get the random access you seek.

Frank could try a *clojure.set/intersection* to find common keys between
the lists. Then *order* and *map* / *merge* the 2 lists.

Beyond that, I can't see a scenario where some iteration won't have to
search the space for matching keys (which I think
*clojure.set/intersection* does).
A fair point all the same.
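
A rough sketch of the clojure.set/intersection idea. It assumes both datasets are already held in memory as maps from id to attributes; the var names and sample data are made up for illustration.

```clojure
(require '[clojure.set :as set])

;; Both datasets as id -> attributes maps (assumed shape).
(def textfile-users {:id1 {:name "Frank"} :id3 {:name "Tim"}})
(def database-users {:id1 {:age 38} :id2 {:age 27}
                     :id3 {:age 18} :id4 {:age 60}})

;; Ids present in both datasets.
(def common-ids
  (set/intersection (set (keys textfile-users))
                    (set (keys database-users))))

;; Order the common ids, then merge the attributes for each one.
(def merged-common
  (map (fn [id] [id (merge (textfile-users id) (database-users id))])
       (sort common-ids)))
;; merged-common => ([:id1 {:name "Frank", :age 38}] [:id3 {:name "Tim", :age 18}])
```

The trade-off is that both key sets, and in this form both datasets, have to fit in memory.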


Tim Washington
Interruptsoftware.com 


Re: Working with big datasets, merging two ordered lists by key

2014-03-09 Thread Moritz Ulrich

I think it would be more efficient to read one of the inputs into a
map for random access instead of iterating it every time.


Re: Working with big datasets, merging two ordered lists by key

2014-03-09 Thread Timothy Washington

Hey Frank,

Try opening up a repl, and running this *for* comprehension.

(def user_textfile [[:id1 {:name "Frank"}] [:id3 {:name "Tim"}]])
(def user_database [[:id1 {:age 38}] [:id2 {:age 27}] [:id3 {:age 18}]
                    [:id4 {:age 60}]])

(for [i user_textfile
      j user_database
      :when (= (first i) (first j))]
  {(first i) (merge (second i) (second j))})

({:id1 {:age 38, :name "Frank"}} {:id3 {:age 18, :name "Tim"}})  ;; result from repl



Hth

Tim Washington
Interruptsoftware.com 


On Sun, Mar 9, 2014 at 5:33 AM, Frank Behrens  wrote:

> Hi,
>
> I'm investigating whether Clojure can solve the challenges and problems we
> have at my day job better than Ruby or PowerShell. A very common use case
> is validating data from different systems against some criteria. I believe
> Clojure can be our silver bullet, but first I need to wrap my head around
> it.
>
> So I am starting at the first level with the challenge of validating some
> data from the user database against our Active Directory.
>
> I already have all the parts to make it work: make a hash by user_id from
> the database table, export a textfile from AD with each line representing
> a user, parse it, merge in the information from the user_table_hash, and
> voilà.
>
> I have not finished implementing this, so I don't know whether this naive
> approach will work with 400,000 records in the user database and 100,000
> in the textfile.
> But I am already thinking about how I could implement this in a more
> memory-efficient way.
>
> So my simple question:
>
> I have user_textfile (100,000 records), which can be parsed into an
> unordered list of user-maps.
> I have user_table in the database (400,000 records), which I can query
> with an ordering, giving me an ordered list of user-maps.
>
> So I would first order the user_textfile and then conj the user_table
> ordered list into it while doing the database query.
> Is that approach right? How would I then merge the two ordered lists, as
> in the example below?
>
> (def user_textfile
>   [[:id1 {:name "Frank"}]
>    [:id3 {:name "Tim"}]])
>
> (def user_database
>   [[:id1 {:age 38}]
>    [:id2 {:age 27}]
>    [:id3 {:age 18}]
>    [:id4 {:age 60}]])
>
> (merge-sorted-lists user_database user_textfile)
> =>
>   ([:id1 {:name "Frank" :age 38}]
>    [:id3 {:name "Tim"   :age 18}])
>
> Any feedback is appreciated.
> Have a nice day,
> Frank
