On Sun, Nov 8, 2009 at 7:33 AM, Michael Jaaka
<michael.ja...@googlemail.com> wrote:

> Hi! How would you solve such problem:
>
> I have a collection of pairs (key, value) ->
> [  [ "tom" 32 ] [ "tom" 2333 ] [ "anne" 12 ] [ "anne" 55 ] ]
>
> As you can see, keys can occur more than once; also, that collection is very
> large, so it should be evaluated lazily (for example, the collection comes
> from stream input).
>
> I have a function with two arguments, a key and a sequence of its values; the
> pseudo code would be:
>
> callbackListener(string key, iterator values)
>
> Now I would like the effect that callbackListener is called twice for the
> example collection:
>
> once with "tom" and an iterator (or a lazy seq) over the lazily evaluated
> collection (32 2333),
> and a second time with "anne" and an iterator over the collection (12 55).
>

Something related to

(let [ks (distinct (map first s))          ; s: the seq of [key value] pairs
      vals-for-key (fn [k] (map second (filter #(= (first %) k) s)))]
  (doseq [k ks] (callback k (vals-for-key k))))

or, to collect and return the callbacks' return values, use "for" instead of
"doseq".
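To make that sketch concrete, here it is wrapped in a function (call-per-key is
just a name I made up) and run on the sample pairs from the question:

```clojure
(def s [["tom" 32] ["tom" 2333] ["anne" 12] ["anne" 55]])

(defn call-per-key
  "Calls (callback k vals) once per distinct key in s, where vals is a
  lazy seq of the values paired with k, in order of appearance."
  [callback s]
  (let [ks (distinct (map first s))
        vals-for-key (fn [k] (map second (filter #(= (first %) k) s)))]
    (doseq [k ks] (callback k (vals-for-key k)))))

;; prints "tom (32 2333)" then "anne (12 55)"
(call-per-key (fn [k vs] (println k (doall vs))) s)
```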

The problem is that this will realize s and hold onto its head, as Meikel
points out. To make it really lazy you need something more like

(defn do-it [callback producer]
  (let [ks (fn [p] (distinct (map first (p))))
        vals-for-key (fn [k p] (map second (filter #(= (first %) k) (p))))]
    (doseq [k (ks producer)]
      (callback k (vals-for-key k producer)))))

and call this with two arguments: first, the callback, which takes a key and a
lazy seq of the associated values; second, a function that produces anew a lazy
sequence of key-value pairs from your disk file each time it's called, with the
property that the lazy-seq step that returns nil when the sequence is exhausted
also has the side effect of closing the input stream, so as not to leak file
handles.
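A producer of that shape might look like the following sketch (pair-producer is
a made-up name, and parse-line is a caller-supplied function turning one line
of the file into a [key value] pair — both are assumptions, not anything from
the original post):

```clojure
(require '[clojure.java.io :as io])

(defn pair-producer
  "Returns a no-arg fn. Each call opens `file` anew and yields a lazy seq
  of [key value] pairs, one per line, closing the reader as a side effect
  when the sequence is exhausted."
  [file parse-line]
  (fn []
    (let [rdr  (io/reader file)
          step (fn step []
                 (lazy-seq
                   (if-let [line (.readLine rdr)]
                     (cons (parse-line line) (step))
                     (do (.close rdr) nil))))] ; close on exhaustion
      (step))))
```

Note the caveat this pattern carries: the reader is only closed if the sequence
is realized to the end, which is exactly why side-effecting lazy seqs deserve
the discomfort mentioned below.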

This will grovel over the file N+1 times where N is the number of distinct
keys, though, and will have in general two input streams open on the file at
the same time, one of them crawling along discovering more distinct keys,
and the other finding all the values for one key, then for the next, etc.

It will also build up a hashset of all of the keys over time, hidden inside
the "distinct" function's returned seq. So, if the summed lengths of the
distinct keys is going to be big enough to blow the heap all on its own,
you've got real problems and you'll probably need to use a scratch disk
file. In that case I'd recommend a more imperative style with loop/recur and
lazy-seq. The outer loop starts with a copy of the input file, runs the inner
loop, then deletes its input and reruns itself on the file generated by the
inner loop (hence starting with a *copy* of the input file). The inner loop
reads the first key from the current input file, creates an empty temp file to
be the current output file, then calls the callback with a lazy seq object
whose realization actually advances through the input file, copying entries
that do not match the current key to the output file and producing lazy seq
elements for entries that do match.

The idea is that if your input file foo.dat reads [[x 1] [y 2] [x 3] [z 4]
[y 5]] it will make a copy foo1.tmp, then call the callback with x and a
lazy seq whose realization emits (1 3) while also generating foo2.tmp = [[y
2] [z 4] [y 5]]; then the outer loop deletes foo1.tmp and repeats on
foo2.tmp and the callback is called with y and a lazy seq whose realization
emits (2 5) while generating foo3.tmp = [[z 4]]; and then the outer loop
deletes foo2.tmp and the callback is called with z and a lazy seq (4) whose
realization creates an empty foo4.tmp; then the outer loop deletes foo3.tmp,
sees foo4.tmp is empty, and deletes foo4.tmp and terminates.
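The shape of that scheme can be sketched in memory, with vectors of pairs
standing in for the temp files (consume-by-key is a hypothetical name; the
real thing would replace the vector operations with file reads and writes):

```clojure
(defn consume-by-key
  "In-memory sketch of the temp-file scheme: each outer pass takes the
  first key in the current 'file', hands the callback a lazy seq of that
  key's values, and builds the next 'file' from the non-matching entries."
  [callback pairs]
  (loop [input (vec pairs)]
    (when (seq input)
      (let [[k _]     (first input)
            vals      (map second (filter #(= (first %) k) input))
            next-file (vec (remove #(= (first %) k) input))]
        (callback k vals)
        (recur next-file)))))
```

On [[x 1] [y 2] [x 3] [z 4] [y 5]] this calls the callback with x and (1 3),
then y and (2 5), then z and (4), matching the foo1.tmp/foo2.tmp/foo3.tmp
walkthrough above.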

If you want the callback's return values, you'll need to make the outer loop
another lazy-seq.
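That lazy-outer-loop variant might look like this sketch (results-by-key is a
made-up name; again vectors stand in for the disk files):

```clojure
(defn results-by-key
  "Lazy variant: returns a lazy seq of (callback k vals) results, one per
  distinct key, where each element is realized only as the seq is consumed."
  [callback pairs]
  (lazy-seq
    (when (seq pairs)
      (let [[k _]     (first pairs)
            vals      (map second (filter #(= (first %) k) pairs))
            remaining (remove #(= (first %) k) pairs)]
        (cons (callback k vals)
              (results-by-key callback remaining))))))
```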

I'm a bit uncomfortable with lazy-seq bodies that have side effects.

You might want to consider whether you should be using an RDBMS instead of
plain disk files for whatever it is you're doing.

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
-~----------~----~----~----~------~----~------~--~---