Seeking a function to partially parallelize collection processing

2017-06-16 Thread Tom Connors
I'm looking for a function that would likely be named something like 
"sequential-by" or "parallel-per" that takes some data-producing thing like 
a lazy seq or a core async channel, a function to split records from that 
input, and a function to handle each item. Each item with an identical 
return value from the "split" function would be handled sequentially, while 
the handling of the collection as a whole would be parallel.

If we assume this signature:
(parallel-per splitter handler inputs)


Calling it like this:
(parallel-per :id
  (fn [m] (prn (update m :val inc)))
  [{:id 1 :val 1}, {:id 1 :val 2}, {:id 3 :val 1}])


Would result in the first two maps being handled sequentially, while the 
third map is handled in parallel with the first two. The order of the 
printed lines would probably be non-deterministic, except {:id 1 :val 2} 
would be printed before {:id 1 :val 3}.

Note that for my use case I don't care about return values, but if 
something like this already exists it's plausible that it returns something 
equivalent to (map handler inputs).

Does anything like this already exist in core or some lib? If not, any 
recommendations for how to build it?

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Seeking a function to partially parallelize collection processing

2017-06-16 Thread Jose Figueroa Martinez
Hello,

there are many videos on how parallelize sequential processing on 
ClojureTV, but, the most basic way in Clojure I think is *pmap*

Saludos.


El viernes, 16 de junio de 2017, 9:13:11 (UTC-5), Tom Connors escribió:
>
> I'm looking for a function that would likely be named something like 
> "sequential-by" or "parallel-per" that takes some data-producing thing like 
> a lazy seq or a core async channel, a function to split records from that 
> input, and a function to handle each item. Each item with an identical 
> return value from the "split" function would be handled sequentially, while 
> the handling of the collection as a whole would be parallel.
>
> If we assume this signature:
> (parallel-per splitter handler inputs)
>
>
> Calling it like this:
> (parallel-per :id
>   (fn [m] (prn (update m :val inc)))
>   [{:id 1 :val 1}, {:id 1 :val 2}, {:id 3 :val 1}])
>
>
> Would result in the first two maps being handled sequentially, while the 
> third map is handled in parallel with the first two. The order of the 
> printed lines would probably be non-deterministic, except {:id 1 :val 2} 
> would be printed before {:id 1 :val 3}.
>
> Note that for my use case I don't care about return values, but if 
> something like this already exists it's plausible that it returns something 
> equivalent to (map handler inputs).
>
> Does anything like this already exist in core or some lib? If not, any 
> recommendations for how to build it?
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


feedback on file parsing with Clojure

2017-06-16 Thread AndyK
hello,

i'm looking for some feedback on how i've used Clojure to do some file 
parsing
still getting the hang of Clojure ways of thinking and i'd love to hear any 
advice on how to improve what i've done
for example, i'm guessing the way i've used cond blocks is a bit sketchy - 
at that point, i was kind of in a just-get-it-done mindset 

the file being parsed is here
https://github.com/AndyKriger/i-ching/blob/master/clojure/resources/i-ching.html

the code doing the parsing is here
https://github.com/AndyKriger/i-ching/blob/master/clojure/src/i_ching/parser.clj

the output is here (a browser JSON viewer is advised)
https://raw.githubusercontent.com/AndyKriger/i-ching/master/clojure/resources/i-ching.json

thank you for any help
a

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: feedback on file parsing with Clojure

2017-06-16 Thread Justin Smith
The primary suggestion I'd make here is to replace the doseq/reset!
construction in your main loop with reduce using a hash-map accumulator
representing each value you are updating with a separate key. This isn't
just more idiomatic, it also performs better.

Instead of:

(let [hexagrams (atom (sorted-map))
  state (atom :do-nothing)
  current-hexagram (atom {})]
  (doseq [line (line-seq rdr)]
(let [state-machine (@state PRETTY-STATE-MACHINE)
  line-match (re-matches (:regex state-machine) line)
  [new-state new-hexagram] ((:handler state-machine) line-match
@current-hexagram)]
  (reset! state new-state)
  (reset! current-hexagram new-hexagram)
  (swap! hexagrams assoc (:king-wen-number new-hexagram)
new-hexagram

something like:

(reduce (fn [acc line]
  (let [{:keys [state hexagrams current-hexagram]} acc
state-machine (state PRETTY-STATE-MACHINE)
line-match (re-matches (:regex state-machine) line)
[new-state new-hexagram] ((:handler state-machine)
line-match current-hexagram)]
{:state new-state
 :current-hexagram new-hexagram
 :hexagrams (assoc (:king-wen-number new-hexagram)
new-hexagram)}))
(line-seq rdr))

as a more minor issue, we idiomatically use [a b] instead of (vector a b)

On Fri, Jun 16, 2017 at 9:16 AM AndyK  wrote:

> hello,
>
> i'm looking for some feedback on how i've used Clojure to do some file
> parsing
> still getting the hang of Clojure ways of thinking and i'd love to hear
> any advice on how to improve what i've done
> for example, i'm guessing the way i've used cond blocks is a bit sketchy -
> at that point, i was kind of in a just-get-it-done mindset
>
> the file being parsed is here
>
> https://github.com/AndyKriger/i-ching/blob/master/clojure/resources/i-ching.html
>
> the code doing the parsing is here
>
> https://github.com/AndyKriger/i-ching/blob/master/clojure/src/i_ching/parser.clj
>
> the output is here (a browser JSON viewer is advised)
>
> https://raw.githubusercontent.com/AndyKriger/i-ching/master/clojure/resources/i-ching.json
>
> thank you for any help
> a
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Seeking a function to partially parallelize collection processing

2017-06-16 Thread Tom Connors
Hello Jose,
Thank you for the response, but pmap does not address my use case. It's 
insufficient for two reasons: 1) the entire collection must fit in memory. 
My use case is handling records from a Kinesis stream. and 2) pmap 
parallelizes over the whole collection, whereas I want to parallelize the 
collection handling while handling subsets of the data sequentially, as I 
discussed in my first post.
- Tom

On Friday, June 16, 2017 at 10:13:11 AM UTC-4, Tom Connors wrote:
>
> I'm looking for a function that would likely be named something like 
> "sequential-by" or "parallel-per" that takes some data-producing thing like 
> a lazy seq or a core async channel, a function to split records from that 
> input, and a function to handle each item. Each item with an identical 
> return value from the "split" function would be handled sequentially, while 
> the handling of the collection as a whole would be parallel.
>
> If we assume this signature:
> (parallel-per splitter handler inputs)
>
>
> Calling it like this:
> (parallel-per :id
>   (fn [m] (prn (update m :val inc)))
>   [{:id 1 :val 1}, {:id 1 :val 2}, {:id 3 :val 1}])
>
>
> Would result in the first two maps being handled sequentially, while the 
> third map is handled in parallel with the first two. The order of the 
> printed lines would probably be non-deterministic, except {:id 1 :val 2} 
> would be printed before {:id 1 :val 3}.
>
> Note that for my use case I don't care about return values, but if 
> something like this already exists it's plausible that it returns something 
> equivalent to (map handler inputs).
>
> Does anything like this already exist in core or some lib? If not, any 
> recommendations for how to build it?
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Seeking a function to partially parallelize collection processing

2017-06-16 Thread Justin Smith
pmap is rarely actually useful, but point 1 is false, pmap doesn't require
that it's input or output fit in memory

On Fri, Jun 16, 2017 at 12:52 PM Tom Connors  wrote:

> Hello Jose,
> Thank you for the response, but pmap does not address my use case. It's
> insufficient for two reasons: 1) the entire collection must fit in memory.
> My use case is handling records from a Kinesis stream. and 2) pmap
> parallelizes over the whole collection, whereas I want to parallelize the
> collection handling while handling subsets of the data sequentially, as I
> discussed in my first post.
> - Tom
>
> On Friday, June 16, 2017 at 10:13:11 AM UTC-4, Tom Connors wrote:
>>
>> I'm looking for a function that would likely be named something like
>> "sequential-by" or "parallel-per" that takes some data-producing thing like
>> a lazy seq or a core async channel, a function to split records from that
>> input, and a function to handle each item. Each item with an identical
>> return value from the "split" function would be handled sequentially, while
>> the handling of the collection as a whole would be parallel.
>>
>> If we assume this signature:
>> (parallel-per splitter handler inputs)
>>
>>
>> Calling it like this:
>> (parallel-per :id
>>   (fn [m] (prn (update m :val inc)))
>>   [{:id 1 :val 1}, {:id 1 :val 2}, {:id 3 :val 1}])
>>
>>
>> Would result in the first two maps being handled sequentially, while the
>> third map is handled in parallel with the first two. The order of the
>> printed lines would probably be non-deterministic, except {:id 1 :val 2}
>> would be printed before {:id 1 :val 3}.
>>
>> Note that for my use case I don't care about return values, but if
>> something like this already exists it's plausible that it returns something
>> equivalent to (map handler inputs).
>>
>> Does anything like this already exist in core or some lib? If not, any
>> recommendations for how to build it?
>>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to clojure@googlegroups.com
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> clojure+unsubscr...@googlegroups.com
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to clojure+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Seeking a function to partially parallelize collection processing

2017-06-16 Thread Tom Connors
Thanks Justin. My mistake. Point 2 stands.

On Friday, June 16, 2017 at 3:58:38 PM UTC-4, Justin Smith wrote:
>
> pmap is rarely actually useful, but point 1 is false, pmap doesn't require 
> that it's input or output fit in memory
>
> On Fri, Jun 16, 2017 at 12:52 PM Tom Connors  > wrote:
>
>> Hello Jose,
>> Thank you for the response, but pmap does not address my use case. It's 
>> insufficient for two reasons: 1) the entire collection must fit in memory. 
>> My use case is handling records from a Kinesis stream. and 2) pmap 
>> parallelizes over the whole collection, whereas I want to parallelize the 
>> collection handling while handling subsets of the data sequentially, as I 
>> discussed in my first post.
>> - Tom
>>
>> On Friday, June 16, 2017 at 10:13:11 AM UTC-4, Tom Connors wrote:
>>>
>>> I'm looking for a function that would likely be named something like 
>>> "sequential-by" or "parallel-per" that takes some data-producing thing like 
>>> a lazy seq or a core async channel, a function to split records from that 
>>> input, and a function to handle each item. Each item with an identical 
>>> return value from the "split" function would be handled sequentially, while 
>>> the handling of the collection as a whole would be parallel.
>>>
>>> If we assume this signature:
>>> (parallel-per splitter handler inputs)
>>>
>>>
>>> Calling it like this:
>>> (parallel-per :id
>>>   (fn [m] (prn (update m :val inc)))
>>>   [{:id 1 :val 1}, {:id 1 :val 2}, {:id 3 :val 1}])
>>>
>>>
>>> Would result in the first two maps being handled sequentially, while the 
>>> third map is handled in parallel with the first two. The order of the 
>>> printed lines would probably be non-deterministic, except {:id 1 :val 2} 
>>> would be printed before {:id 1 :val 3}.
>>>
>>> Note that for my use case I don't care about return values, but if 
>>> something like this already exists it's plausible that it returns something 
>>> equivalent to (map handler inputs).
>>>
>>> Does anything like this already exist in core or some lib? If not, any 
>>> recommendations for how to build it?
>>>
>> -- 
>> You received this message because you are subscribed to the Google
>> Groups "Clojure" group.
>> To post to this group, send email to clo...@googlegroups.com 
>> 
>> Note that posts from new members are moderated - please be patient with 
>> your first post.
>> To unsubscribe from this group, send email to
>> clojure+u...@googlegroups.com 
>> For more options, visit this group at
>> http://groups.google.com/group/clojure?hl=en
>> --- 
>> You received this message because you are subscribed to the Google Groups 
>> "Clojure" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to clojure+u...@googlegroups.com .
>> For more options, visit https://groups.google.com/d/optout.
>>
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Seeking a function to partially parallelize collection processing

2017-06-16 Thread Jose Figueroa Martinez
Hello Tom,

I think you are talking about distribution, not parallelization. As I see 
(sorry for not reading enough previously) you want a way to handle 
different things in a sequential way where each sequence of things (already 
grouped) are handled in a different thread.

You can put the things in a specific channel (*core.async*) depending on 
your *:id* (for example), and you can handle each specific channel on their 
own thread (*future* maybe).

I can be wrong but if I understood well you have a fairly plausible 
solution using *future*s and *core.async*'s channels.

Excuse me for not helping you more. I'm in a work meeting.


Saludos.


El viernes, 16 de junio de 2017, 9:13:11 (UTC-5), Tom Connors escribió:
>
> I'm looking for a function that would likely be named something like 
> "sequential-by" or "parallel-per" that takes some data-producing thing like 
> a lazy seq or a core async channel, a function to split records from that 
> input, and a function to handle each item. Each item with an identical 
> return value from the "split" function would be handled sequentially, while 
> the handling of the collection as a whole would be parallel.
>
> If we assume this signature:
> (parallel-per splitter handler inputs)
>
>
> Calling it like this:
> (parallel-per :id
>   (fn [m] (prn (update m :val inc)))
>   [{:id 1 :val 1}, {:id 1 :val 2}, {:id 3 :val 1}])
>
>
> Would result in the first two maps being handled sequentially, while the 
> third map is handled in parallel with the first two. The order of the 
> printed lines would probably be non-deterministic, except {:id 1 :val 2} 
> would be printed before {:id 1 :val 3}.
>
> Note that for my use case I don't care about return values, but if 
> something like this already exists it's plausible that it returns something 
> equivalent to (map handler inputs).
>
> Does anything like this already exist in core or some lib? If not, any 
> recommendations for how to build it?
>

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Stubbornly eager results in clojure.java.jdbc

2017-06-16 Thread Luke Burton

Riddle me this:

https://gist.github.com/hagmonk/a75621b143501966c22f53ed1e2bc36e 


Wherein I synthesize a large table in Postgres, then attempt to lazily load the 
table, discarding each row as I receive it. I tried *many* permutations and 
experiments, but settled on these two tests to illustrate my point. Which is 
that I simply can't get it to work with clojure.java.jdbc.

test1, according to all my research and reading of the source code involved, 
should consume the query results lazily. It does not, and I can't for the life 
of me figure out why. Traffic starts to stream in, and the heap is overwhelmed 
almost immediately. I've deliberately set the heap to 1 GB.

test2 uses a technique I borrowed wholesale from Ghadi Shayban in JDBC-99 
,
 which is to have ResultSet implement IReduceInit. It consumes a nominal amount 
of memory. I've verified it's actually doing something by putting counters in, 
and using YourKit to watch about 20 MB/s of traffic streaming into the JVM. 
It's brilliant, it doesn't even break 200 MB total heap usage.

I used YourKit to track where the memory is being retained for test1. Initially 
I made the mistake of not setting the fetchSize, so I saw an ArrayList inside 
the driver holding the reference. The driver documentation 
 confirms that 
autoCommit must be disabled and the fetchSize set to some non-zero number.

After making that change, YourKit confirmed that the GC root holding all the 
memory was the stack local variable "rs". At least I think it did, as a 
non-expert in this domain. I tried disassembling the functions using 
no.disassemble and the IntelliJ decompiler but I'm not really at the point 
where I understand what to look for.

So my questions are:

1) what am I doing wrong with clojure.java.jdbc?

Note some things I've already tried:

* using row-fn instead of result-set-fn
* using prepared statements
* explicitly setting auto-commit false on the connection
* declaring my result-set-fn with (^{:once true} *fn […]) (I did not see a 
change in the disassembly when using this)
* probably other things I am forgetting

2) in these situations where you suspect that the head of a lazy sequence is 
being retained, how do you reason about it? I'm kind of lucky this one blew the 
heap so quickly, who knows how much of my production code might burning memory 
unnecessarily but not quite as fatally. Do you disassemble the functions and 
observe some smoking gun? How do you peek under the covers to see where the 
problem is? 

Luke.

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
"Clojure" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.