Hi wonderful Groovy team,

I am really struggling to find a straightforward Groovy way to convert a 
simple linear script into one that uses some level of concurrency.  I cannot find 
suitable examples for the task at hand, namely:

I use a Derby database to collect data from email files on a server.  So:

  1.  I walk the directory tree using a mix of eachDirRecurse and eachFileMatch.
  2.  High-level directories are names of mailboxes, so I add these to the 
database, first checking that the mailbox is not already in the db.



    int addUser(def Username) {
        // Look the mailbox up first so an existing row is not inserted twice.
        def res = sql.firstRow("select id from user_info where user_name = ?", [Username])
        if (!res) {
            def keys = sql.executeInsert("insert into user_info (user_name) VALUES (?)", [Username])
            return keys[0][0]  // return the auto-generated row id number from the db
        } else {
            return res.id
        }
    }



I guess I could just insert the data - and if it errors with 'duplicate key' 
then I know it already exists, but then I would still need to obtain the row ID 
to return back to the caller.


  3.  And when I find a file that is an email type, I read it line by line 
until I obtain the required header details (date: / subject: / from: / 
message-id:) or a blank line (end of headers).  I add these details to the 
database (again, checking that the item is not already there).
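
To make the header-reading step described above concrete, here is a minimal sketch of the loop. The helper name readHeaders and the sample file are made up for illustration, and it keeps only the first line of a folded (multi-line) header, which is a simplifying assumption.

```groovy
// Hypothetical helper: collect the wanted headers from one email file.
Map<String, String> readHeaders(File emailFile) {
    def wanted = ['date', 'subject', 'from', 'message-id']
    def found = [:]
    emailFile.withReader { reader ->
        String line
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) break                    // blank line: end of headers
            if (Character.isWhitespace(line.charAt(0))) continue // folded continuation line, skipped
            int colon = line.indexOf(':')
            if (colon <= 0) continue                            // not a "name: value" line
            def name = line.substring(0, colon).trim().toLowerCase()
            if (name in wanted && !found.containsKey(name)) {
                found[name] = line.substring(colon + 1).trim()
            }
            if (found.size() == wanted.size()) break            // all wanted headers seen
        }
    }
    return found
}

// Usage with a throwaway sample file (contents invented for the example).
def sample = File.createTempFile('mail', '.eml')
sample.text = 'From: a@example.com\nSubject: Hello\n' +
        'Date: Mon, 1 Jan 2018 10:00:00 +0000\nMessage-ID: <1@example.com>\n\n' +
        'From: this line is in the body, not a header\n'
def headers = readHeaders(sample)
assert headers.subject == 'Hello'
sample.delete()
```

Stopping as soon as all four headers are found (or at the first blank line) means only the top of each file is read, which matters with several million files.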

So I currently use a single SQL connection and a simple loop over the directories 
and files.  It's simple and works well, but as there are several million files, 
I really need to use multiple threads.

I read that a "DataSource" is a way to pool database connections.  I just can't 
see how this works.  Does it dynamically create connections on demand [def 
sql = new Sql(mydatasource)], and when the 'sql' variable is garbage collected, 
is the connection returned to the 'pool'?  Is each sql instance "thread safe" 
with respect to the others?
Are prepared statements also in the 'pool', so that the sql statements are not 
parsed every time, regardless of the connection used?

As for concurrency...
I have previously used threads in a basic way.  Then I looked at GPars, 
which seems to be the appropriate way to go.  So how might the 'eachDirRecurse' 
and 'eachFileMatch' loops be altered to a GPars "withPool" collection loop?  How 
should each loop call the sql routines so they are thread safe?  Presumably by 
creating an sql connection from the datasource (pool) > do sql > done.  The withPool 
will create up to cpu-count + 1 threads, but should I use more with this type of 
process logic?  I assume that I could use a "withPool" within another 
"withPool", so that I can process several mailboxes [up to the pool count] 
concurrently and also the files within each mailbox in parallel.
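
For discussion, one possible shape for the loop is sketched below: walk the tree on a single thread to gather the files, then hand the list to a GPars pool. This is only a sketch under assumptions: GPars 1.2.1 is fetched with @Grab, the temporary directory tree, the processFile closure and the pool size of 4 are all invented stand-ins, and no database work is shown.

```groovy
@Grab('org.codehaus.gpars:gpars:1.2.1')
import groovy.io.FileType
import groovyx.gpars.GParsPool
import java.nio.file.Files

// Hypothetical stand-in for the per-file work (read headers, write to the db).
def processFile = { File f -> f.name.toLowerCase() }

// Build a tiny throwaway tree: one directory per mailbox, three files in each.
def root = Files.createTempDirectory('mailroot').toFile()
['alice', 'bob'].each { box ->
    def dir = new File(root, box)
    dir.mkdirs()
    (1..3).each { n -> new File(dir, "msg${n}.eml").text = "Subject: test ${n}\n\n" }
}

// Step 1 (single thread): walk the tree and gather the work items.
def files = []
root.eachFileRecurse(FileType.FILES) { f ->
    if (f.name.endsWith('.eml')) files << f
}

// Step 2 (parallel): process the gathered list on a fixed-size pool.
def poolSize = 4   // assumption: a value to be tuned by measurement
def results = GParsPool.withPool(poolSize) {
    files.collectParallel { f -> processFile(f) }
}

assert results.size() == 6
root.deleteDir()
```

Gathering the file list first keeps the directory walk itself sequential and puts only the per-file work on the pool, so a single pool size controls how many files are read at once.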

Is there some metric that determines how effective concurrent disk actions 
(just reading in this case) can be - e.g. so I could determine a sensible limit 
on the number of [email] files being read at the same time.  What monitoring 
method would help?

I don't think I need to use "actors" here, nor the "dataflow" feature.

Even after reading Groovy in Action (2nd ed.), it is still not really clear how to 
proceed.  I have googled a lot, but still cannot map my ideas into a GPars 
solution.  So I thought I should ask the experts, the Groovy community, for 
some suggestions or appropriate reading material.

The nearest I have found to a useful template on this topic is 
https://stackoverflow.com/questions/35702351/concurrent-parallel-database-queries-using-groovy
But I just cannot see how or why the db connection pool interacts with 
GPars so that the same connection is not grabbed by each concurrent process.

Yours, hopefully,

Merlin Beedell
