On 26.06.13 15:25, David Nicol wrote:
On Wed, Jun 26, 2013 at 7:26 AM, Andrew Snyder <a...@dancingjars.com
<mailto:a...@dancingjars.com>> wrote:
I want to write a query like:
select clients.client.client_id, columnar.sales.total_sales,
web.page_hits from clients, columnar, web
where clients.client_id = columnar.client_id
and clients.client_id = web.client_id
in a system where 'clients' is actually one or more relational
databases, 'columnar' is one or columnar databases, and 'web' is the
Apache logs on one or more web servers. The dbi driver would be
configured to connect to the correct databases and filter web hits
based on 'client_id'.
Has somebody written that already?
Thanks,
Andrew
it seems like the right thing to do here would be to do three queries,
against the three data sources, and store all the results in a hash of
arrays, then dump the results. Any solution that automates it will wind
up doing at least that anyway, and might not be optimized for the join.
S::S is not bad in joining tables - it's bad in optimizing queries.
I cannot fetch the smalles table first or first query those tables which
have constants in where clauses. So bad queries might cause gigabytes
on memory are wasted for resulting 50 lines.
That is what I meant by "not optimized" :)
Unless there really are so many client IDs that you need to process the
results as a stream or run out of memory, which is unlikely.
while (my ($c_id, $ar) = each %resultz){
$ar->[0] or next; # filter out client_id not appearing in
clients database
print join( "\t", $c_id, 0+$ar->[1], 0+$ar->[2]),"\n";
}
Well - even if not optimized, the implementation of SQL::Statement is
even better. And the datasources for S::S are easy to write - it
finally requires an open_table and fetch_row method.
Two parallel hashes containing the web and columnar results, accessed
once for each result from querying the clients table, would also work.
Yes, that's something S::S can't do out of the box. But it could do, I
have a version of DBD::Sys where during open_table() the query is send
against the data-sources and first fetch_row() synchronized the results
of the appropriate queries.
Cheers
--
Jens Rehsack