Brendan,

Taking into account David's response ...

I have several answers for you:

1. The functionality you are talking about is often referred to as a federated database: you have one database engine that the client talks to, which in turn farms some of the work out to various other database engines, coordinating between them and doing some extra work itself.  What you initially proposed is essentially using Perl as the client-most database engine, and what David proposed is using PostgreSQL as the client-most engine instead.

2. If PostgreSQL 9.1's SQL/MED abilities will do what you need, then use them.
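
For a taste of that, 9.1's contrib file_fdw wrapper can expose a CSV file as a (read-only) foreign table that joins against local tables.  Here is a rough sketch driven through DBI; the connection details, table, column, and file names are all made up for illustration:

   use DBI;

   # Connection details are placeholders.
   my $pg = DBI->connect('dbi:Pg:dbname=demo', 'user', 'pass',
       { RaiseError => 1 });

   # One-time setup: wrap a CSV file as a foreign table.
   $pg->do('CREATE EXTENSION file_fdw');
   $pg->do('CREATE SERVER csv_files FOREIGN DATA WRAPPER file_fdw');
   $pg->do(q{
       CREATE FOREIGN TABLE people (person_id integer, person text,
                                    age integer, dob date)
       SERVER csv_files
       OPTIONS (filename '/tmp/people.csv', format 'csv', header 'true')
   });

   # The foreign table then joins like any local table (books is local here).
   my $rows = $pg->selectall_arrayref(q{
       SELECT b.book, b.title, p.person, p.age, p.dob
       FROM books b JOIN people p ON b.person_id = p.person_id
   }, { Slice => {} });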

3. If immense scalability isn't needed and you want to do this in Perl, then I recommend you look at my Set::Relation CPAN module, which will handle a lot of the difficult work for you.  Specifically, you would still manage for yourself the parceling out of queries to the various back-end DBMSs, like Oracle or Excel or whatever, but Set::Relation will then take care of the drudgery of taking the various rowsets those return and combining them into the end results you actually wanted.  Each Set::Relation object contains a rowset, and you can use its several dozen methods to do relational joins or aggregates or antijoins or various other things, all the functionality of SQL, but in Perl.  Its main limiting factor is that it is entirely RAM-based, though this also makes it simpler.  So you can do this right now.
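
To make that concrete, here is a rough sketch of the glue involved; the $ora_dbh/$mysql_dbh handles, queries, and column names are made-up placeholders, and you should check the Set::Relation POD for the exact constructor and method signatures:

   use DBI;
   use Set::Relation::V2;

   # Fetch each side's rows from its own back-end DBMS.
   my $books  = $ora_dbh->selectall_arrayref(
       'SELECT book, title, person_id FROM books', { Slice => {} });
   my $people = $mysql_dbh->selectall_arrayref(
       'SELECT person_id, person, age, dob FROM people', { Slice => {} });

   # Wrap each rowset in a Set::Relation object ...
   my $r_books  = Set::Relation::V2->new( $books );
   my $r_people = Set::Relation::V2->new( $people );

   # ... then do the join (or any other relational op) in Perl.
   my $joined = $r_books->join( $r_people );  # natural join on person_id
   my $rows   = $joined->members();           # back to an array of hashrefs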

4. I am presently implementing a relational DBMS in Perl which provides all the functionality you described and more, including query language support that lets you write code like you demonstrated.  Strictly speaking, the initial version is fully self-contained for simplicity, but a subsequent/spinoff version would add the ability to farm work out to other database engines as per SQL/MED, and that version *is* designed to scale.  It even uses the paradigm you mention, where each underlying engine is essentially a namespace in which tables live, and you can join between them as if they were all local; or to be more accurate, each database *connection* has its own namespace, and the underlying engine is just a quality of that connection, much as DBI lets you have multiple connections using the same driver.

So if this sounds like something you want to help create, please talk with me.

-- Darren Duncan

Brendan Byrd wrote:
Okay, this is a big blue sky idea, but like all things open-source, it comes
out of a need.  I'm trying to merge together Excel (or CSV), Oracle, Fusion
Tables, JSON, and SNMP for various data points and outputs.  DBIC seems to
work great for a large database with a bunch of tables, but what about a
bunch of databases?  I've searched and searched, and nobody seems to have
designed a DBD for multiple DBDs.  There's DBD::Multi and Multiplex, but
those are merely for replication.  This would require reparsing of SQL
statements.

So, let's call this module idea DBD::IntegrateDB or MultiDB.  It would be a
module built from SQL::Statement (using the typical Embed instructions), so
it would use that module's SQL Engine for parsing and processing SQL.

We'll use a simple example of two databases: one Oracle, and one MySQL.
This module loads both of them via DBI->connect strings.  Then the dev
runs a prepare on the following SQL:

SELECT
   book, title, p.person, age, dob
FROM
   ora.books b
   INNER JOIN mysql.people p ON (
      b.person_id = p.person_id
   )

So, "ora.books" is on the Oracle DB, and "mysql.people" is on the MySQL DB.
The parser for this MultiDB would (a rough sketch follows the list):

   1. Use SQL::Parser to break down the SQL statement.
   2. Figure out who owns what, in terms of tables and columns.  (Complain
   about ambiguous columns if it has to.)
   3. Make table_info calls to the separate DBI handles, and gather other
   metadata such as the number of rows and cardinality (if available).
   4. Store the joining information.
   5. Prepare two *separate* SQL statements, one for each DB.  These would no
   longer be JOIN queries, but standard queries against the individual tables
   (including person_id, which wasn't in the original SELECT list).
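
Roughly, the front half of that might look something like this.  It's just a sketch: it assumes the parser can be taught to accept the db.table prefixes, and uses the table/column accessors that SQL::Statement documents:

   use SQL::Parser;
   use SQL::Statement;

   my $sql = q{
       SELECT book, title, p.person, age, dob
       FROM ora.books b INNER JOIN mysql.people p
         ON (b.person_id = p.person_id)
   };

   my $parser = SQL::Parser->new('ANSI');
   my $stmt   = SQL::Statement->new($sql, $parser);

   # Step 2: figure out which back-end DB owns each table.
   my %tables_by_db;
   for my $table ($stmt->tables()) {
       my ($db, $name) = split /\./, $table->name(), 2;  # 'ora.books', etc.
       push @{ $tables_by_db{$db} }, $name;
   }

   # Columns requested, so the per-DB SELECTs can be rebuilt in step 5.
   my @columns = map { $_->name() } $stmt->columns();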

Then when the statement is executed (a simplified sketch follows this list):

   1. The two SQL statements are executed, one against each DB.
   2. The fetch_row sub would process each row one at a time for each DB.
   3. If two IDs match, send a row back.  Otherwise, cache the data and wait
   for something to match.
   4. Repeat until the rows are exhausted on one or both sides.  (One side
   for INNER, both sides for OUTER.)
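
In the simplest case, an inner join on one key with the caching and streaming details above hand-waved away, the execute/fetch phase could be roughed out like this ($ora_dbh and $mysql_dbh stand in for the two underlying handles):

   # Run the per-DB statement against MySQL and index its rows by the join key.
   my $people_sth = $mysql_dbh->prepare(
       'SELECT person_id, person, age, dob FROM people');
   $people_sth->execute();

   my %people_by_id;
   while (my $p = $people_sth->fetchrow_hashref()) {
       $people_by_id{ $p->{person_id} } = $p;
   }

   # Stream the Oracle side and emit joined rows as they match.
   my $books_sth = $ora_dbh->prepare(
       'SELECT book, title, person_id FROM books');
   $books_sth->execute();

   while (my $b = $books_sth->fetchrow_hashref()) {
       my $p = $people_by_id{ $b->{person_id} } or next;  # INNER join
       # fetch_row would hand back only the originally requested columns:
       my @row = ($b->{book}, $b->{title}, $p->{person}, $p->{age}, $p->{dob});
   }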

Does anything like that exist?  I'm not saying it's an easy operation, but
even if something like that just started off with simple JOINs at first, it
would be a miracle module.  Imagine linking with more abstract DBI modules:
Oracle to CSV to MySQL to Teradata to Sys to Sponge.  Tell me you're not
excited at the prospect of eventually creating free open-source frameworks
competitive with multi-million-dollar "Business Intelligence" software.  Heck,
DBIC is getting there, but the data mining and integration piece is missing.

