Hi Group,

This is the second of two emails, each raising a related idea. The first canvasses an InputFormat that allows a data-chunk to be targeted at a MR job/cluster node, and is here: http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200810.mbox/[EMAIL PROTECTED]
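In case it helps to make that proposal concrete, here is a rough sketch of the kind of split I have in mind, written against the org.apache.hadoop.mapred.InputSplit interface. The class and field names (ChunkSplit, targetHost, chunkPath) are mine, purely for illustration, and note one caveat: in stock Hadoop, getLocations() is a scheduling hint, not a guarantee, so truly 'ensuring' delivery needs more than this.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;

/**
 * Sketch of a split that pins a complete data-chunk to one node.
 * The scheduler reads getLocations() as a placement hint, so a split
 * naming exactly one host steers its map task towards that host.
 */
public class ChunkSplit implements InputSplit {

  private Text targetHost = new Text(); // the one node meant to process this chunk
  private Text chunkPath  = new Text(); // where the chunk's data can be fetched from

  public ChunkSplit() {}                // no-arg constructor needed for deserialization

  public ChunkSplit(String host, String path) {
    targetHost.set(host);
    chunkPath.set(path);
  }

  // A single-element array: the only preferred node for this split.
  public String[] getLocations() throws IOException {
    return new String[] { targetHost.toString() };
  }

  public long getLength() throws IOException {
    return 0; // unknown up front; a real implementation would report the chunk size
  }

  public void write(DataOutput out) throws IOException {
    targetHost.write(out);
    chunkPath.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    targetHost.readFields(in);
    chunkPath.readFields(in);
  }
}

A matching InputFormat's getSplits() would just enumerate the (host, chunk) pairs and return one such split per chunk.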
As an aside, I noticed the following Jira item: "Add support for running MapReduce jobs over data residing in a MySQL table." https://issues.apache.org/jira/browse/HADOOP-2536

Anyone interested in that may like to investigate MonetDB (monetdb.cwi.nl), primarily because its out-of-the-box performance is quite impressive: http://monetdb.cwi.nl/projects/monetdb/SQL/Benchmark/TPCH/index.html

I'm not trying to suggest that PostgreSQL or MySQL should be abandoned - and I certainly don't want to start some 'dispute' :) I'm just observing that MonetDB offers impressive performance which, for a use case I think is reasonable, suggests adopting a particular configuration: a DB server on each compute node. The use case set out below does not require any built-in MonetDB-on-Hadoop support, or 'MR for MonetDB'. In fact I'm thinking in terms of a user's code being able to CRUD parsed and intermediate data, not just key-value pairs, before writing out final key-value pairs. Hopefully this introduces someone to an alternative DB they find useful, and adds some context/motivation for the additional InputFormat proposed earlier.

Specifically, I'm assuming:

1) A DB installation that is not tuned/tweaked (which all three could be), but is just installed.
2) A DB is installed on each cluster node to serve just that node (user scripts would connect to localhost:nnnn).
3) The cluster nodes are sufficiently resourced/spec'd for the datasets being queried.
4) The user has been able to load into the DB the _complete_ data-chunk required by that node's MR job - see my previous email above.
5) User queries are similar to the queries listed in the benchmark above, so the benchmark figures are a reasonable representation of the performance some users might experience.

Some observations:

- I definitely appreciate the sense in delegating queries to a DB, rather than re-implementing that functionality in user code.
- Further, one could have N slave MySQL servers to handle queries rather than have them all queue on one server. This would take the CPU and disk load off the cluster-node machines - load that local MySQL or PostgreSQL installations would impose for significant periods of time.
- There is additional network congestion with remote queries, especially if query results are large in size and/or number.
- However, in the case of MySQL and PostgreSQL the network latency is likely to be dominated by the query delay, suggesting it would be reasonable to hand queries off to dedicated servers rather than load down cluster nodes for, say, 1-300+ seconds in the case of MySQL.

It seems to me that, apart from the first observation, MonetDB could turn this client-server approach on its head for some use cases. With MonetDB, the network latency would likely dominate the time taken by the queries assumed above. For some use cases this suggests better performance could be achieved if each node has a MonetDB server serving just that node: network congestion is reduced, and while CPU and disk load increase, they do so only for very small intervals of time. Most importantly, results return much faster. However, memory load would increase (an issue on memory-skinny nodes?). Given that the benchmark figures refer to out-of-the-box installations, the indicated performance should be achievable in a DB-per-node configuration without too much admin effort.

If this is the cluster set-up, then it becomes important to have a convenient method of ensuring each node receives the data-chunk it needs to load into its localhost DB.
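To make assumption 2 concrete, here is a minimal sketch of such a map task, using the org.apache.hadoop.mapred API. The MonetDB JDBC driver class, port 50000, database name 'mrdb', default credentials, table 'chunk_orders', and the SQL itself are my illustrative assumptions - substitute whatever your local server actually exposes:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/**
 * Sketch of a map task that delegates its query to the DB server on its
 * own node (assumption 2): the connection never leaves localhost.
 */
public class LocalDbMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private Connection conn;

  public void configure(JobConf job) {
    try {
      // Connect once per task to the node-local server.
      Class.forName("nl.cwi.monetdb.jdbc.MonetDriver");
      conn = DriverManager.getConnection(
          "jdbc:monetdb://localhost:50000/mrdb", "monetdb", "monetdb");
    } catch (Exception e) {
      throw new RuntimeException("cannot reach the node-local DB", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    // Treat each input record as a query parameter; the heavy lifting is
    // delegated to the local DB rather than re-implemented in user code.
    String custKey = value.toString().trim();
    try {
      Statement st = conn.createStatement();
      // No escaping here - this is a sketch; a real job would use PreparedStatement.
      ResultSet rs = st.executeQuery(
          "SELECT COUNT(*) FROM chunk_orders WHERE o_custkey = " + custKey);
      while (rs.next()) {
        output.collect(new Text(custKey), new LongWritable(rs.getLong(1)));
      }
      rs.close();
      st.close();
    } catch (SQLException e) {
      throw new IOException(e.toString());
    }
  }

  public void close() throws IOException {
    try { if (conn != null) conn.close(); } catch (SQLException e) { /* best effort */ }
  }
}

The point is just that the query traffic stays on the node; everything else is ordinary MapReduce.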
This is where I think my earlier proposal, the CanonicalInputFormat, adds considerable value - it offers a flexible way to ensure a complete data-chunk is delivered to a MapReduce task's node - correct? In this situation a user _could_ write a script to query a remote/localhost DB, but it seems important to have the convenience of the CanonicalInputFormat for targeting data-chunks at nodes/MR jobs. I think the most important point above is that it demonstrates a compelling use case for the CanonicalInputFormat I raised earlier, whether or not there is built-in MR support for MonetDB.

Thoughts? Am I imagining the benefits of using MonetDB as described? Is explicit MR support for MonetDB worth raising as a feature request in Jira?

Cheers
Mark