Hi Group,

This is the second of two emails, each raising a related idea. The first canvasses an InputFormat that allows a data-chunk to be targeted at a MR job/cluster node, and is here: http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200810.mbox/[EMAIL PROTECTED]
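In case it helps to make that proposal concrete, here is a rough sketch of the kind of split I have in mind, written against the org.apache.hadoop.mapred.InputSplit interface. The class and field names (ChunkSplit, targetHost, chunkPath) are mine, purely for illustration, and note one caveat: in stock Hadoop, getLocations() is a scheduling hint, not a guarantee, so truly 'ensuring' delivery needs more than this.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;

/**
 * Sketch of a split that pins a complete data-chunk to one node.
 * The scheduler reads getLocations() as a placement hint, so a split
 * naming exactly one host steers its map task towards that host.
 */
public class ChunkSplit implements InputSplit {

  private Text targetHost = new Text(); // the one node meant to process this chunk
  private Text chunkPath  = new Text(); // where the chunk's data can be fetched from

  public ChunkSplit() {}                // no-arg constructor needed for deserialization

  public ChunkSplit(String host, String path) {
    targetHost.set(host);
    chunkPath.set(path);
  }

  // A single-element array: the only preferred node for this split.
  public String[] getLocations() throws IOException {
    return new String[] { targetHost.toString() };
  }

  public long getLength() throws IOException {
    return 0; // unknown up front; a real implementation would report the chunk size
  }

  public void write(DataOutput out) throws IOException {
    targetHost.write(out);
    chunkPath.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    targetHost.readFields(in);
    chunkPath.readFields(in);
  }
}

A matching InputFormat's getSplits() would just enumerate the (host, chunk) pairs and return one such split per chunk.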
As an aside, I noticed the following Jira item: "Add support for running MapReduce jobs over data residing in a MySQL table." https://issues.apache.org/jira/browse/HADOOP-2536

Anyone interested in that may like to investigate MonetDB (monetdb.cwi.nl), primarily because its out-of-the-box performance is quite impressive: http://monetdb.cwi.nl/projects/monetdb/SQL/Benchmark/TPCH/index.html

I'm not trying to suggest that PostgreSQL or MySQL should be abandoned - and I certainly don't want to start some 'dispute' :) I'm just observing that MonetDB offers impressive performance which, for a use case I think is reasonable, suggests adopting a particular configuration: a DB server on each compute node. The use case set out below does not require any built-in MonetDB-on-Hadoop support, or 'MR for MonetDB'. In fact I'm thinking in terms of a user's code being able to CRUD parsed and intermediate data, not just key-value pairs, before writing out final key-value pairs. Hopefully this introduces someone to an alternative DB they find useful, and adds some context/motivation for the additional InputFormat proposed earlier.

Specifically, I'm assuming:

1) A DB installation that is not tuned/tweaked (which all three could be), but is just installed.
2) A DB is installed on each cluster node to serve just that node (user scripts would connect to localhost:nnnn).
3) The cluster nodes are sufficiently resourced/spec'd for the datasets being queried.
4) The user has been able to load into the DB the _complete_ data-chunk required by that node's MR job - see my previous email above.
5) User queries are similar to the queries listed in the benchmark above, so the benchmark figures are a reasonable representation of the performance some users might experience.

Some observations:

- I definitely appreciate the sense in delegating queries to a DB, rather than re-implementing that functionality in user code.
- Further, one could have N slave MySQL servers to handle queries rather than have them all queue on one server. This would take the CPU and disk load off the cluster-node machines - load that local MySQL or PostgreSQL installations would impose for significant periods of time.
- There is additional network congestion with remote queries, especially if query results are large in size and/or number.
- However, in the case of MySQL and PostgreSQL the network latency is likely to be dominated by the query delay, suggesting it would be reasonable to hand queries off to dedicated servers rather than load down cluster nodes for, say, 1-300+ seconds in the case of MySQL.

It seems to me that, apart from the first observation, MonetDB could turn this client-server approach on its head for some use cases. With MonetDB, the network latency would likely dominate the time taken by the queries assumed above. For some use cases this suggests better performance could be achieved if each node has a MonetDB server serving just that node: network congestion is reduced, and while CPU and disk load increase, they do so only for very small intervals of time. Most importantly, results return much faster. However, memory load would increase (an issue on memory-skinny nodes?). Given that the benchmark figures refer to out-of-the-box installations, the indicated performance should be achievable in a DB-per-node configuration without too much admin effort.

If this is the cluster set-up, then it becomes important to have a convenient method of ensuring each node receives the data-chunk it needs to load into its localhost DB.
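To make assumption 2 concrete, here is a minimal sketch of such a map task, using the org.apache.hadoop.mapred API. The MonetDB JDBC driver class, port 50000, database name 'mrdb', default credentials, table 'chunk_orders', and the SQL itself are my illustrative assumptions - substitute whatever your local server actually exposes:

import java.io.IOException;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/**
 * Sketch of a map task that delegates its query to the DB server on its
 * own node (assumption 2): the connection never leaves localhost.
 */
public class LocalDbMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private Connection conn;

  public void configure(JobConf job) {
    try {
      // Connect once per task to the node-local server.
      Class.forName("nl.cwi.monetdb.jdbc.MonetDriver");
      conn = DriverManager.getConnection(
          "jdbc:monetdb://localhost:50000/mrdb", "monetdb", "monetdb");
    } catch (Exception e) {
      throw new RuntimeException("cannot reach the node-local DB", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    // Treat each input record as a query parameter; the heavy lifting is
    // delegated to the local DB rather than re-implemented in user code.
    String custKey = value.toString().trim();
    try {
      Statement st = conn.createStatement();
      // No escaping here - this is a sketch; a real job would use PreparedStatement.
      ResultSet rs = st.executeQuery(
          "SELECT COUNT(*) FROM chunk_orders WHERE o_custkey = " + custKey);
      while (rs.next()) {
        output.collect(new Text(custKey), new LongWritable(rs.getLong(1)));
      }
      rs.close();
      st.close();
    } catch (SQLException e) {
      throw new IOException(e.toString());
    }
  }

  public void close() throws IOException {
    try { if (conn != null) conn.close(); } catch (SQLException e) { /* best effort */ }
  }
}

The point is just that the query traffic stays on the node; everything else is ordinary MapReduce.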
This is where I think my earlier proposal, the CanonicalInputFormat, adds considerable value - it offers a flexible way to ensure a complete data-chunk is delivered to a MapReduce task's node - correct? In this situation a user _could_ write a script to query a remote/localhost DB, but it seems important to have the convenience of the CanonicalInputFormat for targeting data-chunks at nodes/MR jobs. I think the most important point above is that it demonstrates a compelling use case for the CanonicalInputFormat I raised earlier, whether or not there is built-in MR support for MonetDB.

Thoughts? Am I imagining the benefits of using MonetDB as described? Is explicit MR support for MonetDB worth raising as a feature request in Jira?

Cheers
Mark