On Wed, Apr 29, 2009 at 2:48 PM, Todd Lipcon <t...@cloudera.com> wrote:
> On Wed, Apr 29, 2009 at 7:19 AM, Stefan Podkowinski <spo...@gmail.com> wrote:
>
>> If you have trouble loading your data into mysql using INSERTs or LOAD
>> DATA, consider that MySQL supports CSV directly using the CSV storage
>> engine. The only thing you have to do is to copy your hadoop produced
>> csv file into the mysql data directory and issue a "flush tables"
>> command to have mysql flush its caches and pickup the new file. Its
>> very simple and you have the full set of sql commands available just
>> as with innodb or myisam. What you don't get with the csv engine are
>> indexes and foreign keys. Can't have it all, can you?
>>
>
> The CSV storage engine is definitely an interesting option, but it has a
> couple downsides:
>
> - Like you mentioned, you don't get indexes. This seems like a huge deal to
> me - the reason you want to load data into MySQL instead of just keeping it
> in Hadoop is so you can service real-time queries. Not having any indexing
> kind of defeats the purpose there. This is especially true since MySQL only
> supports nested-loop joins, and there's no way of attaching metadata to a
> CSV table to say "hey look, this table is already in sorted order so you can
> use a merge join".
>
> - Since CSV is a text based format, it's likely to be a lot less compact
> than a proper table. For example, a unix timestamp is likely to be ~10
> characters vs 4 bytes in a packed table.
>
> - I'm not aware of many people actually using CSV for anything except
> tutorials and training. Since it's not in heavy use by big mysql users, I
> wouldn't build a production system around it.
>
> Here's a wacky idea that I might be interested in hacking up if anyone's
> interested:
>
> What if there were a MyISAMTableOutputFormat in hadoop? You could use this
> as a reducer output and have it actually output .frm and .myd files onto
> HDFS, then simply hdfs -get them onto DB servers for realtime serving.
> Sounds like a fun hack I might be interested in if people would find it
> useful. Building the .myi indexes in Hadoop would be pretty killer as well,
> but potentially more difficult.
>
> -Todd
>
The .frm and .myd files are binary and platform dependent. You cannot even move them between 32-bit and 64-bit systems. Generating them without the native tools would be difficult. Moving them around with HDFS might have merit, although rsync could accomplish the same thing. Derby-DB might be a better candidate for something like this, since the underlying DB is cross-platform.
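
To put rough numbers on the text-versus-packed point quoted above, and to show why byte layout matters for any binary file, here is a minimal Python sketch. It only uses the standard struct module; the timestamp value and the variable names are made up for illustration, and it does not model the actual .myd or .frm layout.

```python
import struct

# Illustrative Unix timestamp (an assumed value, not taken from any real table).
ts = 1241012880

# How a CSV file would hold the value: one ASCII character per digit.
as_text = str(ts).encode("ascii")

# The same value as a 4-byte unsigned integer with an explicit byte order.
as_little = struct.pack("<I", ts)  # little-endian
as_big = struct.pack(">I", ts)     # big-endian

text_len = len(as_text)
packed_len = len(as_little)

# Reading packed bytes back only works if the reader assumes the writer's layout.
round_trip = struct.unpack("<I", as_little)[0]
misread = struct.unpack(">I", as_little)[0]

print(text_len, packed_len)   # size of the text form vs the packed form
print(round_trip == ts)       # same byte order on both sides
print(misread == ts)          # mismatched byte order
```

Any writer running inside Hadoop would have to fix the byte order and field widths up front, the way the "<I" format does here, rather than rely on whatever the machine running the reducer uses natively.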