Slightly off topic, since this is a non-MySQL solution ...

We have the same problem: computing about 100G of data daily and serving it
online with minimal impact during data refreshes.

We are using our in-house clone of Amazon's Dynamo, a distributed key-value
store (Project Voldemort), for the serving side. Project Voldemort supports
a ReadOnlyStore that uses file-based data/index files. The interesting part
is that we compute the new data and index files on Hadoop and just hot-swap
them on the Voldemort nodes. Total swap time is roughly the scp/rsync time,
and the actual service impact is very small (just closing and reopening
file descriptors).
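
The general pattern looks roughly like this (an illustrative sketch, not the
actual Voldemort code; class, method, and path names are made up):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.*;

// Illustrative sketch of the hot-swap pattern. New data/index files are
// built on Hadoop, rsynced into a fresh versioned directory, then swapped
// in by flipping a "current" symlink and reopening file descriptors.
public class HotSwapStore {
    private final Path currentLink;    // e.g. /data/store/current -> version-42
    private volatile FileChannel data; // descriptor the read path uses

    public HotSwapStore(Path currentLink) throws IOException {
        this.currentLink = currentLink;
        this.data = FileChannel.open(currentLink.resolve("store.data"),
                                     StandardOpenOption.READ);
    }

    // Called once the new version directory has been fully rsynced over.
    public synchronized void swapTo(Path newVersionDir) throws IOException {
        Path tmp = currentLink.resolveSibling("current.tmp");
        Files.deleteIfExists(tmp);
        Files.createSymbolicLink(tmp, newVersionDir);
        // On POSIX, rename(2) replaces the old link atomically, so readers
        // always see either the old or the new version, never neither.
        Files.move(tmp, currentLink, StandardCopyOption.ATOMIC_MOVE);
        FileChannel old = data;
        data = FileChannel.open(currentLink.resolve("store.data"),
                                StandardOpenOption.READ);
        old.close(); // the only service-visible step: close + reopen
    }
}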

Thanks a lot for the info; this thread has been very interesting.

Best
Bhupesh


On 4/29/09 11:48 AM, "Todd Lipcon" <t...@cloudera.com> wrote:

> On Wed, Apr 29, 2009 at 7:19 AM, Stefan Podkowinski <spo...@gmail.com>wrote:
> 
>> If you have trouble loading your data into MySQL using INSERTs or LOAD
>> DATA, consider that MySQL supports CSV directly via the CSV storage
>> engine. The only thing you have to do is copy your Hadoop-produced
>> CSV file into the MySQL data directory and issue a "FLUSH TABLES"
>> command to have MySQL flush its caches and pick up the new file. It's
>> very simple, and you have the full set of SQL commands available just
>> as with InnoDB or MyISAM. What you don't get with the CSV engine are
>> indexes and foreign keys. Can't have it all, can you?
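>> 
>> For illustration, the whole refresh could look like this (paths, schema,
>> and credentials are made up; needs the MySQL JDBC driver on the classpath):
>> 
>> // Assumes the table already exists as:
>> //   CREATE TABLE hits (ts INT NOT NULL, url VARCHAR(255) NOT NULL)
>> //   ENGINE=CSV;
>> // (the CSV engine requires NOT NULL columns)
>> import java.nio.file.*;
>> import java.sql.*;
>> 
>> public class CsvRefresh {
>>     public static void main(String[] args) throws Exception {
>>         // Overwrite the engine's backing file with the Hadoop output.
>>         Files.copy(Paths.get("/tmp/part-00000"),
>>                    Paths.get("/var/lib/mysql/mydb/hits.CSV"),
>>                    StandardCopyOption.REPLACE_EXISTING);
>>         // Make mysqld drop its cached handle and re-read the file.
>>         try (Connection c = DriverManager.getConnection(
>>                  "jdbc:mysql://localhost/mydb", "user", "pass");
>>              Statement s = c.createStatement()) {
>>             s.execute("FLUSH TABLES");
>>         }
>>     }
>> }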
>> 
> 
> The CSV storage engine is definitely an interesting option, but it has a
> couple of downsides:
> 
> - Like you mentioned, you don't get indexes. This seems like a huge deal to
> me - the reason you want to load data into MySQL instead of just keeping it
> in Hadoop is so you can service real-time queries. Not having any indexing
> kind of defeats the purpose there. This is especially true since MySQL only
> supports nested-loop joins, and there's no way of attaching metadata to a
> CSV table to say "hey look, this table is already in sorted order so you can
> use a merge join".
> 
> - Since CSV is a text-based format, it's likely to be a lot less compact
> than a proper table. For example, a unix timestamp is likely to be ~10
> characters vs 4 bytes in a packed table.
> 
> - I'm not aware of many people actually using CSV for anything except
> tutorials and training. Since it's not in heavy use by big MySQL users, I
> wouldn't build a production system around it.
> 
> Here's a wacky idea that I might be interested in hacking up if anyone's
> interested:
> 
> What if there were a MyISAMTableOutputFormat in Hadoop? You could use this
> as a reducer output and have it actually write .frm and .MYD files onto
> HDFS, then simply "hadoop fs -get" them onto DB servers for realtime serving.
> Sounds like a fun hack I might be interested in if people would find it
> useful. Building the .MYI indexes in Hadoop would be pretty killer as well,
> but potentially more difficult.
> 
> -Todd
