On Wed, Apr 29, 2009 at 10:19 AM, Stefan Podkowinski <spo...@gmail.com> wrote:
> If you have trouble loading your data into MySQL using INSERTs or LOAD
> DATA, consider that MySQL supports CSV directly via the CSV storage
> engine. The only thing you have to do is copy your Hadoop-produced
> CSV file into the MySQL data directory and issue a "FLUSH TABLES"
> command to have MySQL flush its caches and pick up the new file. It's
> very simple, and you have the full set of SQL commands available just
> as with InnoDB or MyISAM. What you don't get with the CSV engine are
> indexes and foreign keys. Can't have it all, can you?
>
> Stefan
>
> On Tue, Apr 28, 2009 at 9:23 PM, Bill Habermaas <b...@habermaas.us> wrote:
>> Excellent discussion. Thank you, Todd.
>> You're forgiven for being off topic (at least by me).
>> :)
>> Bill
>>
>> -----Original Message-----
>> From: Todd Lipcon [mailto:t...@cloudera.com]
>> Sent: Tuesday, April 28, 2009 2:29 PM
>> To: core-user
>> Subject: Re: Hadoop / MySQL
>>
>> Warning: derailing a bit into MySQL discussion below, but I think
>> enough people have similar use cases that it's worth discussing this
>> even though it's gotten off-topic.
>>
>> 2009/4/28 tim robertson <timrobertson...@gmail.com>
>>
>>> So we ended up with 2 DBs:
>>> - DB1 we insert into, prepare, and do batch processing on
>>> - DB2 serving the read-only web app
>>
>> This is a pretty reasonable and common architecture. Depending on your
>> specific setup, instead of flip-flopping between DB1 and DB2, you could
>> actually pull snapshots of MyISAM tables off DB1 and load them onto
>> other machines. As long as you've flushed the tables with a read lock,
>> MyISAM tables are transferable between machines (e.g. via rsync).
>> Obviously this can get a bit hairy, but it's a nice trick to consider
>> for this kind of workflow.
>>
>>> Why did we end up with this? Because of locking on writes that kill
>>> reads, as you say...
>>> basically you can't insert when a read is
>>> happening on MyISAM, as it locks the whole table.
>>
>> This is only true if you have binary logging enabled. Otherwise, MyISAM
>> supports concurrent inserts with reads. That said, binary logging is
>> necessary if you have any slaves. If you're loading bulk data from the
>> result of a MapReduce job, you might be better off not using
>> replication and simply loading the bulk data to each of the serving
>> replicas individually. Turning off the binary logging will also double
>> your write speed (LOAD DATA writes the entirety of the data to the
>> binary log as well as to the table).
>>
>>> InnoDB has row-level
>>> locking to get around this, but in our experience (at the time we had
>>> 130 million records) it just didn't work either.
>>
>> You're quite likely to be hitting the InnoDB autoincrement lock if you
>> have an autoincrement primary key here. There are fixes for this in
>> MySQL 5.1. The best solution is to avoid autoincrement primary keys and
>> use LOAD DATA for these kinds of bulk loads, as others have suggested.
>>
>>> We spent €10,000 for
>>> the supposed "European expert" on MySQL from their professional
>>> services and were unfortunately very disappointed. It seems such large
>>> tables are just problematic with MySQL. We are now very much looking
>>> into Lucene technologies for search and Hadoop for reporting and
>>> data-mining-type operations. SOLR does a lot of what our DB does for
>>> us.
>>
>> Yep - oftentimes MySQL is not the correct solution, but other times it
>> can be just what you need. If you already have competencies with MySQL
>> and a good access layer from your serving tier, it's often easier to
>> stick with MySQL than to add a new technology into the mix.
>>
>>> So with myisam...
>>> here is what we learnt:
>>>
>>> Only the very latest MySQL versions (still beta, I think) support more
>>> than 4G of memory for indexes (you really, really need the index in
>>> memory, and where possible the FKs for joins in the index too).
>>
>> As far as I know, any 64-bit MySQL instance will use more than 4G
>> without trouble.
>>
>>> MySQL has
>>> differing join strategies between InnoDB and MyISAM, so be aware.
>>
>> I don't think this is true. Joining happens at the MySQL execution
>> layer, which is above the storage engine API. The same join strategies
>> are available for both. For a particular query, InnoDB and MyISAM
>> tables may end up producing different query plans based on the
>> statistics that are collected, but given properly analyzed tables, the
>> strategies will be the same. This is how MySQL allows
>> inter-storage-engine joins. If one engine is producing a better query
>> plan, you can use query hints to enforce that plan (see STRAIGHT_JOIN
>> and FORCE INDEX, for example).
>>
>>> An undocumented feature of MyISAM is that you can create memory
>>> buffers for single indexes. In the my.cnf:
>>>
>>> taxon_concept_cache.key_buffer_size=3990M  -- for some reason
>>> you have to drop a little under 4G
>>>
>>> then in the DB run:
>>>
>>> cache index taxon_concept in taxon_concept_cache;
>>> load index into cache taxon_concept;
>>>
>>> This makes sure an index definitely ends up in memory.
>>
>> But for most use cases on a properly configured machine, you're better
>> off letting MySQL use its own caching policies to keep hot indexes in
>> RAM.
>>
>>> And here are our main cnf changes:
>>>
>>> sort_buffer_size = 128M
>>
>> This is likely to be causing huge slowdowns on ORDER BYs. See
>> http://www.mysqlperformanceblog.com/2007/08/18/how-fast-can-you-sort-data-with-mysql/
>> for example. Try setting it to 128K instead.
>>
>>> read_buffer_size = 64M
>>> read_rnd_buffer_size = 300M
>>
>> read_rnd_buffer_size seems too high to me.
>>
>> Hope this is helpful for anyone else using MySQL as a sink for
>> MapReduce output.
>>
>> -Todd
>>
>>> On Tue, Apr 28, 2009 at 2:49 PM, Wang Zhong <wangzhong....@gmail.com> wrote:
>>> > Hi,
>>> >
>>> > That's really cool. It seems that Hadoop can work with SQL DBs like
>>> > MySQL in bearable time. I had thought that when inserting data into
>>> > MySQL, the cost of communication was always a big problem. You found
>>> > a method to reduce that cost.
>>> >
>>> > Using distributed databases like HBase is another good choice. HBase
>>> > is based on Hadoop and HDFS, so there is no problem with
>>> > communication cost.
>>> >
>>> > On Tue, Apr 28, 2009 at 6:46 PM, Ankur Goel <ankur.g...@corp.aol.com> wrote:
>>> >>
>>> >> hello hadoop users,
>>> >> Recently I had a chance to lead a team building a log-processing
>>> >> system that uses Hadoop and MySQL. The system's goal was to process
>>> >> the incoming information as quickly as possible (real time or near
>>> >> real time) and make it available for querying in MySQL. I thought
>>> >> it would be good to share the experience and the challenges with
>>> >> the community. Couldn't think of a better place than these mailing
>>> >> lists, as I am not much of a blogger :-)
>>> >>
>>> >> The information flow in the system looks something like
>>> >>
>>> >> [Apache-Servers] -> [Hadoop] -> [MySQL-shards] -> [Query-Tools]
>>> >>
>>> >> Transferring from Apache servers to Hadoop was quite easy, as we
>>> >> just had to organize the data into timely buckets (directories).
>>> >> Once that was running smoothly, we had to make sure that map-reduce
>>> >> jobs were fired at regular intervals and picked up the right data.
>>> >> The jobs would then process/aggregate the data and dump the info
>>> >> into MySQL shards from the reducers [we have our own DB
>>> >> partitioning set up]. This is where we hit major bottlenecks [any
>>> >> surprises?
:-)]
>>> >>
>>> >> The table engine used was InnoDB, as there was a need for fast
>>> >> replication and writes but only moderate reads (it should
>>> >> eventually support high read rates). The data would take quite a
>>> >> while to load completely, far from being near real time. And so our
>>> >> optimization journey began.
>>> >>
>>> >> 1. We tried to optimize/tune InnoDB parameters, like increasing the
>>> >> buffer pool size to 75% of available RAM. This helped, but only as
>>> >> long as the DBs were lightly loaded, i.e. InnoDB had sufficient
>>> >> buffer pool to host the data and indexes.
>>> >>
>>> >> 2. We also realized that InnoDB has considerable locking overhead,
>>> >> because of which write concurrency is really bad when you have a
>>> >> large number of concurrent threads doing writes. The default thread
>>> >> concurrency for us was set to no_of_cpu * 2 = 8, which is what the
>>> >> official documentation advises as the optimal limit. So we limited
>>> >> the number of reduce tasks, and consequently the number of
>>> >> concurrent writes, and boy, the performance improved 4x. We were
>>> >> almost there :-)
>>> >>
>>> >> 3. The next thing we tried was standard DB optimization techniques,
>>> >> like de-normalizing the schema and dropping constraints. This gave
>>> >> only a minor performance improvement, nothing earth-shattering.
>>> >> Note that we were already caching connections to each MySQL shard
>>> >> in the reducers, and the partitioning logic was embedded in the
>>> >> reducers.
>>> >>
>>> >> 4. Still falling short of our performance objectives, we finally
>>> >> decided to get rid of JDBC writes from the reducers and work on an
>>> >> alternative that uses MySQL's LOAD utility.
>>> >> - The processing would partition the data into MySQL-shard-specific
>>> >> files resident in HDFS.
>>> >> - A script would then spawn processes via ssh on different physical
>>> >> machines to download this data.
>>> >> - Each spawned process just downloads the data for the shard it
>>> >> should upload to.
>>> >> - All the processes then start uploading data in parallel into
>>> >> their respective MySQL shards using LOAD DATA INFILE.
>>> >>
>>> >> This proved to be the fastest approach, even in the wake of
>>> >> increasing data loads. The entire processing/loading would complete
>>> >> in less than 6 minutes. The system has been holding up quite well
>>> >> so far, even though we've had to limit the number of days for which
>>> >> we keep the data, or else the MySQL shards get overwhelmed.
>>> >>
>>> >> Hope this is helpful to people.
>>> >>
>>> >> Regards
>>> >> -Ankur
>>>
>>> > --
>>> > Wang Zhong
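For concreteness, the per-shard load in Ankur's step 4 might boil down to something like the following, run on each shard host after its file has been pulled out of HDFS. Table, column, and file names here are invented for illustration, not taken from his setup:

```sql
-- Hypothetical example: assumes the shard's file was fetched locally
-- first, e.g. with: hadoop fs -get /output/shard_03.csv /tmp/shard_03.csv
SET sql_log_bin = 0;  -- session-only: skip the binary log so the rows
                      -- aren't written twice (see Todd's note above)

LOAD DATA LOCAL INFILE '/tmp/shard_03.csv'
INTO TABLE page_hits
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
(hit_time, url, status, bytes_sent);
```

SET sql_log_bin = 0 affects only the current session and requires the SUPER privilege; skip it if the shard has slaves that need the rows.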
Here is a thought that just blew my mind. The MySQL CSV storage engine is essentially one of the functions of hadoop-hive. Imagine a type of "auto-replication" that could store the same data in MySQL and in hadoop-hive. You would be able to support short-running MySQL queries and long-running Hive queries over the same dataset...
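Stefan's CSV-engine trick from the top of the thread would be the MySQL half of that pairing. A minimal sketch, with invented table and column names (note that CSV tables require every column to be NOT NULL and support no indexes):

```sql
-- MySQL side: a table whose backing file is plain CSV, living at
-- <datadir>/<database>/hits.CSV
CREATE TABLE hits (
  hit_time VARCHAR(20)  NOT NULL,
  url      VARCHAR(255) NOT NULL,
  status   INT          NOT NULL
) ENGINE=CSV;

-- After overwriting the backing file with Hadoop output,
-- e.g. cp part-00000 /var/lib/mysql/mydb/hits.CSV, make MySQL
-- drop its caches and pick up the new file:
FLUSH TABLES;

-- Hive side (sketch): an external table over the same data kept in HDFS
-- CREATE EXTERNAL TABLE hits (hit_time STRING, url STRING, status INT)
--   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
--   LOCATION '/logs/hits';
```

The same CSV files then serve both engines: MySQL reads its local copy, Hive scans the HDFS copy, and "replication" is just copying the file to both places.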