On Wed, Apr 29, 2009 at 10:19 AM, Stefan Podkowinski <spo...@gmail.com> wrote:
> If you have trouble loading your data into mysql using INSERTs or LOAD
> DATA, consider that MySQL supports CSV directly using the CSV storage
> engine. The only thing you have to do is to copy your hadoop produced
> csv file into the mysql data directory and issue a "flush tables"
> command to have mysql flush its caches and pickup the new file. Its
> very simple and you have the full set of sql commands available just
> as with innodb or myisam. What you don't get with the csv engine are
> indexes and foreign keys. Can't have it all, can you?
>
> Stefan
>
>
> On Tue, Apr 28, 2009 at 9:23 PM, Bill Habermaas <b...@habermaas.us> wrote:
>> Excellent discussion. Thank you Todd.
>> You're forgiven for being off topic (at least by me).
>> :)
>> Bill
>>
>> -----Original Message-----
>> From: Todd Lipcon [mailto:t...@cloudera.com]
>> Sent: Tuesday, April 28, 2009 2:29 PM
>> To: core-user
>> Subject: Re: Hadoop / MySQL
>>
>> Warning: derailing a bit into MySQL discussion below, but I think enough
>> people have similar use cases that it's worth discussing this even though
>> it's gotten off-topic.
>>
>> 2009/4/28 tim robertson <timrobertson...@gmail.com>
>>
>>>
>>> So we ended up with 2 DBs
>>> - DB1 we insert to, prepare and do batch processing
>>> - DB2 serving the read only web app
>>>
>>
>> This is a pretty reasonable and common architecture. Depending on your
>> specific setup, instead of flip-flopping between DB1 and DB2, you could
>> actually pull snapshots of MyISAM tables off DB1 and load them onto other
>> machines. As long as you've flushed the tables with a read lock, MyISAM
>> tables are transferrable between machines (eg via rsync). Obviously this can
>> get a bit hairy, but it's a nice trick to consider for this kind of
>> workflow.
>>
>>
>>> Why did we end up with this?  Because of locking on writes that kill
>>> reads as you say... basically you can't insert when a read is
>>> happening on myisam as it locks the whole table.
>>
>>
>> This is only true if you have binary logging enabled. Otherwise, myisam
>> supports concurrent inserts with reads. That said, binary logging is
>> necessary if you have any slaves. If you're loading bulk data from the
>> result of a mapreduce job, you might be better off not using replication and
>> simply loading the bulk data to each of the serving replicas individually.
>> Turning off the binary logging will also double your write speed (LOAD DATA
>> writes the entirety of the data to the binary log as well as to the table)
>>
>>
>>>  InnoDB has row level
>>> locking to get around this but in our experience (at the time we had
>>> 130million records) it just didn't work either.
>>
>>
>> You're quite likely to be hitting the InnoDB autoincrement lock if you have
>> an autoincrement primary key here. There are fixes for this in MySQL 5.1.
>> The best solution is to avoid autoincrement primary keys and use LOAD DATA
>> for these kind of bulk loads, as others have suggested.
>>
>>
>>>  We spent €10,000 for
>>> the supposed "european expert" on mysql from their professional
>>> services and were unfortunately very disappointed.  Seems such large
>>> tables are just problematic with mysql.  We are now very much looking
>>> into Lucene technologies for search and Hadoop for reporting and
>>> datamining type operations. SOLR does a lot of what our DB does for
>>> us.
>>>
>>
>> Yep - oftentimes MySQL is not the correct solution, but other times it can
>> be just what you need. If you already have competencies with MySQL and a
>> good access layer from your serving tier, it's often easier to stick with
>> MySQL than add a new technology into the mix.
>>
>>
>>>
>>> So with myisam... here is what we learnt:
>>>
>>> Only very latest mysql versions (beta still I think) support more than
>>> 4G memory for indexes (you really really need the index in memory, and
>>> where possible the FK for joins in the index too).
>>
>>
>> As far as I know, any 64-bit mysql instance will use more than 4G without
>> trouble.
>>
>>
>>>  Mysql has
>>> differing join strategies between innoDB and myisam, so be aware.
>>
>>
>> I don't think this is true. Joining happens at the MySQL execution layer,
>> which is above the storage engine API. The same join strategies are
>> available for both. For a particular query, InnoDB and MyISAM tables may end
>> up providing a different query plan based on the statistics that are
>> collected, but given properly analyzed tables, the strategies will be the
>> same. This is how MySQL allows inter-storage-engine joins. If one engine is
>> providing a better query plan, you can use query hints to enforce that plan
>> (see STRAIGHT_JOIN and FORCE INDEX for example)
>>
>>
>>> An undocumented feature of myisam is you can create memory buffers for
>>> single indexes:
>>> In the my.cnf:
>>>     taxon_concept_cache.key_buffer_size=3990M    -- for some reason
>>> you have to drop a little under 4G
>>>
>>> then in the DB run:
>>>    cache index taxon_concept in taxon_concept_cache;
>>>    load index into cache taxon_concept;
>>>
>>> This allows for making sure an index gets into memory for sure.
>>>
>>
>> But for most use cases and a properly configured machine you're better off
>> letting it use its own caching policies to keep hot indexes in RAM.
>>
>>
>>>
>>> And here is our main cnf changes:
>>>
>>> sort_buffer_size = 128M
>>
>>
>> This is likely to be causing huge slowdowns on ORDER BYs. See
>> http://www.mysqlperformanceblog.com/2007/08/18/how-fast-can-you-sort-data-with-mysql/for
>> example. Try setting to 128K instead.
>>
>>
>>>
>>> read_buffer_size = 64M
>>> read_rnd_buffer_size = 300M
>>
>>
>> read_rnd_buffer_size seems too high to me.
>>
>>
>> Hope this is helpful for anyone else using MySQL as a sink for MapReduce
>> output.
>>
>> -Todd
>>
>>
>>
>>>
>>> On Tue, Apr 28, 2009 at 2:49 PM, Wang Zhong <wangzhong....@gmail.com>
>>> wrote:
>>> > Hi,
>>> >
>>> > That's really cool. It seems that Hadoop could work with SQL DBs like
>>> > Mysql with bearable time. I thought when inserting data to Mysql, the
>>> > expense of communication was always a big problem. You got a method to
>>> > reduce the expense.
>>> >
>>> > Using Distribution Databases like HBase is another good choice. It is
>>> > based on Hadoop and HDFS, so there is no problem with communication
>>> > expense.
>>> >
>>> >
>>> >
>>> > On Tue, Apr 28, 2009 at 6:46 PM, Ankur Goel <ankur.g...@corp.aol.com>
>>> wrote:
>>> >>
>>> >> hello hadoop users,
>>> >> Recently I had a chance to lead a team building a log-processing system
>>> that uses Hadoop and MySQL. The system's goal was to process the incoming
>>> information as quickly as possible (real time or near real time), and make
>>> it available for querying in MySQL. I thought it would be good to share the
>>> experience and the challenges with the community. Couldn't think of a better
>>> place than these mailing lists as I am not much of a blogger :-)
>>> >>
>>> >> The information flow in the system looks something like
>>> >>
>>> >> [Apache-Servers] -> [Hadoop] -> [MySQL-shards] -> [Query-Tools]
>>> >>
>>> >> Transferring from Apache-Servers to Hadoop was quite easy as we just had
>>> to organize the data in timely buckets (directories). Once that was running
>>> smooth we had to make sure that map-reduce jobs are fired at regular
>>> intervals and they pick up the right data. The jobs would then
>>> process/aggregate the date and dump the info into MySQL shards from the
>>> reducers [we have our own DB partioning set up]. This is where we hit major
>>> bottlenecks [any surprises? :-)]
>>> >>
>>> >> The table engine used was InnoDB as there was a need for fast
>>> replication and writes but only moderate reads (should eventually support
>>> high read rates). The data would take up quite a while to load completely
>>> far away from being near-real time. And so our optimization journey begin.
>>> >>
>>> >> 1. We tried to optimize/tune InnoDB parameters like increasing the
>>> buffer pool size to 75 % of available RAM. This helped but only till the
>>> time DBs were lightly loaded i.e. innoDB had sufficient buffer pool to host
>>> the data and indexes.
>>> >>
>>> >> 2. We also realized that InnoDB has considerable locking overhead
>>> because of which write concurrency is really bad when you have a large
>>> number of concurrent threads doing writes. The default thread concurrency
>>> for us was set to no_of_cpu * 2 = 8 which is what the official documentation
>>> advises as the optimal limit. So we limited the number of reduce tasks and
>>> consequently the number of concurrent writes and boy the performance
>>> improved 4x. We were almost there :-)
>>> >>
>>> >> 3. Next thing we tried is the standard DB optimzation techniques like
>>> de-normalizing the schema and dropping constraints. This gave only a minor
>>> performance improvement, nothing earth shattering. Note that we were already
>>> caching connections in reducers to each MySQL shard and partionining logic
>>> was embedded into reducers.
>>> >>
>>> >> 4. Falling still short of our performance objectives, we finally we
>>> decided to get rid of JDBC writes from reducers and work on an alternative
>>> that uses MySQLs LOAD utility.
>>> >> - The processing would partition the data into MySQL shard specific
>>> files resident in HDFS.
>>> >> - A script would then spawn processes via ssh on different physical
>>> machines to download this data.
>>> >> - Each spawned process just downloads the data for the shard it should
>>> upload to.
>>> >> - All the processes then start uploading data in parallel into their
>>> respective MySQL shards using LOAD DATA infile.
>>> >>
>>> >> This proved to be the fastest approach, even in the wake of increasing
>>> data loads. The enitre processing/loading would complete in less than 6 min.
>>> The system has been holding up quite well so far, even though we've had to
>>> limit the number of days for which we keep the data or else the MySQLs get
>>> overwhelmed.
>>> >>
>>> >> Hope this is helpful to people.
>>> >>
>>> >> Regards
>>> >> -Ankur
>>> >>
>>> >
>>> >
>>> >
>>> > --
>>> > Wang Zhong
>>> >
>>>
>>
>>
>>
>

Here is a thought that just blew my mind. mySQL CSV storage engine is
essentially one of the functions of hadoop-hive. Imagine a type of
"auto replication" that could store the same data in mysql and in
hadoop-hive. You would be able to support short term mysql queries and
long running Hive queries over the same dataset...

Reply via email to