I remember reading an article last year about something similar done by Rackspace.
They went through various iterations of their logging system and encountered
similar scaling issues with MySQL. In the end they started using Hadoop, Lucene
and Solr:
"
How Rackspace Now Uses MapReduce and Hadoop to Query Terabytes of Data
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
"
On 28.04.2009, at 15:29, tim robertson wrote:
Sorry, that was not meant to be sent to the list...
On Tue, Apr 28, 2009 at 3:27 PM, tim robertson
<timrobertson...@gmail.com> wrote:
Hi,
[Ankur]: How can we make sure this happens?
-- "show processlist" is how we spot it... literally it takes hours on
our setup, so it's easy to find.
So we ended up with 2 DBs:
- DB1: the one we insert into, prepare and run batch processing on
- DB2: serving the read-only web app
Periodically we dump DB1, point the read-only web app at DB1, load the
dump into DB2, and then DB2 becomes the write DB... an endless cycle.
We can afford data latency in our system though.
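The swap cycle could be sketched like this (a minimal illustration only; the
database names, the dump flags, and the "repoint-webapp" step are placeholders
I made up, not our actual scripts):

```python
# Sketch of the two-database swap cycle: dump the write DB, repoint
# readers at it, reload the other DB, which then becomes the write DB.
# All names and commands here are hypothetical placeholders.

def swap_cycle_commands(write_db, read_db, dump_file="/tmp/db.dump.sql"):
    """Return the shell commands for one iteration of the swap."""
    return [
        # 1. Dump the DB we have been inserting into and batch-processing
        "mysqldump --quick %s > %s" % (write_db, dump_file),
        # 2. Point the read-only web app at the freshly prepared DB
        "repoint-webapp --read-db %s" % write_db,  # hypothetical helper
        # 3. Load the dump into the other DB; it becomes the new write DB
        "mysql %s < %s" % (read_db, dump_file),
    ]

for cmd in swap_cycle_commands("db1", "db2"):
    print(cmd)
```

On the next iteration the roles of the two databases are simply exchanged.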
Why did we end up with this? Because of locking on writes that kills
reads, as you say... basically you can't insert while a read is
happening on MyISAM, as it locks the whole table. InnoDB has row-level
locking to get around this, but in our experience (at the time we had
130 million records) it just didn't work either. We spent €10,000 on
the supposed "European expert" on MySQL from their professional
services and were unfortunately very disappointed. It seems such large
tables are just problematic with MySQL. We are now very much looking
into Lucene technologies for search and Hadoop for reporting and
data-mining type operations. Solr does a lot of what our DB does for
us.
So with MyISAM... here is what we learnt:
Only the very latest MySQL versions (still in beta, I think) support
more than 4G of memory for indexes (you really, really need the index
in memory, and where possible the FKs for joins in the index too).
MySQL has differing join strategies between InnoDB and MyISAM, so be
aware.
An undocumented feature of MyISAM is that you can create memory
buffers for single indexes.
In the my.cnf:
  taxon_concept_cache.key_buffer_size = 3990M  -- for some reason you
  have to stay a little under 4G
Then in the DB run:
  cache index taxon_concept in taxon_concept_cache;
  load index into cache taxon_concept;
This makes sure a given index really does end up in memory.
And here are our main cnf changes:
table_cache = 512
sort_buffer_size = 128M
read_buffer_size = 64M
read_rnd_buffer_size = 300M
thread_cache_size = 8
query_cache_type = 0
query_cache_size = 0
thread_concurrency = 8
# these are used for repair by sorting, which is far faster than the keycache
myisam_max_sort_file_size = 200G
myisam_max_extra_sort_file_size = 200G
# this is also used for repair by sorting and is limited to 4G
myisam_sort_buffer_size = 3990M
# maximum value for this is 4G for now, but see the note above on
# per-index key buffers for large indexes
key_buffer_size = 3990M
tmp_table_size = 512M
max_heap_table_size = 512M
I guess just archive this mail off in case you ever do try MyISAM for
something... these params are the result of a lot of trial and error
and days of waiting!
Our DB has about 44G of indexes, which we can keep in memory, but we
are hitting limits - hence we're on the Hadoop train.
Cheers
Tim
On Tue, Apr 28, 2009 at 2:49 PM, Wang Zhong
<wangzhong....@gmail.com> wrote:
Hi,
That's really cool. It seems that Hadoop can work with SQL DBs like
MySQL in bearable time. I had thought that when inserting data into
MySQL, the cost of communication was always a big problem. You have
found a method to reduce that cost.
Using a distributed database like HBase is another good choice. It is
based on Hadoop and HDFS, so there is no problem with communication
cost.
On Tue, Apr 28, 2009 at 6:46 PM, Ankur Goel
<ankur.g...@corp.aol.com> wrote:
hello hadoop users,
Recently I had a chance to lead a team building a log-processing
system that uses Hadoop and MySQL. The system's goal was to
process the incoming information as quickly as possible (real
time or near real time), and make it available for querying in
MySQL. I thought it would be good to share the experience and the
challenges with the community. Couldn't think of a better place
than these mailing lists as I am not much of a blogger :-)
The information flow in the system looks something like
[Apache-Servers] -> [Hadoop] -> [MySQL-shards] -> [Query-Tools]
Transferring from Apache-Servers to Hadoop was quite easy, as we
just had to organize the data in timely buckets (directories).
Once that was running smoothly, we had to make sure that map-reduce
jobs were fired at regular intervals and picked up the right data.
The jobs would then process/aggregate the data and dump the info
into MySQL shards from the reducers [we have our own DB
partitioning set up]. This is where we hit major bottlenecks [any
surprises? :-)]
The table engine used was InnoDB, as there was a need for fast
replication and writes but only moderate reads (it should eventually
support high read rates). The data would take quite a while to load
completely - far from being near real time. And so our optimization
journey began.
1. We tried to optimize/tune InnoDB parameters, like increasing
the buffer pool size to 75% of available RAM. This helped, but
only as long as the DBs were lightly loaded, i.e. InnoDB had a
sufficient buffer pool to hold the data and indexes.
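For concreteness, the setting in question is innodb_buffer_pool_size; on a
hypothetical machine with 16G of RAM, the 75% rule of thumb would look like:

```ini
# my.cnf -- hypothetical 16G machine, ~75% of RAM to the buffer pool
innodb_buffer_pool_size = 12G
```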
2. We also realized that InnoDB has considerable locking overhead,
because of which write concurrency is really bad when you have a
large number of concurrent threads doing writes. The default
thread concurrency for us was set to no_of_cpu * 2 = 8, which is
what the official documentation advises as the optimal limit. So
we limited the number of reduce tasks, and consequently the number
of concurrent writes, and boy, did performance improve - 4x. We were
almost there :-)
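For reference, the two knobs involved look roughly like this (the values shown
are just illustrative of our setup, not universal recommendations). The InnoDB
side is capped in my.cnf:

```ini
# my.cnf -- cap concurrent threads inside the InnoDB kernel
# (no_of_cpu * 2 = 8 in our case)
innodb_thread_concurrency = 8
```

while the write fan-in from Hadoop is capped by limiting the reducer count on
the job (e.g. via mapred.reduce.tasks in the job configuration), since each
reducer holds its own write connections to the shards.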
3. The next thing we tried was standard DB optimization techniques,
like de-normalizing the schema and dropping constraints. This
gave only a minor performance improvement, nothing earth-
shattering. Note that we were already caching connections to each
MySQL shard in the reducers, and the partitioning logic was
embedded in the reducers.
4. Still falling short of our performance objectives, we finally
decided to get rid of JDBC writes from the reducers and work on an
alternative that uses MySQL's LOAD utility.
- The processing would partition the data into MySQL-shard-specific
files resident in HDFS.
- A script would then spawn processes via ssh on different
physical machines to download this data.
- Each spawned process downloads only the data for the shard it
should upload to.
- All the processes then start uploading data in parallel into
their respective MySQL shards using LOAD DATA INFILE.
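The four steps above could be sketched as a per-shard plan like this (a rough
illustration only; the HDFS paths, file naming, and table name are assumptions
on my part, not the actual scripts):

```python
# Sketch of the HDFS-to-shard bulk load: for each shard host, one
# command to fetch that shard's file from HDFS and one LOAD DATA
# statement to bulk-load it. In practice each entry would be run on
# its host via ssh, so all shards load in parallel.
# Paths, host names, and the table name are hypothetical.

def shard_load_plan(shard_hosts, hdfs_dir="/output/shards", table="logs"):
    """Return one {host, fetch, load} entry per shard."""
    plan = []
    for i, host in enumerate(shard_hosts):
        local_file = "/tmp/shard-%d.tsv" % i
        plan.append({
            "host": host,
            # download only this shard's slice of the reducer output
            "fetch": "hadoop fs -get %s/shard-%d.tsv %s"
                     % (hdfs_dir, i, local_file),
            # bulk-load it, bypassing row-at-a-time JDBC inserts
            "load": "LOAD DATA INFILE '%s' INTO TABLE %s"
                    % (local_file, table),
        })
    return plan

for step in shard_load_plan(["shard0.example.com", "shard1.example.com"]):
    print(step["host"], "->", step["load"])
```

The win comes from LOAD DATA INFILE doing one bulk append per shard instead
of many small transactional inserts from the reducers.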
This proved to be the fastest approach, even in the face of
increasing data loads. The entire processing/loading would
complete in less than 6 min. The system has been holding up quite
well so far, even though we've had to limit the number of days
for which we keep the data, or else the MySQL shards get overwhelmed.
Hope this is helpful to people.
Regards
-Ankur
--
Wang Zhong