Re: Hadoop/Hive observations

Daniel Einspanjer Tue, 14 Sep 2010 07:28:48 -0700

 Paul,

Like Ning, I doubt that Hive will be very suitable for the low-latency response times you are looking for here. HBase might work for you depending on what types of data access you need. For instance, the count query won't work well directly. Rather, you'd have to have application code that uses a counter to keep track of the row or distinct column counts.
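
A rough sketch of that counter idea, in Python with an in-memory dict standing in for the table. The class and method names here are hypothetical and are not an HBase API; the point is only that the counts are updated at write time, so reading a count never scans the data.

```python
from collections import defaultdict


class CountedTable:
    """Toy stand-in for a key-value table that maintains its counts on write."""

    def __init__(self):
        self._rows = {}                          # row key -> {column: value}
        self._row_count = 0                      # number of distinct row keys
        self._column_counts = defaultdict(int)   # column -> rows holding it

    def put(self, row_key, column, value):
        """Store a value, bumping the counters only for new rows/columns."""
        if row_key not in self._rows:
            self._rows[row_key] = {}
            self._row_count += 1
        row = self._rows[row_key]
        if column not in row:
            self._column_counts[column] += 1
        row[column] = value

    def row_count(self):
        """Constant-time replacement for a full-table count."""
        return self._row_count

    def column_count(self, column):
        """Constant-time count of rows that hold a value for `column`."""
        return self._column_counts[column]


table = CountedTable()
table.put("r1", "blah_id", 1)
table.put("r2", "blah_id", 2)
table.put("r2", "blah_id", 3)   # overwrite: counters do not change
print(table.row_count(), table.column_count("blah_id"))
```

In a real store the counter itself would have to be updated atomically alongside the write, and the "is this row new?" check costs a read, so this trades extra work per write for cheap counts.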

I have to believe that Hive is going to be a useful tool for you guys to have for some of the very large data sets that you work with, such as experiment observations, but I suspect that you'd probably do much better with a column store database in this particular case. This class of database engine can store many billions of records in a table and still provide sub-second response times for SQL queries. LucidDB, InfiniDB, Infobright, and Vertica are examples of this type of DB engine.

I have experience and familiarity with these engines, and given my love of NASA, I would be happy to chat with you about your needs and what might be suitable, with no consulting strings attached. :)


Daniel Einspanjer
Metrics Architect
Mozilla Corporation

On 9/13/10 10:41 PM, Paul Zimdars wrote:

 Hi All,
We have been using Hadoop (0.20.2+320) and Hive (0.5.0+20) for about a month now to see if we could migrate our existing MySQL DB into a Hadoop/Hive architecture (hadoop/hive rock BTW! :) ). We unfortunately are experiencing slow response times while doing simple tasks such as a count DB query (e.g. hive> select count(blah_id) from blah;). We currently have 2.5B Data Points residing in a single table and hive will take approximately 5-6 minutes to do a count of these 2.5B records (15-17 minutes for 6.8B records). The reduce portion is fast (single reduce since this is a count * query) but the map stage takes the remainder of the time (~95%).

We currently have 6 (4 x quad core) systems with approximately 24GB of RAM each. We have attempted to add more nodes, increase map tasktrackers (many different #s), change DFS block size (32M, 64M, 128M, 256M), use LZO compression, and change many, many other configuration variables (io.sort.factor, io.sort.mb) without much success in lowering the time it takes to complete the count (I do notice a high IO wait on the nodes, no matter how many tasktrackers I run). The size of the DB is approximately 200GB and with MySQL it takes a few seconds to do both the 2.5B and 6.7B count (I am curious if running this locally without any nodes would result in a quicker response time since the delay appears to be in the mapping stage...).

I have come to believe (and read) that hadoop/hive is unfortunately not well suited for this type of work and instead is suited for larger data sets. I am curious if anyone has any ideas on A) improving performance and/or B) similar experiences? I am also curious if maybe something like HBase would be better suited for this type of data (small dataset, many files). We appreciate any input, suggestions, or ideas!
Thank you!!
Paul Zimdars
