Paul,

Like Ning, I doubt that Hive will be very suitable for the low latency response times you are looking for here. HBase might work for you depending on what types of data access you need. For instance, the count query won't work well directly. Rather, you'd have to have application code that uses a counter to keep track of the row or distinct column counts.

I have to believe that Hive is going to be a useful tool for you guys to have for some of the very large data sets that you work with such as experiment observations, but I suspect that you'd probably do much better with a column store database in this particular case. This class of database engine can store many billions of records in a table and still provide sub-second response times for SQL queries. LucidDB, InfiniDB, Infobright, and Vertica are examples of this type of DB engine.

I have experience and familiarity with these engines, and given my love of Nasa, I would be happy to chat with you about your needs and what might be suitable with no consulting strings attached. :)

Daniel Einspanjer
Metrics Architect
Mozilla Corporation

On 9/13/10 10:41 PM, Paul Zimdars wrote:
 Hi All,

We have been using Hadoop (0.20.2+320) and Hive (0.5.0+20) for about a month now to see if we could migrate our existing MySQL DB into a Hadoop/Hive architecture (hadoop/hive rock BTW! :) ). We unfortunately are experiencing slow response times while doing simple tasks such as a count DB query (e.g. hive> select count(blah_id) from blah;). We currently have 2.5B Data Points residing in a single table and hive will take approximately 5-6 minutes to do a count of these 2.5B records (15-17 minutes for 6.8B records). The reduce portion is fast (single reduce since this is a count * query) but the map stage takes the remainder of the time (~95%). We currently have 6 (4 x quad core) systems with approximately 24GB of ram each. We have attempted to add more nodes, increase map tasktrackers (many different #s), change DFS block size (32M, 64M, 128MB, 256M), LZO compression, and many, many other configuration variables (io.sort.factor,io.sort.mb) without much success in lowering the time it takes to complete the count (I do notice a high IO wait on the nodes..no matter how many tasktrackers I run). The size of the DB is approximately ~200GB and with MySQL it takes a few seconds to do both the 2.5B and 6.7B count (I am curious if running this locally without any nodes would result in a quicker response time since the delay appears to be in the mapping stage...). I have come to believe (and read) that hadoop/hive is unfortunately not well suited for this type of work and instead is suited for larger data sets. I am curious if anyone has any ideas on A) improving performance and/or B) similar experiences? I am also curious if maybe something like HBase would be better suited for this type of data (small dataset, many files). We appreciate any input, suggestions, or ideas!.

Thank you!!
Paul Zimdars

Reply via email to