After sitting through some HDFS/HBase presentations yesterday, I
started thinking that the Hive model of doing its map/reduce over raw
files from HDFS is great, but a dedicated caching/region server could
be a big benefit for answering real-time queries.

I calculated that one data center (not counting non-cacheable content)
could have about 378MB of logs a day, going from Facebook's
information here:
http://www.facebook.com/note.php?note_id=110207012002

"The log files are named with the date and time of collection.
Individual hourly files are around 55 MB when compressed, so eight
months of compressed data takes up about 300 GB of space."
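As a rough sanity check on those numbers, assuming the ~55 MB hourly figure from the note, a quick back-of-envelope calculation (the 30-day month is my simplification) lands close to the quoted 300 GB:

```java
public class LogVolume {
    public static void main(String[] args) {
        // Assumption from the Facebook note: one compressed hourly file is ~55 MB
        int hourlyMb = 55;
        int dailyMb = hourlyMb * 24;                  // MB of compressed logs per day
        int eightMonthsGb = dailyMb * 30 * 8 / 1024;  // ~309 GB, near the quoted 300 GB
        System.out.println(dailyMb + " MB/day, ~" + eightMonthsGb + " GB over eight months");
    }
}
```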

During the day and week the logs are collected, one would expect the
data to be used very often, so having it cached would be ideal.

Given that an average DataNode might have 8 GB or 16 GB of RAM, one GB
could be sliced off as a dedicated HiveRegion server, or it could run
on several dedicated servers with maybe RAM and nothing else.

A HiveRegion server would/could contain HiveTables in a compressed
format, maybe Hive tables in a Derby format, the indexes we are
creating, and some information about usage so that different caching
algorithms could evict sections. We could use ZooKeeper to manage the
HiveRegions, like HBase does.
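For the eviction side, one of the simplest usage-tracking policies would be LRU over cached table sections. A minimal sketch, assuming a HiveRegion server caches compressed sections keyed by name (the class and names here are hypothetical, not anything existing in Hive):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: an LRU cache of table sections, one possible eviction policy
// a HiveRegion server could use. Capacity is counted in entries here;
// a real server would budget by bytes instead.
public class RegionCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public RegionCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true gives LRU ordering
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after every put(); evicts the least recently used section
        return size() > maxEntries;
    }
}
```

Swapping in a different `removeEldestEntry` predicate (or a different structure entirely) is where the "different caching algorithms" would plug in.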

The Hive query optimizer would check whether the data was in the
HiveRegionServer, and otherwise run the query as normal.
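That optimizer decision could be as simple as a lookup with a fall-through to the normal HDFS scan. A hypothetical sketch (none of these class or method names exist in Hive; the string result stands in for a real scan):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical planner check: serve from the region cache if the data
// is resident, otherwise fall back to a normal map/reduce scan and
// warm the cache for next time.
public class PlannerSketch {
    private final Map<String, String> regionCache = new HashMap<>();

    public String answer(String tableSection) {
        String cached = regionCache.get(tableSection);
        if (cached != null) {
            return cached;                 // answered by the HiveRegion server
        }
        return scanFromHdfs(tableSection); // run as normal
    }

    private String scanFromHdfs(String tableSection) {
        String result = "scanned:" + tableSection;
        regionCache.put(tableSection, result); // populate the cache
        return result;
    }
}
```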

Has anyone ever thought of this?
Edward
