After sitting through some HDFS/HBase presentations yesterday, I started thinking that the Hive model of doing its map/reduce over raw files from HDFS is great, but a dedicated caching/region server could be a big benefit in answering real-time queries.
I calculated that one data center (not counting non-cacheable content) could have about 378 MB of logs a day, going from Facebook's information here: http://www.facebook.com/note.php?note_id=110207012002 "The log files are named with the date and time of collection. Individual hourly files are around 55 MB when compressed, so eight months of compressed data takes up about 300 GB of space."

During the day and week the logs are collected, one would expect the data to be used very often, so having it cached would be ideal. Given that an average DataNode might have 8 GB or 16 GB of RAM, one GB could be sliced off for a dedicated HiveRegion server, or it could run as several dedicated servers, with maybe RAM and nothing else.

A HiveRegion server would/could contain Hive tables in a compressed format, maybe Hive tables in a Derby format, indexes we are creating, and some information about usage so different caching algorithms could evict sections. We could use ZooKeeper to manage the HiveRegions like HBase does. The Hive query optimizer would look to see if the data was in the HiveRegionServer, or otherwise run as normal.

Has anyone ever thought of this?

Edward
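To make the optimizer part of the idea concrete, here is a minimal sketch of the lookup-then-fallback decision. Everything here is hypothetical (the class, method names, partitions, and hosts are mine, not anything that exists in Hive or HBase); in the real proposal the directory lookup would go through ZooKeeper the way HBase locates regions, but a plain in-memory map stands in for that below.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a HiveRegion directory: maps table partitions
// to the cache server currently holding them. A HashMap stands in for
// a ZooKeeper-backed lookup here.
public class HiveRegionDirectory {
    private final Map<String, String> regionByPartition = new HashMap<>();

    // Record that a partition's data is held by a given HiveRegion server.
    public void register(String partition, String serverHost) {
        regionByPartition.put(partition, serverHost);
    }

    // The optimizer's decision: read from the cache server if the
    // partition is registered, otherwise fall back to the normal
    // map/reduce scan over raw HDFS files.
    public String planScan(String partition) {
        String server = regionByPartition.get(partition);
        return (server != null)
                ? "CACHED_READ from " + server
                : "MAPREDUCE_SCAN over HDFS";
    }

    public static void main(String[] args) {
        HiveRegionDirectory dir = new HiveRegionDirectory();
        dir.register("logs/dt=2009-06-01", "regionserver-01:8090");
        System.out.println(dir.planScan("logs/dt=2009-06-01"));
        System.out.println(dir.planScan("logs/dt=2009-06-02"));
    }
}
```

The usage-tracking and eviction pieces would hang off the same directory: each registered entry could carry access counts so the cache algorithm knows which sections to drop.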
