+1 Making input data and query results available in a short delay is definitely a very attractive feature for Hive. There are multiple approaches to achieve this, mainly depending on how much we leverage HBase.
The simplest way to go is to probably have a good Hive/HBase integration like HIVE-705, HIVE-806 etc. This can help us leverage the efforts done by HBase to the maximum degree. The potential drawback is that HBase tables have support for random writes which may cause additional overhead for simple sequential writes. Eventually we may (or may not) need our own HiveRegionServer which hosts data in any format supported by Hive (on top of just the internal file format supported by HBase), but I feel it might be a good start to first try integrate the two. Zheng On Sun, Oct 4, 2009 at 8:42 AM, Edward Capriolo <[email protected]>wrote: > After sitting though some HDFS/BHase presentations yesterday, I > started thinking. that the hive model or doing its map/reduce over raw > files from HDFS is great, but a dedicated caching/region server could > be a big benefit in answering real time queries. > > I calculated that one data center (not counting non-cachable content) > could have about 378MB of logs a day. Going from facebooks information > here: > http://www.facebook.com/note.php?note_id=110207012002 > > "The log files are named with the date and time of collection. > Individual hourly files are around 55 MB when compressed, so eight > months of compressed data takes up about 300 GB of space." > > During the day and week the logs are collected one would expect the > data to be used very often. So having this in a cached would be ideal. > > Given that an average DataNode might have 8 GB or 16 GB of RAM, one GB > could be sliced off and as a dedicated HiveRegion server, or it can > run as several dedicated servers. With maybe RAM and nothing else. > > A Hive Region Server would/could contain HiveTables in a compressed > format, maybe hive tables in a derby format, indexes we are creating, > and some information about the usage so different caching algorithms > could evict sections. We could use ZooKeeper to manage the HiveRegions > like in HBase does. > > Hive query optimizer would look to see if the in the data was in the > HiveRegionServer or run as normal. > > Has anyone ever thought of this? > Edward > -- Yours, Zheng
