+1

Making input data and query results available with a short delay is definitely
a very attractive feature for Hive.
There are multiple approaches to achieve this, differing mainly in how much
we leverage HBase.

The simplest way to go is probably to have a good Hive/HBase integration,
like HIVE-705, HIVE-806, etc.
This lets us leverage the work done by HBase to the maximum degree.
The potential drawback is that HBase tables support random writes, which
may add overhead for simple sequential writes.

Eventually we may (or may not) need our own HiveRegionServer which hosts
data in any format supported by Hive (beyond just the internal file
format supported by HBase), but I feel it might be a good start to first try
to integrate the two.
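
To make the idea concrete, here is a minimal sketch of the "serve from the
region server if the data is there, otherwise run as normal" behaviour
proposed in the quoted message. Everything in it is an assumption for
illustration only: RegionCache, read_partition and scan_hdfs are
hypothetical names, not Hive or HBase APIs, and LRU is just one of the
caching algorithms that could be used for eviction.

```python
from collections import OrderedDict


class RegionCache:
    """Hypothetical in-memory cache of table partitions with LRU eviction."""

    def __init__(self, max_partitions):
        self.max_partitions = max_partitions
        self._parts = OrderedDict()  # (table, partition) -> rows

    def get(self, table, partition):
        key = (table, partition)
        if key not in self._parts:
            return None
        self._parts.move_to_end(key)  # mark as most recently used
        return self._parts[key]

    def put(self, table, partition, rows):
        key = (table, partition)
        self._parts[key] = rows
        self._parts.move_to_end(key)
        while len(self._parts) > self.max_partitions:
            self._parts.popitem(last=False)  # evict least recently used


def read_partition(cache, table, partition, scan_hdfs):
    """Serve from the cache when possible, else fall back to the normal path.

    scan_hdfs is a stand-in for the usual map/reduce scan over raw files.
    Returns the rows plus a label saying which path answered.
    """
    rows = cache.get(table, partition)
    if rows is not None:
        return rows, "cache"
    rows = scan_hdfs(table, partition)
    cache.put(table, partition, rows)  # keep it warm for the next query
    return rows, "hdfs"
```

A real version would have to size the cache in bytes rather than in
partitions and deal with invalidation when new data is loaded, which this
sketch ignores.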

Zheng

On Sun, Oct 4, 2009 at 8:42 AM, Edward Capriolo <[email protected]> wrote:

> After sitting through some HDFS/HBase presentations yesterday, I
> started thinking that the Hive model of doing its map/reduce over raw
> files from HDFS is great, but a dedicated caching/region server could
> be a big benefit in answering real-time queries.
>
> I calculated that one data center (not counting non-cacheable content)
> could have about 378 MB of logs a day. Going from Facebook's information
> here:
> http://www.facebook.com/note.php?note_id=110207012002
>
> "The log files are named with the date and time of collection.
> Individual hourly files are around 55 MB when compressed, so eight
> months of compressed data takes up about 300 GB of space."
>
> During the day and week the logs are collected, one would expect the
> data to be used very often. So having this in a cache would be ideal.
>
> Given that an average DataNode might have 8 GB or 16 GB of RAM, one GB
> could be sliced off as a dedicated HiveRegionServer, or it can
> run as several dedicated servers, maybe with RAM and nothing else.
>
> A HiveRegionServer would/could contain Hive tables in a compressed
> format, maybe Hive tables in a Derby format, indexes we are creating,
> and some information about the usage so different caching algorithms
> could evict sections. We could use ZooKeeper to manage the HiveRegions
> like HBase does.
>
> The Hive query optimizer would look to see if the data was in the
> HiveRegionServer, or else run as normal.
>
> Has anyone ever thought of this?
> Edward
>
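
As a rough sanity check on the figures quoted above, the arithmetic can be
worked through as below. The assumptions are mine: 1 GB is taken as
1024 MB, the 378 MB/day figure is taken at face value as the size held in
memory, and overhead for indexes and usage metadata is ignored.

```python
import math

MB_PER_GB = 1024  # assumption: binary gigabytes

# Figures from the quoted message.
logs_mb_per_day = 378   # one data center's daily logs
cache_slice_gb = 1      # RAM sliced off per DataNode

# How many days of logs fit in one node's 1 GB slice (about 2.7).
days_per_node = cache_slice_gb * MB_PER_GB / logs_mb_per_day

# How many 1 GB slices are needed to hold a full week of logs.
nodes_for_week = math.ceil(7 * logs_mb_per_day / (cache_slice_gb * MB_PER_GB))

# Cross-check of the quoted Facebook numbers: 55 MB hourly files, 300 GB total.
fb_mb_per_day = 55 * 24                        # 1320 MB of compressed logs a day
fb_days = 300 * MB_PER_GB / fb_mb_per_day      # about 233 days, close to 8 months

print(f"{days_per_node:.1f} days per node, {nodes_for_week} nodes for a week")
print(f"{fb_mb_per_day} MB/day, {fb_days:.0f} days in 300 GB")
```

So at that volume a week of one data center's logs would fit in the 1 GB
slices of about three DataNodes, before any replication.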



-- 
Yours,
Zheng
