This might help as well. http://issues.apache.org/jira/browse/HADOOP-288
On Tue, Oct 6, 2009 at 4:03 AM, Zheng Shao <[email protected]> wrote:
> +1. Since blocks are read-only, the caching logic should be pretty simple.
>
> Zheng
>
> On Mon, Oct 5, 2009 at 7:22 PM, Edward Capriolo <[email protected]>
> wrote:
>>
>> While I have not profiled this, I would think we can save on repeated
>> deserialization through reuse.
>>
>> For example, suppose we have a weblog table partitioned by hour. After
>> the information is moved into the DFS, we can assume that several
>> processes will want to use it to generate summaries, so a given block
>> may be used multiple times in the next hour. If we keep that block in
>> memory in a binary form, we will not need to deserialize it again. We
>> may also be able to read directly from memory rather than from disk,
>> thus freeing up disk resources for shuffle-sorting.
>>
>> Also suppose we have a process "select x,y,z from tablea into table
>> tableb"; another process may operate on tableb to create tablec. If we
>> wrote tableb to a cache as it was being created, it would be available
>> for tablec.
>>
>> I have been pondering some possible implementations; this could
>> probably be done at the Hadoop layer, as a CachedInputFormat or
>> CachedFileSystem. I am just thinking RAM caches would have to help.
>> Considering blocks do not change, caching should not be difficult.
>>
>> On Mon, Oct 5, 2009 at 9:32 PM, Zheng Shao <[email protected]> wrote:
>> > That's true.
>> >
>> > My thoughts are more constrained here because of resources :)
>> >
>> > What do you think is the major bottleneck for speed now? I have a
>> > vague feeling that it's the shuffling phase between map and reduce.
>> >
>> > Zheng
>> >
>> > On Mon, Oct 5, 2009 at 3:31 PM, Edward Capriolo <[email protected]>
>> > wrote:
>> >>
>> >> On Sun, Oct 4, 2009 at 7:24 PM, Zheng Shao <[email protected]> wrote:
>> >> > +1
>> >> >
>> >> > Making input data and query results available with only a short
>> >> > delay is definitely a very attractive feature for Hive.
>> >> > There are multiple approaches to achieve this, mainly depending on
>> >> > how much we leverage HBase.
>> >> >
>> >> > The simplest way to go is probably to have a good Hive/HBase
>> >> > integration like HIVE-705, HIVE-806, etc.
>> >> > This can help us leverage the efforts done by HBase to the maximum
>> >> > degree. The potential drawback is that HBase tables support random
>> >> > writes, which may cause additional overhead for simple sequential
>> >> > writes.
>> >> >
>> >> > Eventually we may (or may not) need our own HiveRegionServer which
>> >> > hosts data in any format supported by Hive (on top of just the
>> >> > internal file format supported by HBase), but I feel it might be a
>> >> > good start to first try to integrate the two.
>> >> >
>> >> > Zheng
>> >> >
>> >> > On Sun, Oct 4, 2009 at 8:42 AM, Edward Capriolo
>> >> > <[email protected]> wrote:
>> >> >>
>> >> >> After sitting through some HDFS/HBase presentations yesterday, I
>> >> >> started thinking that the Hive model of doing its map/reduce over
>> >> >> raw files from HDFS is great, but a dedicated caching/region
>> >> >> server could be a big benefit in answering real-time queries.
>> >> >>
>> >> >> I calculated that one data center (not counting non-cachable
>> >> >> content) could have about 378MB of logs a day, going from
>> >> >> Facebook's information here:
>> >> >> http://www.facebook.com/note.php?note_id=110207012002
>> >> >>
>> >> >> "The log files are named with the date and time of collection.
>> >> >> Individual hourly files are around 55 MB when compressed, so eight
>> >> >> months of compressed data takes up about 300 GB of space."
>> >> >>
>> >> >> During the day and week the logs are collected, one would expect
>> >> >> the data to be used very often, so having it in a cache would be
>> >> >> ideal.
>> >> >>
>> >> >> Given that an average DataNode might have 8 GB or 16 GB of RAM,
>> >> >> one GB could be sliced off for a dedicated HiveRegionServer, or it
>> >> >> could run as several dedicated servers with maybe RAM and nothing
>> >> >> else.
>> >> >>
>> >> >> A HiveRegionServer would/could contain Hive tables in a compressed
>> >> >> format, maybe Hive tables in a Derby format, indexes we are
>> >> >> creating, and some usage information so different caching
>> >> >> algorithms could evict sections. We could use ZooKeeper to manage
>> >> >> the HiveRegions like HBase does.
>> >> >>
>> >> >> The Hive query optimizer would look to see if the data was in a
>> >> >> HiveRegionServer, or run as normal.
>> >> >>
>> >> >> Has anyone ever thought of this?
>> >> >> Edward
>> >> >
>> >> >
>> >> > --
>> >> > Yours,
>> >> > Zheng
>> >> >
>> >>
>> >> I agree that Hive/HBase integration is a good thing, but I think the
>> >> differences between Hive and HBase are vast. Hive is row-oriented
>> >> with column support, while HBase is column-oriented. HBase works with
>> >> sparse data and needs random inserts, while Hive data is mostly Write
>> >> Once Read Many. HBase works mostly in memory with a commit log, while
>> >> Hive writes directly to HDFS during the map and reduce phases.
>> >>
>> >> The way I look at it, there is already a lot of waste. Imagine jobs
>> >> run simultaneously or right after each other on relatively small data
>> >> sets:
>> >>
>> >> select name,count(1) from people group by name;
>> >> select * from people;
>> >> select * from people where name='sarah';
>> >>
>> >> With a HiveRegionServer, sections of data might already be in memory
>> >> in a fast binary form, or on disk in an embedded db like the one used
>> >> by map-side joins. Disks would be used for intermediate results
>> >> rather than for reprocessing the same chunks of data repeatedly.
>> >>
>> >> Managing HiveRegionServers would be much less complex than managing
>> >> HBase regions that see high random reads/inserts. In its simplest
>> >> form it would just be an exact duplicate of the data; in a more
>> >> complex form, an optimized binary form of the data.
>> >
>> >
>> > --
>> > Yours,
>> > Zheng
>> >
>
>
>
> --
> Yours,
> Zheng
>
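
To make the CachedInputFormat idea from the thread concrete, here is a
minimal sketch against the old org.apache.hadoop.mapred API. It is
hypothetical rather than an existing Hadoop or Hive class (the 64-split
cache bound, the record cloning, and the name itself are all assumptions),
but it rests on the property the thread keeps returning to: HDFS data is
write-once, so a (path, offset, length) key can never go stale, and the
cache only ever needs eviction, never invalidation.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.io.WritableUtils;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.util.ReflectionUtils;

    public class CachedInputFormat<K extends Writable, V extends Writable>
        implements InputFormat<K, V> {

      // Arbitrary bound for this sketch; a real version would account
      // for bytes rather than split count.
      private static final int MAX_CACHED_SPLITS = 64;

      // Access-ordered LinkedHashMap gives least-recently-used eviction.
      private static final Map<String, List<Writable[]>> CACHE =
          new LinkedHashMap<String, List<Writable[]>>(16, 0.75f, true) {
            protected boolean removeEldestEntry(
                Map.Entry<String, List<Writable[]>> eldest) {
              return size() > MAX_CACHED_SPLITS;
            }
          };

      private final InputFormat<K, V> delegate;

      public CachedInputFormat(InputFormat<K, V> delegate) {
        this.delegate = delegate;
      }

      public InputSplit[] getSplits(JobConf job, int numSplits)
          throws IOException {
        return delegate.getSplits(job, numSplits);
      }

      public RecordReader<K, V> getRecordReader(InputSplit split,
          final JobConf job, Reporter reporter) throws IOException {
        // Blocks are read-only, so (path, start, length) identifies the
        // bytes forever: entries never go stale, they only get evicted.
        FileSplit fileSplit = (FileSplit) split;
        String key = fileSplit.getPath() + ":" + fileSplit.getStart()
            + "+" + fileSplit.getLength();

        final RecordReader<K, V> real =
            delegate.getRecordReader(split, job, reporter);
        List<Writable[]> rows;
        synchronized (CACHE) {
          rows = CACHE.get(key);
        }
        if (rows == null) {
          // Cache miss: materialize the split once, cloning each record
          // because RecordReaders reuse their key/value objects.
          rows = new ArrayList<Writable[]>();
          K k = real.createKey();
          V v = real.createValue();
          while (real.next(k, v)) {
            rows.add(new Writable[] { WritableUtils.clone(k, job),
                                      WritableUtils.clone(v, job) });
          }
          synchronized (CACHE) {
            CACHE.put(key, rows);
          }
        }
        final List<Writable[]> cached = rows;

        // Replay the cached records from RAM instead of re-reading disk.
        return new RecordReader<K, V>() {
          private int pos = 0;

          public boolean next(K k, V v) throws IOException {
            if (pos >= cached.size()) {
              return false;
            }
            Writable[] row = cached.get(pos++);
            ReflectionUtils.copy(job, row[0], k); // into caller's objects
            ReflectionUtils.copy(job, row[1], v);
            return true;
          }

          public K createKey() { return real.createKey(); }
          public V createValue() { return real.createValue(); }
          public long getPos() { return pos; }
          public float getProgress() {
            return cached.isEmpty() ? 1.0f : (float) pos / cached.size();
          }
          public void close() throws IOException { real.close(); }
        };
      }
    }

Even on a hit this sketch still opens the underlying reader just to borrow
createKey()/createValue(); a real version would instantiate those from the
job's configured classes. A CachedFileSystem variant would instead cache raw
bytes below the deserializer, giving up the saved deserialization in
exchange for much simpler memory accounting.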

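For the HiveRegionServer half of the discussion, the "use ZooKeeper to
manage the HiveRegions like HBase does" step could start as one znode per
cached partition, holding the caching server's host:port as its data. Here
is a sketch of the lookup the query optimizer would make before planning;
the /hiveregions layout and the HiveRegionLocator class are invented for
illustration, and only the stock ZooKeeper client API is assumed:

    import java.io.IOException;

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;

    public class HiveRegionLocator {

      private final ZooKeeper zk;

      public HiveRegionLocator(String zkQuorum) throws IOException {
        // One-shot lookups only in this sketch, so no Watcher is set.
        this.zk = new ZooKeeper(zkQuorum, 30000, null);
      }

      /**
       * Returns the "host:port" of the HiveRegionServer currently caching
       * the given partition, or null if nobody is, in which case the
       * optimizer plans a normal map/reduce over HDFS.
       */
      public String locate(String table, String partition) throws Exception {
        String znode = "/hiveregions/" + table + "/" + partition;
        try {
          byte[] data = zk.getData(znode, false, null);
          return new String(data, "UTF-8");
        } catch (KeeperException.NoNodeException notCached) {
          return null; // partition not cached anywhere: fall back to HDFS
        }
      }
    }

Because a miss simply falls back to the normal scan, the cache stays a pure
optimization: a lost region server costs speed, not correctness, which is
part of why HiveRegionServers would be simpler to operate than HBase regions
with high random reads and inserts.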