+1. Since blocks are read-only, the caching logic should be pretty simple.
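To make that concrete, here is a minimal sketch of the read-only block cache idea: because HDFS blocks are immutable, a cache keyed by (path, block offset) never needs invalidation, only eviction when it fills. All names here (`BlockCache`, `read_block`, the loader callback) are illustrative, not a real Hadoop API.

```python
from collections import OrderedDict

class BlockCache:
    """Sketch of a RAM cache for immutable blocks. Entries are never
    invalidated (blocks don't change); the only policy needed is LRU
    eviction when the cache is at capacity."""

    def __init__(self, capacity_blocks, read_from_disk):
        self.capacity = capacity_blocks
        self.read_from_disk = read_from_disk  # fallback loader, e.g. an HDFS read
        self.cache = OrderedDict()            # (path, offset) -> bytes, in LRU order
        self.hits = 0
        self.misses = 0

    def read_block(self, path, offset):
        key = (path, offset)
        if key in self.cache:
            self.cache.move_to_end(key)       # mark as most recently used
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        data = self.read_from_disk(path, offset)
        self.cache[key] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)    # evict least recently used block
        return data
```

In the weblog scenario above, the second hourly-summary job touching the same block would be served from RAM instead of re-reading and re-deserializing from disk.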
Zheng

On Mon, Oct 5, 2009 at 7:22 PM, Edward Capriolo <[email protected]> wrote:
> While I have not profiled this, I would think we can save on repeated
> deserialization and reuse.
>
> For example, suppose we have a weblog table partitioned by hour. After
> the information is moved into the DFS we can assume that several
> processes will want to use this information to generate summaries, so
> that block may be used multiple times in the next hour. If we keep
> that block in memory in a binary form we will not need to deserialize
> it again. We may also be able to read directly from memory rather than
> from disk, thus freeing up disk resources for shuffle-sorting.
>
> Also suppose we have a process "select x,y,z from tablea into table
> tableb"; another process may operate on tableb to create tablec. If
> we wrote tableb to a cache as it was being created, it would be
> available for tablec.
>
> I have been pondering some possible implementations; this could
> probably be done at the Hadoop layer, as a CachedInputFormat or
> CachedFileSystem. I am just thinking RAM caches would have to help.
> Considering that blocks do not change, caching should not be difficult.
>
> On Mon, Oct 5, 2009 at 9:32 PM, Zheng Shao <[email protected]> wrote:
> > That's true.
> >
> > My thoughts are more constrained here because of resources :)
> >
> > What do you think is the major bottleneck for speed now? I have a vague
> > feeling that it's the shuffling phase between map and reduce.
> >
> > Zheng
> >
> > On Mon, Oct 5, 2009 at 3:31 PM, Edward Capriolo <[email protected]> wrote:
> >>
> >> On Sun, Oct 4, 2009 at 7:24 PM, Zheng Shao <[email protected]> wrote:
> >> > +1
> >> >
> >> > Making input data and query results available with a short delay is
> >> > definitely a very attractive feature for Hive.
> >> > There are multiple approaches to achieve this, mainly depending on
> >> > how much we leverage HBase.
> >> >
> >> > The simplest way to go is probably to have a good Hive/HBase
> >> > integration like HIVE-705, HIVE-806, etc.
> >> > This can help us leverage the efforts done by HBase to the maximum
> >> > degree.
> >> > The potential drawback is that HBase tables have support for random
> >> > writes, which may cause additional overhead for simple sequential
> >> > writes.
> >> >
> >> > Eventually we may (or may not) need our own HiveRegionServer which
> >> > hosts data in any format supported by Hive (on top of just the
> >> > internal file format supported by HBase), but I feel it might be a
> >> > good start to first try to integrate the two.
> >> >
> >> > Zheng
> >> >
> >> > On Sun, Oct 4, 2009 at 8:42 AM, Edward Capriolo <[email protected]> wrote:
> >> >>
> >> >> After sitting through some HDFS/HBase presentations yesterday, I
> >> >> started thinking that the Hive model of doing its map/reduce over
> >> >> raw files from HDFS is great, but a dedicated caching/region server
> >> >> could be a big benefit in answering real-time queries.
> >> >>
> >> >> I calculated that one data center (not counting non-cachable
> >> >> content) could have about 378 MB of logs a day, going from
> >> >> Facebook's information here:
> >> >> http://www.facebook.com/note.php?note_id=110207012002
> >> >>
> >> >> "The log files are named with the date and time of collection.
> >> >> Individual hourly files are around 55 MB when compressed, so eight
> >> >> months of compressed data takes up about 300 GB of space."
> >> >>
> >> >> During the day and week the logs are collected, one would expect the
> >> >> data to be used very often, so having it in a cache would be ideal.
> >> >>
> >> >> Given that an average DataNode might have 8 GB or 16 GB of RAM, one
> >> >> GB could be sliced off as a dedicated HiveRegion server, or it could
> >> >> run as several dedicated servers, with maybe RAM and nothing else.
> >> >>
> >> >> A HiveRegionServer would/could contain Hive tables in a compressed
> >> >> format, maybe Hive tables in a Derby format, indexes we are
> >> >> creating, and some information about usage so that different
> >> >> caching algorithms could evict sections. We could use ZooKeeper to
> >> >> manage the HiveRegions like HBase does.
> >> >>
> >> >> The Hive query optimizer would look to see if the data was in the
> >> >> HiveRegionServer, or run as normal.
> >> >>
> >> >> Has anyone ever thought of this?
> >> >> Edward
> >> >
> >> >
> >> > --
> >> > Yours,
> >> > Zheng
> >>
> >> I agree that Hive/HBase integration is a good thing, but I think the
> >> differences between Hive and HBase are vast. Hive is row oriented with
> >> column support, and HBase is column oriented. HBase works on sparse
> >> files and needs random inserts, while Hive data is mostly Write Once
> >> Read Many. HBase works mostly in memory with a commit log, while Hive
> >> writes directly to HDFS during the map and reduce phases.
> >>
> >> The way I look at it, there is already a lot of waste. Imagine jobs
> >> run simultaneously or right after each other on relatively small data
> >> sets:
> >>
> >> select name, count(*) from people group by name;
> >> select * from people;
> >> select * from people where name = 'sarah';
> >>
> >> With a HiveRegionServer, sections of data might already be in memory
> >> in a fast binary form, or on disk in an embedded db like the one used
> >> by map-side joins. Disks would be used for intermediate results rather
> >> than for reprocessing the same chunks of data repeatedly.
> >>
> >> Managing HiveRegionServers would be much less complex than managing
> >> HBase regions that have high random reads/inserts. In its simplest
> >> form it would just be an exact duplicate of the data; in a more
> >> complex form, an optimized binary form of the data.
> >
> >
> > --
> > Yours,
> > Zheng

--
Yours,
Zheng
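As a footnote to the thread, the optimizer check Edward describes ("look to see if the data was in the HiveRegionServer, or run as normal") amounts to a catalog lookup, with per-table usage counts that an eviction policy could later consult. This is a hedged sketch of that routing decision; the catalog, its names, and the address format are all hypothetical, not part of Hive or HBase.

```python
class HiveRegionCatalog:
    """Hypothetical catalog of tables hosted by HiveRegionServers.
    plan_scan() routes a table scan to the cached copy when one is
    registered, and falls back to a normal map/reduce over HDFS
    otherwise. Usage counts model the 'information about usage'
    the thread suggests keeping for cache eviction."""

    def __init__(self):
        self.regions = {}  # table name -> region server address
        self.usage = {}    # table name -> number of cached scans served

    def register(self, table, server):
        self.regions[table] = server
        self.usage[table] = 0

    def plan_scan(self, table):
        server = self.regions.get(table)
        if server is None:
            return ("mapreduce", "hdfs")  # run as normal
        self.usage[table] += 1            # record a cache hit for eviction policy
        return ("cached", server)
```

Under this sketch, the three queries against `people` above would all be routed to the same cached copy, while a table nobody registered still runs as an ordinary map/reduce job.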
