While I have not profiled this, I would think we can save on repeated deserialization by reusing blocks.
For example, suppose we have a weblog table partitioned by hour. After the information is moved into the DFS, we can assume that several processes will want to use it to generate summaries, so a given block may be read multiple times in the next hour. If we keep that block in memory in a binary form, we will not need to deserialize it again. We may also be able to read directly from memory rather than from disk, freeing up disk resources for shuffle-sorting.

Also, suppose we have a process "select x,y,z from tablea into table tableb"; another process may operate on tableb to create tablec. If we wrote tableb to a cache as it was being created, it would be available for the tablec job.

I have been pondering some possible implementations; this could probably be done at the Hadoop layer as a CachedInputFormat or CachedFileSystem. I am just thinking RAM caches would have to help. Since blocks do not change, caching should not be difficult.

On Mon, Oct 5, 2009 at 9:32 PM, Zheng Shao <[email protected]> wrote:
> That's true.
>
> My thoughts are more constrained here because of resources :)
>
> What do you think is the major bottleneck for speed now? I have a vague
> feeling that it's the shuffling phase between map and reduce.
>
> Zheng
>
> On Mon, Oct 5, 2009 at 3:31 PM, Edward Capriolo <[email protected]>
> wrote:
>>
>> On Sun, Oct 4, 2009 at 7:24 PM, Zheng Shao <[email protected]> wrote:
>> > +1
>> >
>> > Making input data and query results available with a short delay is
>> > definitely a very attractive feature for Hive.
>> > There are multiple approaches to achieve this, mainly depending on how
>> > much we leverage HBase.
>> >
>> > The simplest way to go is probably to have a good Hive/HBase integration
>> > like HIVE-705, HIVE-806, etc.
>> > This can help us leverage the efforts done by HBase to the maximum
>> > degree.
>> > The potential drawback is that HBase tables support random writes,
>> > which may cause additional overhead for simple sequential writes.
>> >
>> > Eventually we may (or may not) need our own HiveRegionServer which hosts
>> > data in any format supported by Hive (on top of just the internal file
>> > format supported by HBase), but I feel it might be a good start to first
>> > try to integrate the two.
>> >
>> > Zheng
>> >
>> > On Sun, Oct 4, 2009 at 8:42 AM, Edward Capriolo <[email protected]>
>> > wrote:
>> >>
>> >> After sitting through some HDFS/HBase presentations yesterday, I
>> >> started thinking that the Hive model of doing its map/reduce over raw
>> >> files from HDFS is great, but a dedicated caching/region server could
>> >> be a big benefit in answering real-time queries.
>> >>
>> >> I calculated that one data center (not counting non-cacheable content)
>> >> could have about 378MB of logs a day. Going from Facebook's information
>> >> here:
>> >> http://www.facebook.com/note.php?note_id=110207012002
>> >>
>> >> "The log files are named with the date and time of collection.
>> >> Individual hourly files are around 55 MB when compressed, so eight
>> >> months of compressed data takes up about 300 GB of space."
>> >>
>> >> During the day and week the logs are collected, one would expect the
>> >> data to be used very often, so having it in a cache would be ideal.
>> >>
>> >> Given that an average DataNode might have 8 GB or 16 GB of RAM, one GB
>> >> could be sliced off as a dedicated HiveRegion server, or it could run
>> >> as several dedicated servers with maybe RAM and nothing else.
>> >>
>> >> A HiveRegionServer could contain Hive tables in a compressed
>> >> format, maybe Hive tables in a Derby format, indexes we are creating,
>> >> and some information about usage so that different caching algorithms
>> >> could evict sections. We could use ZooKeeper to manage the HiveRegions
>> >> like HBase does.
>> >>
>> >> The Hive query optimizer would look to see if the data was in the
>> >> HiveRegionServer, or run as normal.
>> >>
>> >> Has anyone ever thought of this?
>> >> Edward
>> >
>> >
>> >
>> > --
>> > Yours,
>> > Zheng
>> >
>>
>> I agree that Hive/HBase integration is a good thing, but I think the
>> differences between Hive and HBase are vast. Hive is row oriented with
>> column support, while HBase is column oriented. HBase works on
>> sparse data and needs random inserts, while Hive data is mostly Write
>> Once Read Many. HBase works mostly in memory with a commit log,
>> while Hive writes during the map and reduce phases directly to HDFS.
>>
>> The way I look at it, there is already a lot of waste. Imagine jobs that
>> run simultaneously or right after each other on relatively small data
>> sets:
>>
>> select name,count() from people group by name;
>> select * from people;
>> select * from people where name='sarah';
>>
>> With a HiveRegionServer, sections of data might already be in memory in
>> a fast binary form, or on disk in an embedded db like the one used by
>> map-side joins. Disks would be used for intermediate results rather
>> than for reprocessing the same chunks of data repeatedly.
>>
>> Managing HiveRegionServers would be much less complex than managing
>> HBase regions that have high random read/insert rates. In its simplest
>> form it would just be an exact duplicate of the data; in a more complex
>> form, an optimized binary form of the data.
>
>
>
> --
> Yours,
> Zheng
>
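To make the block-cache idea from the top of the thread concrete, here is a minimal sketch in plain Java. It is not Hadoop or Hive code; the class name BlockCache, the (path, offset) key scheme, and readFromDfs are all made up for illustration. The only point it demonstrates is the one argued above: since DFS blocks are immutable, a cache entry never goes stale, so a plain LRU map keyed by file and block offset is enough to skip repeated reads and deserialization.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a CachedInputFormat-style block cache.
// Blocks in the DFS do not change, so a (path, offset) key never
// goes stale and a simple LRU map avoids repeated deserialization.
public class BlockCache {
    private final int capacity;
    private int misses = 0;
    private final Map<String, byte[]> cache;

    public BlockCache(int capacityBlocks) {
        this.capacity = capacityBlocks;
        // An access-ordered LinkedHashMap evicts the least-recently-used
        // block once we exceed the configured capacity.
        this.cache = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> e) {
                return size() > capacity;
            }
        };
    }

    // Key a block by file path and starting offset; immutable blocks
    // make cache invalidation unnecessary.
    public byte[] getBlock(String path, long offset) {
        String key = path + "@" + offset;
        byte[] block = cache.get(key);
        if (block == null) {
            misses++;
            block = readFromDfs(path, offset); // cold read: disk + deserialize
            cache.put(key, block);
        }
        return block; // warm read: served straight from RAM
    }

    public int misses() {
        return misses;
    }

    // Stand-in for the expensive DFS read + deserialization we want to skip.
    private byte[] readFromDfs(String path, long offset) {
        return ("block:" + path + "@" + offset).getBytes();
    }

    public static void main(String[] args) {
        BlockCache cache = new BlockCache(2);
        cache.getBlock("/logs/2009-10-05/10.log", 0);   // miss: read + deserialize
        cache.getBlock("/logs/2009-10-05/10.log", 0);   // hit: served from memory
        cache.getBlock("/logs/2009-10-05/11.log", 0);   // miss
        System.out.println("misses=" + cache.misses()); // misses=2
    }
}
```

In the weblog scenario above, the hourly summary jobs would all hit the same recent blocks, so only the first job pays the read-and-deserialize cost; whether this layer belongs in a CachedInputFormat, a CachedFileSystem, or a separate region server is exactly the open design question of the thread.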
