That's true. My thoughts are more constrained here because of resources :)
What do you think is the major bottleneck for speed now? I have a vague feeling that it's the shuffling phase between map and reduce.

Zheng

On Mon, Oct 5, 2009 at 3:31 PM, Edward Capriolo <[email protected]> wrote:
> On Sun, Oct 4, 2009 at 7:24 PM, Zheng Shao <[email protected]> wrote:
> > +1
> >
> > Making input data and query results available in a short delay is
> > definitely a very attractive feature for Hive.
> > There are multiple approaches to achieve this, mainly depending on
> > how much we leverage HBase.
> >
> > The simplest way to go is probably to have a good Hive/HBase
> > integration like HIVE-705, HIVE-806 etc.
> > This can help us leverage the efforts done by HBase to the maximum
> > degree. The potential drawback is that HBase tables support random
> > writes, which may cause additional overhead for simple sequential
> > writes.
> >
> > Eventually we may (or may not) need our own HiveRegionServer which
> > hosts data in any format supported by Hive (on top of just the
> > internal file format supported by HBase), but I feel it might be a
> > good start to first integrate the two.
> >
> > Zheng
> >
> > On Sun, Oct 4, 2009 at 8:42 AM, Edward Capriolo <[email protected]> wrote:
> >>
> >> After sitting through some HDFS/HBase presentations yesterday, I
> >> started thinking that the Hive model of doing its map/reduce over
> >> raw files from HDFS is great, but a dedicated caching/region server
> >> could be a big benefit in answering real-time queries.
> >>
> >> I calculated that one data center (not counting non-cachable
> >> content) could have about 378MB of logs a day. Going from
> >> Facebook's information here:
> >> http://www.facebook.com/note.php?note_id=110207012002
> >>
> >> "The log files are named with the date and time of collection.
> >> Individual hourly files are around 55 MB when compressed, so eight
> >> months of compressed data takes up about 300 GB of space."
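[Editor's note: the quoted Facebook figures are easy to sanity-check. At roughly 55 MB per compressed hourly file, eight months of logs does come out to about 300 GB:]

```python
# Sanity check of the quoted Facebook log-volume figures.
mb_per_hour = 55        # compressed hourly log file size, per the quote
days = 8 * 30           # "eight months", approximated as 240 days
total_gb = mb_per_hour * 24 * days / 1024
print(round(total_gb))  # ~309 GB, consistent with "about 300 GB"
```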
> >>
> >> During the day and week the logs are collected, one would expect
> >> the data to be used very often, so having it in a cache would be
> >> ideal.
> >>
> >> Given that an average DataNode might have 8 GB or 16 GB of RAM, one
> >> GB could be sliced off for a dedicated HiveRegion server, or it
> >> could run as several dedicated servers with RAM and nothing else.
> >>
> >> A HiveRegionServer would/could contain Hive tables in a compressed
> >> format, maybe Hive tables in a Derby format, indexes we are
> >> creating, and some information about usage so that different
> >> caching algorithms could evict sections. We could use ZooKeeper to
> >> manage the HiveRegions like HBase does.
> >>
> >> The Hive query optimizer would look to see if the data was in the
> >> HiveRegionServer, or run as normal.
> >>
> >> Has anyone ever thought of this?
> >> Edward
> >
> > --
> > Yours,
> > Zheng
> >

> I agree that Hive/HBase integration is a good thing. I think that the
> differences between Hive and HBase are vast. Hive is row oriented
> with column support, and HBase is column oriented. HBase works on
> sparse files and needs random inserts, while Hive data is mostly
> Write Once Read Many. HBase works mostly in memory with a commit log,
> while Hive writes directly to HDFS during the map and reduce phases.
>
> The way I look at it, there is already a lot of waste. Imagine jobs
> that run simultaneously, or right after each other, on relatively
> small data sets:
>
> select name, count(*) from people group by name;
> select * from people;
> select * from people where name = 'sarah';
>
> With a HiveRegionServer, sections of data might already be in memory
> in a fast binary form, or on disk in an embedded DB like the one used
> by map-side joins. Disks would be used for intermediate results
> rather than reprocessing the same chunks of data repeatedly.
>
> Managing HiveRegionServers would be much less complex than managing
> HBase regions that have high random read/insert rates. In its
> simplest form it would just be an exact duplicate of the data; in a
> more complex form, an optimized binary representation of the data.

--
Yours,
Zheng
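[Editor's note: the HiveRegionServer proposed in this thread was never built; the sketch below only illustrates the flow being discussed — check an in-memory region cache keyed by (table, partition) before falling back to a normal MapReduce job. All names here are hypothetical.]

```python
# Hypothetical sketch of the HiveRegionServer lookup idea from the
# thread above. None of these classes exist in Hive.

class HiveRegionCache:
    """Toy in-memory cache of table regions, keyed by (table, partition)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.regions = {}  # (table, partition) -> cached bytes

    def get(self, table, partition):
        return self.regions.get((table, partition))

    def put(self, table, partition, data):
        # Naive eviction: drop the most recently inserted entry until the
        # new data fits. A real server would use the usage statistics the
        # thread mentions to pick an eviction policy.
        while self.used + len(data) > self.capacity and self.regions:
            _, evicted = self.regions.popitem()
            self.used -= len(evicted)
        self.regions[(table, partition)] = data
        self.used += len(data)

# The optimizer would consult the cache first and only launch a full
# MapReduce job on a miss.
cache = HiveRegionCache(capacity_bytes=1 << 30)  # the 1 GB RAM slice
cache.put("people", "2009-10-05", b"...serialized rows...")
cached = cache.get("people", "2009-10-05")
run_full_job = cached is None  # cache hit: no job needed
```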
