+1. Since blocks are read-only, the caching logic should be pretty simple.
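Since entries for immutable blocks never need invalidation, only eviction, the cache can be little more than an LRU map keyed by (path, offset). A rough sketch of that idea; all class and field names here are made up for illustration, not actual Hadoop APIs:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of an in-memory cache for immutable HDFS blocks. Because a block
// never changes once written, there is no invalidation logic at all --
// only LRU eviction when the cache fills up.
public class BlockCache {
    // Key identifying a block: file path plus block offset.
    static final class BlockKey {
        final String path;
        final long offset;
        BlockKey(String path, long offset) { this.path = path; this.offset = offset; }
        @Override public boolean equals(Object o) {
            if (!(o instanceof BlockKey)) return false;
            BlockKey k = (BlockKey) o;
            return offset == k.offset && path.equals(k.path);
        }
        @Override public int hashCode() { return path.hashCode() * 31 + Long.hashCode(offset); }
    }

    private final int capacity;
    private final Map<BlockKey, byte[]> cache;

    BlockCache(int capacity) {
        this.capacity = capacity;
        // accessOrder=true turns LinkedHashMap into an LRU map;
        // removeEldestEntry evicts the least recently used block.
        this.cache = new LinkedHashMap<BlockKey, byte[]>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<BlockKey, byte[]> e) {
                return size() > BlockCache.this.capacity;
            }
        };
    }

    synchronized byte[] get(String path, long offset) {
        return cache.get(new BlockKey(path, offset));
    }

    synchronized void put(String path, long offset, byte[] block) {
        cache.put(new BlockKey(path, offset), block);
    }

    public static void main(String[] args) {
        BlockCache c = new BlockCache(2);
        c.put("/logs/2009-10-05/00.log", 0, new byte[]{1});
        c.put("/logs/2009-10-05/00.log", 64, new byte[]{2});
        c.get("/logs/2009-10-05/00.log", 0);                 // touch -> most recently used
        c.put("/logs/2009-10-05/01.log", 0, new byte[]{3});  // evicts offset 64
        System.out.println(c.get("/logs/2009-10-05/00.log", 64) == null); // prints true
    }
}
```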

Zheng

On Mon, Oct 5, 2009 at 7:22 PM, Edward Capriolo <[email protected]> wrote:

> While I have not profiled this, I would think we can save on repeated
> deserialization and reuse.
>
> For example, suppose we have a weblog table partitioned by hour. After
> the information is moved into the DFS we can assume that several
> processes will want to use this information to generate summaries. So
> that block may be used multiple times in the next hour. If we keep
> that block in memory in a binary form we will not need to deserialize
> it again. We also may be able to read directly from memory rather than
> from disk, thus freeing up disk resources for shuffle-sorting.
>
> Also suppose we have a process "select x,y,z from tablea into table
> tableb"; another process may then operate on tableb to create tablec.
> If we wrote tableb to a cache as it was being created, it would be
> available for the job that creates tablec.
>
> I have been pondering some possible implementations; this could
> probably be done at the Hadoop layer, as a CachedInputFormat or
> CachedFileSystem. I am just thinking RAM caches would have to help.
> Considering blocks do not change, caching should not be difficult.
>
> On Mon, Oct 5, 2009 at 9:32 PM, Zheng Shao <[email protected]> wrote:
> > That's true.
> >
> > My thoughts are more constrained here because of resources :)
> >
> > What do you think is the major bottleneck for speed now? I have a vague
> > feeling that it's the shuffling phase between map and reduce.
> >
> > Zheng
> >
> > On Mon, Oct 5, 2009 at 3:31 PM, Edward Capriolo <[email protected]>
> > wrote:
> >>
> >> On Sun, Oct 4, 2009 at 7:24 PM, Zheng Shao <[email protected]> wrote:
> >> > +1
> >> >
> >> > Making input data and query results available with a short delay is
> >> > definitely a very attractive feature for Hive.
> >> > There are multiple approaches to achieve this, mainly depending on
> >> > how much we leverage HBase.
> >> >
> >> > The simplest way to go is probably to have a good Hive/HBase
> >> > integration like HIVE-705, HIVE-806 etc.
> >> > This can help us leverage the efforts done by HBase to the maximum
> >> > degree.
> >> > The potential drawback is that HBase tables support random writes,
> >> > which may cause additional overhead for simple sequential writes.
> >> >
> >> > Eventually we may (or may not) need our own HiveRegionServer which
> >> > hosts data in any format supported by Hive (not just the internal
> >> > file format supported by HBase), but I feel it might be a good start
> >> > to first try to integrate the two.
> >> >
> >> > Zheng
> >> >
> >> > On Sun, Oct 4, 2009 at 8:42 AM, Edward Capriolo
> >> > <[email protected]> wrote:
> >> >>
> >> >> After sitting through some HDFS/HBase presentations yesterday, I
> >> >> started thinking that the Hive model of doing its map/reduce over
> >> >> raw files from HDFS is great, but a dedicated caching/region server
> >> >> could be a big benefit in answering real-time queries.
> >> >>
> >> >> I calculated that one data center (not counting non-cacheable
> >> >> content) could have about 378MB of logs a day, going from Facebook's
> >> >> information here:
> >> >> http://www.facebook.com/note.php?note_id=110207012002
> >> >>
> >> >> "The log files are named with the date and time of collection.
> >> >> Individual hourly files are around 55 MB when compressed, so eight
> >> >> months of compressed data takes up about 300 GB of space."
> >> >>
> >> >> During the day and week the logs are collected, one would expect
> >> >> the data to be used very often, so having it in a cache would be
> >> >> ideal.
> >> >>
> >> >> Given that an average DataNode might have 8 GB or 16 GB of RAM, one
> >> >> GB could be sliced off as a dedicated HiveRegionServer, or it could
> >> >> run as several dedicated servers, with maybe RAM and nothing else.
> >> >>
> >> >> A HiveRegionServer would/could contain Hive tables in a compressed
> >> >> format, maybe Hive tables in a Derby format, indexes we are
> >> >> creating, and some information about usage so different caching
> >> >> algorithms could evict sections. We could use ZooKeeper to manage
> >> >> the HiveRegions like HBase does.
> >> >>
> >> >> The Hive query optimizer would look to see if the data was in the
> >> >> HiveRegionServer, or run as normal.
> >> >>
> >> >> Has anyone ever thought of this?
> >> >> Edward
> >> >
> >> >
> >> >
> >> > --
> >> > Yours,
> >> > Zheng
> >> >
> >>
> >> I agree that Hive/HBase integration is a good thing. I think that the
> >> differences between Hive and HBase are vast. Hive is row-oriented
> >> with column support and HBase is column-oriented. HBase works on
> >> sparse files and needs random inserts, while Hive data is mostly
> >> Write Once Read Many. HBase works mostly in memory with a commit log,
> >> while Hive writes during the map and reduce phases directly to HDFS.
> >>
> >> The way I look at it, there is already a lot of waste. Imagine jobs
> >> done simultaneously or right after each other on relatively small
> >> data sets:
> >>
> >>
> >> select name,count(*) from people group by name;
> >> select * from people;
> >> select * from people where name='sarah';
> >>
> >> With a HiveRegionServer, sections of data might already be in memory
> >> in a fast binary form, or on disk in an embedded db like the one used
> >> by map-side joins. Disks would be used for intermediate results
> >> rather than reprocessing the same chunks of data repeatedly.
> >>
> >> Managing HiveRegionServers would be much less complex than managing
> >> HBase regions that have high random read/insert rates. In its simple
> >> form it would just be an exact duplicate of the data; in a more
> >> complex form, an optimized binary form of the data.
> >
> >
> >
> > --
> > Yours,
> > Zheng
> >
>
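The CachedInputFormat/CachedFileSystem idea quoted above could start out as nothing more than a check-RAM-then-read-HDFS path. A minimal sketch, assuming a per-split cache; all class and method names here are hypothetical, not real Hive/Hadoop APIs:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical read path for the proposed CachedInputFormat: consult a
// RAM cache keyed by input split first, and fall back to the real HDFS
// read only on a miss. Since the underlying blocks are immutable, the
// cache never needs invalidation, only eviction (omitted here).
public class CachedReader {
    private final Map<String, byte[]> ramCache = new ConcurrentHashMap<>();

    // On a miss, hdfsRead stands in for the usual DFS read; its result is
    // kept so repeated jobs over the same split never touch disk again.
    byte[] read(String splitId, Supplier<byte[]> hdfsRead) {
        return ramCache.computeIfAbsent(splitId, id -> hdfsRead.get());
    }

    public static void main(String[] args) {
        CachedReader r = new CachedReader();
        final int[] hdfsReads = {0};
        Supplier<byte[]> slow = () -> { hdfsReads[0]++; return new byte[]{42}; };
        r.read("weblog/hour=19/part-0", slow);
        r.read("weblog/hour=19/part-0", slow); // second read served from RAM
        System.out.println(hdfsReads[0]);      // prints 1
    }
}
```

The same shape would cover the tableb/tablec case above: the job writing tableb populates the cache as a side effect, and the job creating tablec hits it instead of HDFS.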



-- 
Yours,
Zheng
