This might help as well.
http://issues.apache.org/jira/browse/HADOOP-288

On Tue, Oct 6, 2009 at 4:03 AM, Zheng Shao <[email protected]> wrote:
> +1. Since blocks are read-only, the caching logic should be pretty simple.
>
>
> Zheng
>
> On Mon, Oct 5, 2009 at 7:22 PM, Edward Capriolo <[email protected]>
> wrote:
>>
>> While I have not profiled this, I would think we could save on
>> repeated deserialization and reuse blocks.
>>
>> For example, suppose we have a weblog table partitioned by hour. After
>> the information is moved into the DFS, we can assume that several
>> processes will want to use this information to generate summaries, so
>> a given block may be used multiple times in the next hour. If we keep
>> that block in memory in a binary form, we will not need to deserialize
>> it again. We may also be able to read directly from memory rather than
>> from disk, thus freeing up disk resources for shuffle-sorting.
>>
>> Also, suppose we have a process "select x,y,z from tablea into table
>> tableb"; another process may operate on tableb to create tablec. If
>> we wrote tableb to a cache as it was being created, it would be
>> available for building tablec.
>>
>> I have been pondering some possible implementations; this could
>> probably be done at the Hadoop layer, as a CachedInputFormat or a
>> CachedFileSystem. I am just thinking RAM caches would have to help.
>> Considering that blocks do not change, caching should not be difficult.
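A minimal sketch of the CachedInputFormat idea above, in Python for brevity (the real thing would be Java against Hadoop's InputFormat/FileSystem APIs, and every name here is hypothetical). Because HDFS blocks are immutable, the cache never needs invalidation, only size-bounded eviction:

```python
from collections import OrderedDict

class BlockCache:
    """LRU cache for immutable HDFS blocks. Since blocks never change,
    a cached entry is always valid; the only policy needed is eviction."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # (path, offset) -> bytes
        self.hits = 0
        self.misses = 0

    def read(self, path, offset, load_from_disk):
        key = (path, offset)
        if key in self.blocks:
            self.hits += 1
            self.blocks.move_to_end(key)  # mark most recently used
            return self.blocks[key]
        self.misses += 1
        data = load_from_disk(path, offset)  # fall through to "HDFS"
        self.blocks[key] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return data
```

A second job reading the same hourly partition would then be served from RAM instead of touching the disk a second time.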
>>
>> On Mon, Oct 5, 2009 at 9:32 PM, Zheng Shao <[email protected]> wrote:
>> > That's true.
>> >
>> > My thoughts are more constraint here because of resources :)
>> >
>> > What do you think is the major bottleneck for speed now? I have a vague
>> > feeling that it's the shuffling phase between map and reduce.
>> >
>> > Zheng
>> >
>> > On Mon, Oct 5, 2009 at 3:31 PM, Edward Capriolo <[email protected]>
>> > wrote:
>> >>
>> >> On Sun, Oct 4, 2009 at 7:24 PM, Zheng Shao <[email protected]> wrote:
>> >> > +1
>> >> >
>> >> > Making input data and query results available in a short delay is
>> >> > definitely
>> >> > a very attractive feature for Hive.
>> >> > There are multiple approaches to achieve this, mainly depending on
>> >> > how
>> >> > much
>> >> > we leverage HBase.
>> >> >
>> >> > The simplest way to go is to probably have a good Hive/HBase
>> >> > integration
>> >> > like HIVE-705, HIVE-806 etc.
>> >> > This can help us leverage the efforts done by HBase to the maximum
>> >> > degree.
>> >> > The potential drawback is that HBase tables have support for random
>> >> > writes
>> >> > which may cause additional overhead for simple sequential writes.
>> >> >
>> >> > Eventually we may (or may not) need our own HiveRegionServer which
>> >> > hosts
>> >> > data in any format supported by Hive (on top of just the internal
>> >> > file
>> >> > format supported by HBase), but I feel it might be a good start to
>> >> > first try to integrate the two.
>> >> >
>> >> > Zheng
>> >> >
>> >> > On Sun, Oct 4, 2009 at 8:42 AM, Edward Capriolo
>> >> > <[email protected]>
>> >> > wrote:
>> >> >>
>> >> >> After sitting through some HDFS/HBase presentations yesterday, I
>> >> >> started thinking that the Hive model of doing its map/reduce over
>> >> >> raw files from HDFS is great, but a dedicated caching/region server
>> >> >> could be a big benefit in answering real-time queries.
>> >> >>
>> >> >> I calculated that one data center (not counting non-cacheable
>> >> >> content) could have about 378 MB of logs a day. Going from
>> >> >> Facebook's information here:
>> >> >> http://www.facebook.com/note.php?note_id=110207012002
>> >> >>
>> >> >> "The log files are named with the date and time of collection.
>> >> >> Individual hourly files are around 55 MB when compressed, so eight
>> >> >> months of compressed data takes up about 300 GB of space."
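As a quick sanity check on the quoted figure (assuming 30-day months, which the note does not state), 55 MB per hour over eight months does come out near the 300 GB mentioned:

```python
hourly_mb = 55                              # compressed hourly log file, per the quote
months = 8
total_mb = hourly_mb * 24 * 30 * months     # hours/day * assumed days/month * months
total_gb = total_mb / 1024                  # ~309 GB, consistent with "about 300 GB"
```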
>> >> >>
>> >> >> During the day and week the logs are collected, one would expect
>> >> >> the data to be used very often, so having it in a cache would be
>> >> >> ideal.
>> >> >>
>> >> >> Given that an average DataNode might have 8 GB or 16 GB of RAM, one
>> >> >> GB could be sliced off as a dedicated HiveRegion server, or it
>> >> >> could run as several dedicated servers with RAM and little else.
>> >> >>
>> >> >> A HiveRegionServer would/could contain Hive tables in a compressed
>> >> >> format, maybe Hive tables in a Derby format, indexes we are
>> >> >> creating, and some information about usage so that different
>> >> >> caching algorithms could evict sections. We could use ZooKeeper to
>> >> >> manage the HiveRegions like HBase does.
>> >> >>
>> >> >> The Hive query optimizer would look to see whether the data was in
>> >> >> the HiveRegionServer, and otherwise run as normal.
>> >> >>
>> >> >> Has anyone ever thought of this?
>> >> >> Edward
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Yours,
>> >> > Zheng
>> >> >
>> >>
>> >> I agree that Hive/HBase integration is a good thing, but I think the
>> >> differences between Hive and HBase are vast. Hive is row-oriented
>> >> with column support while HBase is column-oriented. HBase works on
>> >> sparse files and needs random inserts, while Hive data is mostly
>> >> Write Once Read Many. HBase works mostly in memory with a commit log,
>> >> while Hive writes during the map and reduce phases directly to HDFS.
>> >>
>> >> The way I look at it, there is already a lot of waste. Imagine jobs
>> >> that run simultaneously, or right after each other, on relatively
>> >> small data sets:
>> >>
>> >>
>> >> select name, count(*) from people group by name;
>> >> select * from people;
>> >> select * from people where name = 'sarah';
>> >>
>> >> With a HiveRegionServer, sections of data might already be in memory
>> >> in a fast binary form, or on disk in an embedded db like the one used
>> >> by map-side joins. Disks would be used for intermediate results
>> >> rather than for reprocessing the same chunks of data repeatedly.
>> >>
>> >> Managing HiveRegionServers would be much less complex than managing
>> >> HBase regions that have high random read/insert rates. In its
>> >> simplest form the cache would just be an exact duplicate of the data;
>> >> in a more complex form, an optimized binary representation of it.
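To make the savings from the three example queries concrete, here is a small sketch (Python, all names hypothetical) where the block is deserialized once and every later query is answered from the in-memory form:

```python
from collections import Counter

RAW = [b"alice", b"bob", b"sarah", b"sarah"]  # one "block" of serialized rows
deserializations = 0
_cached_rows = None

def rows():
    """Deserialize the block the first time it is needed, then serve
    every subsequent query from the cached in-memory form."""
    global _cached_rows, deserializations
    if _cached_rows is None:
        deserializations += 1
        _cached_rows = [r.decode() for r in RAW]
    return _cached_rows

# The three queries from the message above, all hitting the same cache:
by_name = Counter(rows())                        # select name, count(*) ... group by name
everyone = list(rows())                          # select * from people
sarahs = [n for n in rows() if n == "sarah"]     # select * ... where name = 'sarah'
```

Three queries, one deserialization; without the cache, each query would pay that cost (and the disk read) again.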
>> >
>> >
>> >
>> > --
>> > Yours,
>> > Zheng
>> >
>
>
>
> --
> Yours,
> Zheng
>
