That's true.

My thoughts are more constrained here because of resources :)

What do you think is the major bottleneck for speed now? I have a vague
feeling that it's the shuffling phase between map and reduce.

Zheng

On Mon, Oct 5, 2009 at 3:31 PM, Edward Capriolo <[email protected]> wrote:

> On Sun, Oct 4, 2009 at 7:24 PM, Zheng Shao <[email protected]> wrote:
> > +1
> >
> > Making input data and query results available with a short delay is
> > definitely a very attractive feature for Hive.
> > There are multiple approaches to achieve this, mainly depending on how
> > much we leverage HBase.
> >
> > The simplest way to go is probably to have a good Hive/HBase integration
> > like HIVE-705, HIVE-806, etc.
> > This can help us leverage the work done by HBase to the maximum degree.
> > The potential drawback is that HBase tables support random writes,
> > which may cause additional overhead for simple sequential writes.
> >
> > Eventually we may (or may not) need our own HiveRegionServer, which hosts
> > data in any format supported by Hive (on top of just the internal file
> > format supported by HBase), but I feel it might be a good start to first
> > try integrating the two.
> >
> > Zheng
> >
> > On Sun, Oct 4, 2009 at 8:42 AM, Edward Capriolo <[email protected]>
> > wrote:
> >>
> >> After sitting through some HDFS/HBase presentations yesterday, I
> >> started thinking that the Hive model of doing its map/reduce over raw
> >> files from HDFS is great, but a dedicated caching/region server could
> >> be a big benefit in answering real-time queries.
> >>
> >> I calculated that one data center (not counting non-cacheable content)
> >> could have about 378 MB of logs a day, going from Facebook's information
> >> here:
> >> http://www.facebook.com/note.php?note_id=110207012002
> >>
> >> "The log files are named with the date and time of collection.
> >> Individual hourly files are around 55 MB when compressed, so eight
> >> months of compressed data takes up about 300 GB of space."
> >>
> >> During the day and week the logs are collected, one would expect the
> >> data to be used very often, so having this in a cache would be ideal.
> >>
> >> Given that an average DataNode might have 8 GB or 16 GB of RAM, one GB
> >> could be sliced off as a dedicated HiveRegionServer, or it could run
> >> as several dedicated servers, possibly with RAM and nothing else.
> >>
> >> A HiveRegionServer would/could contain Hive tables in a compressed
> >> format, maybe Hive tables in a Derby format, indexes we are creating,
> >> and some information about usage so that different caching algorithms
> >> could evict sections. We could use ZooKeeper to manage the HiveRegions
> >> like HBase does.
> >>
> >> The Hive query optimizer would check whether the data was in the
> >> HiveRegionServer, or otherwise run as normal.
> >>
> >> Has anyone ever thought of this?
> >> Edward
> >
> >
> >
> > --
> > Yours,
> > Zheng
> >
>
> I agree that Hive/HBase integration is a good thing. I think that the
> differences between Hive and HBase are vast. Hive is row-oriented with
> column support, and HBase is column-oriented. HBase is built for
> sparse files and needs random inserts, while Hive data is mostly Write
> Once Read Many. HBase works mostly in memory with a commit log,
> while Hive writes directly to HDFS during the map and reduce phases.
>
> The way I look at it, there is already a lot of waste. Imagine jobs that
> run simultaneously or right after each other on relatively small data sets:
>
>
> select name, count(1) from people group by name;
> select * from people;
> select * from people where name = 'sarah';
>
> With a HiveRegionServer, sections of data might already be in memory in
> a fast binary form, or on disk in an embedded DB like the one used by
> map-side joins. Disks would be used for intermediate results rather
> than for reprocessing the same chunks of data repeatedly.
>
> Managing HiveRegionServers would be much less complex than managing
> HBase regions that have high random reads/inserts. In its simplest form
> it would just hold an exact duplicate of the data; in a more complex
> form, an optimized binary representation of the data.
>



-- 
Yours,
Zheng
