Hi Jean,

Thanks for the explanation.
I still have one doubt: why is HBase not a good fit for bulk loads and aggregations (full table scans)? Hive will also read each row for an aggregation, just as HBase does. Can you explain more?

On Wed, Apr 30, 2014 at 5:15 PM, Jean-Marc Spaggiari <[email protected]> wrote:

> Hi Shushant,
>
> Hive and HBase are two different things. You cannot really use one vs.
> the other.
>
> Hive is a query engine against HDFS data. Data can be stored in different
> formats like flat text, sequence files, Parquet files, or even HBase tables.
> HBase is both a query engine (Gets and Scans) and a storage engine on top of
> HDFS which allows you to store data for random reads and random writes.
>
> Then you can also add tools like Phoenix and Impala to the picture, which
> will allow you to query the data from HDFS or HBase too.
>
> A good way to know if HBase is a good fit or not is to ask yourself how you
> are going to write into HBase or read from HBase. HBase is good for
> random reads and random writes. If you only do bulk loads and aggregations
> (full table scans), HBase is not a good fit. If you do random access (client
> information, event details, etc.), HBase is a good fit.
>
> It's a bit oversimplified, but that should give you some starting points.
>
>
> 2014-04-30 4:34 GMT-04:00 Shushant Arora <[email protected]>:
>
> > I have a requirement to process huge weblogs on a daily basis.
> >
> > 1. Data will arrive incrementally in the datastore on a daily basis, and
> > I need cumulative and daily distinct user counts from the logs; after
> > that, the aggregated data will be loaded into an RDBMS like MySQL.
> >
> > 2. Data will be loaded into an HDFS data warehouse on a daily basis, and
> > the same data will be fetched from the HDFS warehouse, after some
> > filtering, into an RDBMS like MySQL and processed there.
> >
> > Which data warehouse is suitable for approaches 1 and 2, and why?
> >
> > Thanks
> > Shushant
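
To make the two access patterns discussed in this thread concrete, here is a minimal sketch in plain Python. It does not use the HBase or Hive APIs; the sample records and the function names `distinct_user_counts` and `build_index` are made up for illustration. The first function reads every record once to compute daily and cumulative distinct user counts (the full-scan style of work in the original question), while the second builds a keyed lookup so that one user's rows can be fetched without scanning (the random-access style).

```python
from collections import defaultdict

# Hypothetical weblog records as (date, user_id); values are illustrative only.
LOGS = [
    ("2014-04-28", "u1"),
    ("2014-04-28", "u2"),
    ("2014-04-29", "u2"),
    ("2014-04-29", "u3"),
    ("2014-04-30", "u1"),
]


def distinct_user_counts(logs):
    """Full scan: read every record once and return (daily, cumulative)
    distinct user counts keyed by day."""
    users_by_day = defaultdict(set)
    for day, user in logs:
        users_by_day[day].add(user)

    daily, cumulative, seen = {}, {}, set()
    for day in sorted(users_by_day):
        daily[day] = len(users_by_day[day])
        seen |= users_by_day[day]          # users seen on or before this day
        cumulative[day] = len(seen)
    return daily, cumulative


def build_index(logs):
    """Keyed store: map user_id -> list of days, so a single user's rows
    can be looked up by key instead of scanning all records."""
    index = defaultdict(list)
    for day, user in logs:
        index[user].append(day)
    return index


daily, cumulative = distinct_user_counts(LOGS)
index = build_index(LOGS)
print(daily)
print(cumulative)
print(index["u2"])
```

The aggregation has to touch every record no matter how the data is keyed, whereas the lookup for a single user only needs the entry stored under that key. That is only an analogy for the scan versus random-access distinction, not a statement about how either system is implemented.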
