Hi, I am not an expert, but I can offer some ideas. (Please correct me if I am wrong :) )
Regardless of whether you use HBase or Hive, the data is stored in HDFS at the end of the day.

What Hive provides is an SQL interface over raw data. When you load data into Hive, you define its fields, columns, parsing strategy, and so on. Your data is stored as is in HDFS, but Hive maintains metadata tables about this raw data, so you can write SQL queries over the log data and Hive returns the results. Use Hive for data-warehouse-like operations: when you want to transform data into a different format or produce analysis reports. As far as I know, Hive is not suitable for real-time queries.

HBase is a column-oriented (column-family) database that uses HDFS as its underlying filesystem. It stores data in its own format. You have to use the HBase API when you want to insert a row into the database; it does not provide an SQL interface. HBase is suitable if you want real-time inserts and selects. In your case, you can insert weblogs into HBase in real time, and then query a user's clicks from HBase. HBase returns all of that user's clicks with their timestamps.

Regarding clickstream analysis, what I prefer is to write a couple of MapReduce jobs that analyze the log data and fill a data model in a relational database for further analysis and queries. You can run the MapReduce jobs periodically. Once you decide on a "good" data model, generating further reports will be easy.

Kind regards.

> I have very basic question regarding Hbase,HDFS and Hive,
>
> If Hbase, HDFS and Hive can be used to store a log data what is best to
> store data means should we store it on HDFS or Hbase or Hive. And are there
> any benefits associated with it.
>
> Also, Do we need to follow any design principal in terms of data modelling
> as I could not find anything on this subject.
>
> I am trying to learn Hadoop by implementing clickstream analysis use case.
>
> Thanks,
> Kuldeep
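
P.S. To make the periodic MapReduce idea from the reply a bit more concrete, here is a minimal sketch in plain Python that only mimics the map and reduce steps in memory. It is not Hadoop code. The log format ("<timestamp> <user_id> <url>"), the sample data, and the function names are all assumptions made for illustration.

```python
from itertools import groupby

# Hypothetical log format (an assumption): "<timestamp> <user_id> <url>"
SAMPLE_LOG = [
    "2012-03-01T10:00:00 alice /home",
    "2012-03-01T10:00:05 bob /home",
    "2012-03-01T10:01:10 alice /products",
    "this line is malformed and will be skipped",
]


def map_phase(lines):
    """Emit a (user_id, 1) pair for every well-formed click line."""
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # ignore lines that do not match the assumed format
        _timestamp, user_id, _url = parts
        yield user_id, 1


def reduce_phase(pairs):
    """Sum the counts per user; sorting stands in for the shuffle step."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    for user_id, group in groupby(ordered, key=lambda kv: kv[0]):
        yield user_id, sum(count for _, count in group)


def clicks_per_user(lines):
    """Run both phases and return a {user_id: click_count} mapping."""
    return dict(reduce_phase(map_phase(lines)))


if __name__ == "__main__":
    for user, count in sorted(clicks_per_user(SAMPLE_LOG).items()):
        print(user, count)
```

In a real setup the same two steps would run as a MapReduce job over the files in HDFS, and the per-user totals would be written into a table of the relational data model instead of being printed.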