Re: Confusing questions ! Hadoop Beginner

yavuz gokirmak Thu, 10 May 2012 07:33:52 -0700

Hi,

I am not an expert but can give some ideas. (Correct me if I am wrong
please :) )


Regardless of whether you use HBase or Hive, the data is stored in HDFS at
the end of the day.

What Hive provides is an SQL interface over raw data. When you load data into
Hive, you define its fields (columns), the parsing strategy and so on. Your
data is stored as is in HDFS, but Hive maintains metadata tables about this
raw data. So you can write SQL queries over log data and Hive returns the
results. But use Hive for data-warehouse-like operations: when you want to
transform data into a different format or produce analysis reports about it.
As far as I know, Hive is not suitable for real-time queries, because its
queries run as batch MapReduce jobs.
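
To make the "data stored as is, metadata kept separately" idea concrete, here is a minimal sketch in plain Python. It is not Hive and does not use any Hive API; the log layout, the user names and the TABLE_META dictionary are all made up for illustration. It only shows a schema being applied when the data is read, followed by a GROUP BY style count.

```python
import csv
import io

# Hypothetical raw log lines, kept unchanged (as they would sit in HDFS).
RAW_LOG = (
    "2012-05-10 07:30:01\talice\t/home\n"
    "2012-05-10 07:30:09\talice\t/products\n"
    "2012-05-10 07:31:15\tbob\t/home\n"
)

# Metadata kept apart from the data: column names and the parsing strategy.
# This dictionary is an assumption for the sketch, not a Hive structure.
TABLE_META = {"columns": ["ts", "user", "url"], "delimiter": "\t"}


def read_rows(raw, meta):
    """Apply the schema at read time; the raw text itself is never rewritten."""
    reader = csv.reader(io.StringIO(raw), delimiter=meta["delimiter"])
    for fields in reader:
        yield dict(zip(meta["columns"], fields))


def count_by(rows, column):
    """Rough equivalent of SELECT column, COUNT(*) ... GROUP BY column."""
    counts = {}
    for row in rows:
        counts[row[column]] = counts.get(row[column], 0) + 1
    return counts


hits_per_user = count_by(read_rows(RAW_LOG, TABLE_META), "user")
print(hits_per_user)
```

Changing TABLE_META changes how the same raw bytes are interpreted, which is roughly why the raw files never need to be rewritten.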

HBase is a column-oriented database that uses HDFS as its underlying
filesystem. It stores data in its own format. You have to use the HBase API
when you want to insert a row into the database; it does not provide an SQL
interface. HBase is suitable if you want real-time inserts and lookups. In
your case, you can insert weblogs into HBase in real time, and then query a
user's clicks from HBase. HBase returns all of that user's clicks with their
timestamps.
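
One way to picture "query a user's clicks" is a store that keeps rows sorted by key, with the user id at the front of the key. The sketch below is a toy in-memory stand-in written in plain Python, not the HBase API; the SortedKeyStore class, the click_key helper and the "user#timestamp" key layout are assumptions made up for this example.

```python
import bisect


class SortedKeyStore:
    """Toy in-memory stand-in for a store that keeps rows sorted by key."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan_prefix(self, prefix):
        """Yield (key, value) for every key starting with prefix, in key order."""
        start = bisect.bisect_left(self._keys, prefix)
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            yield key, self._rows[key]


def click_key(user, ts_millis):
    # Hypothetical key layout: the zero-padded timestamp makes string order
    # match time order within one user.
    return f"{user}#{ts_millis:013d}"


store = SortedKeyStore()
store.put(click_key("bob", 1336660275000), "/home")
store.put(click_key("alice", 1336660209000), "/products")
store.put(click_key("alice", 1336660201000), "/home")

alice_clicks = list(store.scan_prefix("alice#"))
for key, url in alice_clicks:
    print(key, url)
```

Because all keys for one user are adjacent, reading that user's clicks is a single short range scan rather than a pass over the whole log.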

Regarding clickstream analysis, what I prefer is to write a couple of
MapReduce jobs that analyze the log data and fill a data model in a
relational database for further analysis and queries. You can run the
MapReduce jobs periodically. Once you decide on a "good" data model,
generating further reports will be easy.
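
The approach above can be sketched in a few lines of plain Python. This is not Hadoop code: the map, shuffle and reduce steps are simulated in memory, SQLite stands in for the relational database, and the log layout and the page_hits table are assumptions made up for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical tab-separated log lines: timestamp, user, url.
LOG_LINES = [
    "2012-05-10 07:30:01\talice\t/home",
    "2012-05-10 07:30:09\talice\t/products",
    "2012-05-10 07:31:15\tbob\t/home",
]


def map_click(line):
    """Map step: emit ((user, url), 1) for each click."""
    _ts, user, url = line.split("\t")
    yield (user, url), 1


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Simulate map, shuffle (group by key) and reduce in memory."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_click(line):
            groups[key].append(value)
    return [reduce_counts(key, values) for key, values in sorted(groups.items())]


# Load the job output into a small relational data model.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_hits ("
    "user_id TEXT, url TEXT, hits INTEGER, PRIMARY KEY (user_id, url))"
)
conn.executemany(
    "INSERT OR REPLACE INTO page_hits VALUES (?, ?, ?)",
    [(user, url, hits) for (user, url), hits in run_job(LOG_LINES)],
)
conn.commit()

# Further reports are then ordinary SQL over the aggregated table.
report = conn.execute(
    "SELECT url, SUM(hits) FROM page_hits GROUP BY url ORDER BY url"
).fetchall()
print(report)
```

Running the job periodically and replacing the affected rows keeps the relational tables small, so later reports do not have to touch the raw logs again.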


Kind regards,


> I have a very basic question regarding Hbase, HDFS and Hive.
>
> If Hbase, HDFS and Hive can all be used to store log data, which is best:
> should we store it on HDFS, Hbase or Hive? And are there any benefits
> associated with each?
>

> Also, do we need to follow any design principles in terms of data
> modelling? I could not find anything on this subject.
>
> I am trying to learn Hadoop by implementing clickstream analysis use case.
>
> Thanks,
> Kuldeep
>
>
