I don't know that I would recommend Impala at this stage in its development.
Sorry, it has a bit of growing up to do.

It's interesting, but no UDFs, right?

Sent from a remote device. Please excuse any typos...

Mike Segel

On Dec 13, 2012, at 4:42 PM, "Kevin O'dell" <kevin.od...@cloudera.com> wrote:

> Correct, Impala relies on the Hive Metastore.
> 
> On Thu, Dec 13, 2012 at 11:38 AM, Manoj Babu <manoj...@gmail.com> wrote:
> 
>> Kevin,
>> 
>> Impala requires Hive, right?
>> So to get the advantages of Impala, do we need to go with Hive?
>> 
>> 
>> Cheers!
>> Manoj.
>> 
>> 
>> 
>> On Thu, Dec 13, 2012 at 9:03 PM, Mohammad Tariq <donta...@gmail.com>
>> wrote:
>> 
>>> Thank you so much for the clarification Kevin.
>>> 
>>> Regards,
>>>    Mohammad Tariq
>>> 
>>> 
>>> 
>>> On Thu, Dec 13, 2012 at 9:00 PM, Kevin O'dell
>>> <kevin.od...@cloudera.com> wrote:
>>> 
>>>> Mohammad,
>>>> 
>>>>  I am not sure you are thinking about Impala correctly. It still
>>>> uses HDFS, so your data increasing over time is fine. You are not
>>>> going to need to tune for special CPU, storage, or network.
>>>> Typically with Impala you are going to be bound by the disks, as it
>>>> works off of data locality. You can also use Snappy, GZip, or BZip2
>>>> compression to help with the amount of data you are storing. You
>>>> will not need to frequently update your hardware.
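A rough illustration of why those codecs help (a toy sketch using Python's standard gzip module; the sample rows are invented, and real HBase/Impala compression happens at the file/block level, not like this):

```python
import gzip

# Invented, repetitive warehouse-style rows; real fact data often
# repeats values the same way, which is why these codecs pay off.
raw = b"2012-12-12 12:12:12,abcdef,login\n" * 10_000

compressed = gzip.compress(raw)

# Repetitive rows shrink dramatically; decompression is lossless.
print(len(raw), len(compressed))
assert gzip.decompress(compressed) == raw
```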
>>>> 
>>>> On Thu, Dec 13, 2012 at 10:06 AM, Mohammad Tariq <donta...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Oh yes..Impala..good point by Kevin.
>>>>> 
>>>>> Kevin: Would it be appropriate to say that I should go for Impala
>>>>> if my data is not going to increase dramatically over time, or if
>>>>> I have to work on only a subset of my BigData? Since Impala uses
>>>>> MPP, it may require specialized hardware tuned for CPU, storage,
>>>>> and network performance for better results, which could become a
>>>>> problem if I have to upgrade the hardware frequently because of
>>>>> the growing data.
>>>>> 
>>>>> Regards,
>>>>>    Mohammad Tariq
>>>>> 
>>>>> 
>>>>> 
>>>>>> On Thu, Dec 13, 2012 at 8:17 PM, Kevin O'dell
>>>>>> <kevin.od...@cloudera.com> wrote:
>>>>> 
>>>>>> To Mohammad's point: you can use HBase for quick scans of the
>>>>>> data, Hive for your longer-running jobs, and Impala over the two
>>>>>> for quick ad-hoc searches.
>>>>>> 
>>>>>> On Thu, Dec 13, 2012 at 9:44 AM, Mohammad Tariq
>>>>>> <donta...@gmail.com> wrote:
>>>>>> 
>>>>>>> I am not saying HBase is not good. My point was to consider Hive
>>>>>>> as well. Think about the approach keeping both tools in mind,
>>>>>>> and decide. I just provided an option keeping in mind the
>>>>>>> available built-in Hive features. I would like to add one more
>>>>>>> point here: you can map your HBase tables to Hive.
>>>>>>> 
>>>>>>> Regards,
>>>>>>>    Mohammad Tariq
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Thu, Dec 13, 2012 at 7:58 PM, bigdata
>>>>>>> <bigdatab...@outlook.com> wrote:
>>>>>>> 
>>>>>>>> Hi, Tariq
>>>>>>>> Thanks for your feedback. Actually, we now have two ways to
>>>>>>>> reach the target: by Hive and by HBase. Could you tell me why
>>>>>>>> HBase is not good for my requirements? Or what's the problem
>>>>>>>> with my solution?
>>>>>>>> Thanks.
>>>>>>>> 
>>>>>>>>> From: donta...@gmail.com
>>>>>>>>> Date: Thu, 13 Dec 2012 15:43:25 +0530
>>>>>>>>> Subject: Re: How to design a data warehouse in HBase?
>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>> 
>>>>>>>>> Both have got different purposes. Normally people say that
>>>>>>>>> Hive is slow, but that's just because it uses MapReduce under
>>>>>>>>> the hood. And I'm sure that if the data stored in HBase is
>>>>>>>>> very huge, nobody would write sequential programs for Get or
>>>>>>>>> Scan. Instead they will write MR jobs or do something similar.
>>>>>>>>> 
>>>>>>>>> My point is that nothing can be 100% real time. Is that what
>>>>>>>>> you want? If that is the case, I would never suggest Hadoop in
>>>>>>>>> the first place, as it's a batch processing system and cannot
>>>>>>>>> be used like an OLTP system, unless you have thought of some
>>>>>>>>> additional stuff. Since you are talking about a warehouse, I
>>>>>>>>> am assuming you are going to store and process gigantic
>>>>>>>>> amounts of data. That's the only reason I had suggested Hive.
>>>>>>>>> 
>>>>>>>>> The whole point is that everything is not a solution for
>>>>>>>>> everything. One size doesn't fit all. First, we need to
>>>>>>>>> analyze our particular use case. The person who says Hive is
>>>>>>>>> slow might be correct, but only for his scenario.
>>>>>>>>> 
>>>>>>>>> HTH
>>>>>>>>> 
>>>>>>>>> Regards,
>>>>>>>>>    Mohammad Tariq
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> On Thu, Dec 13, 2012 at 3:17 PM, bigdata
>>>>>>>>> <bigdatab...@outlook.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> I've heard that Hive's performance is too low: it accesses
>>>>>>>>>> HDFS files and scans all the data to search for one record.
>>>>>>>>>> Is that true? And is HBase much faster than it?
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> From: donta...@gmail.com
>>>>>>>>>>> Date: Thu, 13 Dec 2012 15:12:25 +0530
>>>>>>>>>>> Subject: Re: How to design a data warehouse in HBase?
>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>> 
>>>>>>>>>>> Hi there,
>>>>>>>>>>> 
>>>>>>>>>>>    If you are really planning for a warehousing solution,
>>>>>>>>>>> then I would suggest you have a look at Apache Hive. It
>>>>>>>>>>> provides warehousing capabilities on top of a Hadoop
>>>>>>>>>>> cluster. Along with that, it also provides an SQL-ish
>>>>>>>>>>> interface to the data stored in your warehouse, which would
>>>>>>>>>>> be very helpful in case you are coming from an SQL
>>>>>>>>>>> background.
>>>>>>>>>>> 
>>>>>>>>>>> HTH
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Regards,
>>>>>>>>>>>    Mohammad Tariq
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Thu, Dec 13, 2012 at 2:43 PM, bigdata
>>>>>>>>>>> <bigdatab...@outlook.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Thanks. I think a real example is better for me to
>>>>>>>>>>>> understand your suggestions.
>>>>>>>>>>>> Now I have a relational table:
>>>>>>>>>>>> 
>>>>>>>>>>>> ID   LoginTime             DeviceID
>>>>>>>>>>>> 1    2012-12-12 12:12:12   abcdef
>>>>>>>>>>>> 2    2012-12-12 19:12:12   abcdef
>>>>>>>>>>>> 3    2012-12-13 10:10:10   defdaf
>>>>>>>>>>>> 
>>>>>>>>>>>> There are several requirements for this table:
>>>>>>>>>>>> 1. How many devices logged in each day?
>>>>>>>>>>>> 2. For one day, how many new devices logged in? (never
>>>>>>>>>>>>    logged in before)
>>>>>>>>>>>> 3. For one day, how many accumulated devices have logged in?
>>>>>>>>>>>> 
>>>>>>>>>>>> How can I design HBase tables to calculate these data? Now
>>>>>>>>>>>> my solution is:
>>>>>>>>>>>> 
>>>>>>>>>>>> table A:
>>>>>>>>>>>> rowkey: date-deviceid
>>>>>>>>>>>> column family: login
>>>>>>>>>>>> column qualifier: 2012-12-12 12:12:12 / 2012-12-12
>>>>>>>>>>>> 19:12:12 ...
>>>>>>>>>>>> 
>>>>>>>>>>>> table B:
>>>>>>>>>>>> rowkey: deviceid
>>>>>>>>>>>> column family: null or any value
>>>>>>>>>>>> 
>>>>>>>>>>>> For req#1, I can scan table A with a PrefixFilter on the
>>>>>>>>>>>> rowkey to check one specific date, and count the records.
>>>>>>>>>>>> For req#2, I check table B for each deviceid and count the
>>>>>>>>>>>> result. For req#3, I count table A with a PrefixFilter as
>>>>>>>>>>>> in req#1.
>>>>>>>>>>>> Is that OK? Or are there better solutions?
>>>>>>>>>>>> Thanks!!
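A toy in-memory sketch of that proposed design (plain Python, not the HBase client API; the helper names are invented), showing how a date-deviceid rowkey plus a prefix scan answers the three requirements:

```python
# Login events from the sample table: (day, deviceid).
events = [
    ("2012-12-12", "abcdef"),
    ("2012-12-12", "abcdef"),   # same device twice on one day
    ("2012-12-13", "defdaf"),
]

# Table A: sorted "date-deviceid" rowkeys, as HBase would store them.
table_a = sorted(f"{day}-{dev}" for day, dev in events)

def prefix_scan(rows, prefix):
    """Simulate a Scan with a PrefixFilter on the rowkey."""
    return [r for r in rows if r.startswith(prefix)]

# Req#1: distinct devices that logged in on a given day.
def devices_on(day):
    return {r[len(day) + 1:] for r in prefix_scan(table_a, day + "-")}

# Table B stand-in: deviceid -> first day it was ever seen.
first_seen = {}
for day, dev in sorted(events):
    first_seen.setdefault(dev, day)

# Req#2: devices whose first-ever login is this day.
def new_devices_on(day):
    return {d for d in devices_on(day) if first_seen[d] == day}

# Req#3: accumulated distinct devices up to and including a day.
def accumulated_through(day):
    return {d for d, first in first_seen.items() if first <= day}

print(len(devices_on("2012-12-12")))           # 1 (abcdef, deduplicated)
print(len(new_devices_on("2012-12-13")))       # 1 (defdaf)
print(len(accumulated_through("2012-12-13")))  # 2
```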
>>>>>>>>>>>> 
>>>>>>>>>>>>> CC: user@hbase.apache.org
>>>>>>>>>>>>> From: michael_se...@hotmail.com
>>>>>>>>>>>>> Subject: Re: How to design a data warehouse in HBase?
>>>>>>>>>>>>> Date: Thu, 13 Dec 2012 08:43:31 +0000
>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>> 
>>>>>>>>>>>>> You need to spend a bit of time on schema design.
>>>>>>>>>>>>> You need to flatten your schema...
>>>>>>>>>>>>> Implement some secondary indexing to improve join
>>>>>>>>>>>>> performance...
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Depends on what you want to do... There are other options
>>>>>>>>>>>>> too...
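One way to read the secondary-indexing suggestion (a toy in-memory sketch in plain Python, not a real HBase index or coprocessor; all table and column names are invented): maintain a second table keyed by the indexed value, holding the main table's rowkeys, so a join-style lookup becomes point reads instead of a full scan.

```python
# Main table: rowkey -> flattened, denormalized row data.
main = {
    "user#1": {"name": "alice", "city": "Austin"},
    "user#2": {"name": "bob",   "city": "Boston"},
    "user#3": {"name": "carol", "city": "Austin"},
}

# Secondary index table: indexed value -> set of main-table rowkeys.
# In HBase this would be maintained on every write to the main table.
city_index = {}
for rowkey, row in main.items():
    city_index.setdefault(row["city"], set()).add(rowkey)

def lookup_by_city(city):
    """Two cheap gets instead of scanning the whole main table."""
    return [main[rk] for rk in sorted(city_index.get(city, ()))]

print([r["name"] for r in lookup_by_city("Austin")])  # ['alice', 'carol']
```

The design choice: the index must be updated atomically with the main table (or rebuilt periodically), which is the usual cost of secondary indexes in HBase.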
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Mike Segel
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Dec 13, 2012, at 7:09 AM, lars hofhansl
>>>>>>>>>>>>> <lhofha...@yahoo.com> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> For OLAP-type queries you will generally be better off
>>>>>>>>>>>>>> with a truly column-oriented database.
>>>>>>>>>>>>>> You can probably shoehorn HBase into this, but it wasn't
>>>>>>>>>>>>>> really designed with raw scan performance along single
>>>>>>>>>>>>>> columns in mind.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>> From: bigdata <bigdatab...@outlook.com>
>>>>>>>>>>>>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>>>>>>>>>>>>>> Sent: Wednesday, December 12, 2012 9:57 PM
>>>>>>>>>>>>>> Subject: How to design a data warehouse in HBase?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Dear all,
>>>>>>>>>>>>>> We have a traditional star-model data warehouse in an
>>>>>>>>>>>>>> RDBMS, and now we want to transfer it to HBase. After
>>>>>>>>>>>>>> studying HBase, I have learned that HBase is normally
>>>>>>>>>>>>>> queried by rowkey:
>>>>>>>>>>>>>> 1. full rowkey (fastest)
>>>>>>>>>>>>>> 2. rowkey filter (fast)
>>>>>>>>>>>>>> 3. column family/qualifier filter (slow)
>>>>>>>>>>>>>> How can I design the HBase tables to implement the
>>>>>>>>>>>>>> warehouse functions, like:
>>>>>>>>>>>>>> 1. Query by DimensionA
>>>>>>>>>>>>>> 2. Query by DimensionA and DimensionB
>>>>>>>>>>>>>> 3. Sum, count, distinct ...
>>>>>>>>>>>>>> In my opinion, I should create several HBase tables with
>>>>>>>>>>>>>> all combinations of the different dimensions as the
>>>>>>>>>>>>>> rowkey. This solution will lead to huge data duplication.
>>>>>>>>>>>>>> Are there any good suggestions to solve it?
>>>>>>>>>>>>>> Thanks a lot!
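The duplication being asked about can be made concrete with a toy sketch (plain Python with invented dimension values, not HBase code): a composite rowkey "A|B" serves prefix queries on DimensionA alone or on both dimensions, but a DimensionB-first query needs a second table with the key order reversed, i.e. a duplicated copy of the data.

```python
# Invented fact rows: (DimensionA, DimensionB, measure).
facts = [
    ("east", "2012-12", 10),
    ("east", "2012-11", 7),
    ("west", "2012-12", 4),
]

# Table keyed A|B: answers "query by A" and "by A and B" via prefix.
by_a = sorted((f"{a}|{b}", m) for a, b, m in facts)
# Duplicate table keyed B|A: needed to answer "query by B" efficiently.
by_b = sorted((f"{b}|{a}", m) for a, b, m in facts)

def prefix_sum(table, prefix):
    """Sum the measure over all rowkeys starting with the prefix."""
    return sum(m for k, m in table if k.startswith(prefix))

print(prefix_sum(by_a, "east"))          # 17: all of DimensionA = east
print(prefix_sum(by_a, "east|2012-12"))  # 10: DimensionA and DimensionB
print(prefix_sum(by_b, "2012-12"))       # 14: all of DimensionB = 2012-12
```

Each extra leading-dimension access path costs one more copy of the data, which is the duplication trade-off raised in the question.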
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Kevin O'Dell
>>>>>> Customer Operations Engineer, Cloudera
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Kevin O'Dell
>>>> Customer Operations Engineer, Cloudera
> 
> 
> 
> -- 
> Kevin O'Dell
> Customer Operations Engineer, Cloudera
