I don't know that I would recommend Impala at this stage in its development. Sorry, it has a bit of growing up to do.
It's interesting, but no UDFs, right?

Sent from a remote device. Please excuse any typos...

Mike Segel

On Dec 13, 2012, at 4:42 PM, "Kevin O'dell" <kevin.od...@cloudera.com> wrote:

> Correct, Impala relies on the Hive Metastore.
>
> On Thu, Dec 13, 2012 at 11:38 AM, Manoj Babu <manoj...@gmail.com> wrote:
>
>> Kevin,
>>
>> Impala requires Hive right?
>> so to get the advantages of Impala do we need to go with Hive?
>>
>> Cheers!
>> Manoj.
>>
>> On Thu, Dec 13, 2012 at 9:03 PM, Mohammad Tariq <donta...@gmail.com> wrote:
>>
>>> Thank you so much for the clarification Kevin.
>>>
>>> Regards,
>>> Mohammad Tariq
>>>
>>> On Thu, Dec 13, 2012 at 9:00 PM, Kevin O'dell <kevin.od...@cloudera.com> wrote:
>>>
>>>> Mohammad,
>>>>
>>>> I am not sure you are thinking about Impala correctly. It still uses
>>>> HDFS so your data increasing over time is fine. You are not going to
>>>> need to tune for special CPU, Storage, or Network. Typically with
>>>> Impala you are going to be bound at the disks as it functions off of
>>>> data locality. You can also use compression of Snappy, GZip, and BZip
>>>> to help with the amount of data you are storing. You will not need to
>>>> frequently update your hardware.
>>>>
>>>> On Thu, Dec 13, 2012 at 10:06 AM, Mohammad Tariq <donta...@gmail.com> wrote:
>>>>
>>>>> Oh yes..Impala..good point by Kevin.
>>>>>
>>>>> Kevin : Would it be appropriate if I say that I should go for Impala
>>>>> if my data is not going to increase dramatically over time or if I
>>>>> have to work on only a subset of my BigData? Since Impala uses MPP,
>>>>> it may require specialized hardware tuned for CPU, storage and
>>>>> network performance for better results, which could become a problem
>>>>> if have to upgrade the hardware frequently because of the growing
>>>>> data.
>>>>>
>>>>> Regards,
>>>>> Mohammad Tariq
>>>>>
>>>>> On Thu, Dec 13, 2012 at 8:17 PM, Kevin O'dell <kevin.od...@cloudera.com> wrote:
>>>>>
>>>>>> To Mohammad's point. You can use HBase for quick scans of the data.
>>>>>> Hive for your longer running jobs. Impala over the two for quick
>>>>>> adhoc searches.
>>>>>>
>>>>>> On Thu, Dec 13, 2012 at 9:44 AM, Mohammad Tariq <donta...@gmail.com> wrote:
>>>>>>
>>>>>>> I am not saying Hbase is not good. My point was to consider Hive
>>>>>>> as well. Think about the approach keeping both the tools in mind
>>>>>>> and decide. I just provided an option keeping in mind the
>>>>>>> available built-in Hive features. I would like to add one more
>>>>>>> point here, you can map your Hbase tables to Hive.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Mohammad Tariq
>>>>>>>
>>>>>>> On Thu, Dec 13, 2012 at 7:58 PM, bigdata <bigdatab...@outlook.com> wrote:
>>>>>>>
>>>>>>>> Hi, Tariq
>>>>>>>> Thanks for your feedback. Actually, now we have two ways to reach
>>>>>>>> the target, by Hive and by HBase. Could you tell me why HBase is
>>>>>>>> not good for my requirements? Or what's the problem in my
>>>>>>>> solution?
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>>> From: donta...@gmail.com
>>>>>>>>> Date: Thu, 13 Dec 2012 15:43:25 +0530
>>>>>>>>> Subject: Re: How to design a data warehouse in HBase?
>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>
>>>>>>>>> Both have got different purposes. Normally people say that Hive
>>>>>>>>> is slow, that's just because it uses MapReduce under the hood.
>>>>>>>>> And i'm sure that if the data stored in HBase is very huge,
>>>>>>>>> nobody would write sequential programs for Get or Scan. Instead
>>>>>>>>> they will write MP jobs or do something similar.
>>>>>>>>>
>>>>>>>>> My point is that nothing can be 100% real time. Is that what you
>>>>>>>>> want? If that is the case I would never suggest Hadoop on the
>>>>>>>>> first place as it's a batch processing system and cannot be used
>>>>>>>>> like an OLTP system, unless you have thought of some additional
>>>>>>>>> stuff. Since you are talking about warehouse, I am assuming you
>>>>>>>>> are going to store and process gigantic amounts of data. That's
>>>>>>>>> the only reason I had suggested Hive.
>>>>>>>>>
>>>>>>>>> The whole point is that everything is not a solution for
>>>>>>>>> everything. One size doesn't fit all. First, we need to analyze
>>>>>>>>> our particular use case. The person, who says Hive is slow,
>>>>>>>>> might be correct. But only for his scenario.
>>>>>>>>>
>>>>>>>>> HTH
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Mohammad Tariq
>>>>>>>>>
>>>>>>>>> On Thu, Dec 13, 2012 at 3:17 PM, bigdata <bigdatab...@outlook.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> I've got the information that HIVE 's performance is too low.
>>>>>>>>>> It access HDFS files and scan all data to search one record.
>>>>>>>>>> IS it TRUE? And HBase is much faster than it.
>>>>>>>>>>
>>>>>>>>>>> From: donta...@gmail.com
>>>>>>>>>>> Date: Thu, 13 Dec 2012 15:12:25 +0530
>>>>>>>>>>> Subject: Re: How to design a data warehouse in HBase?
>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>
>>>>>>>>>>> Hi there,
>>>>>>>>>>>
>>>>>>>>>>> If you are really planning for a warehousing solution then I
>>>>>>>>>>> would suggest you to have a look over Apache Hive. It provides
>>>>>>>>>>> you warehousing capabilities on top of a Hadoop cluster. Along
>>>>>>>>>>> with that it also provides an SQLish interface to the data
>>>>>>>>>>> stored in your warehouse, which would be very helpful to you,
>>>>>>>>>>> in case you are coming from an SQL background.
>>>>>>>>>>>
>>>>>>>>>>> HTH
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Mohammad Tariq
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Dec 13, 2012 at 2:43 PM, bigdata <bigdatab...@outlook.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks. I think a real example is better for me to understand
>>>>>>>>>>>> your suggestions.
>>>>>>>>>>>> Now I have a relational table:
>>>>>>>>>>>>
>>>>>>>>>>>> ID  LoginTime            DeviceID
>>>>>>>>>>>> 1   2012-12-12 12:12:12  abcdef
>>>>>>>>>>>> 2   2012-12-12 19:12:12  abcdef
>>>>>>>>>>>> 3   2012-12-13 10:10:10  defdaf
>>>>>>>>>>>>
>>>>>>>>>>>> There are several requirements about this table:
>>>>>>>>>>>> 1. How many device login in each day?
>>>>>>>>>>>> 2. For one day, how many new device login? (never login before)
>>>>>>>>>>>> 3. For one day, how many accumulated device login?
>>>>>>>>>>>>
>>>>>>>>>>>> How can I design HBase tables to calculate these data?
>>>>>>>>>>>> Now my solution is:
>>>>>>>>>>>>
>>>>>>>>>>>> table A:
>>>>>>>>>>>> rowkey: date-deviceid
>>>>>>>>>>>> column family: login
>>>>>>>>>>>> column qualifier: 2012-12-12 12:12:12/2012-12-12 19:12:12....
>>>>>>>>>>>>
>>>>>>>>>>>> table B:
>>>>>>>>>>>> rowkey: deviceid
>>>>>>>>>>>> column family: null or anyvalue
>>>>>>>>>>>>
>>>>>>>>>>>> For req#1, I can scan table A and use prefixfilter(rowkey) to
>>>>>>>>>>>> check one special date, and get records count
>>>>>>>>>>>> For req#2, I get table b with each deviceid, and count result
>>>>>>>>>>>> For req#3, count table A with prefixfilter like 1.
>>>>>>>>>>>> Does it OK? Or other better solutions?
>>>>>>>>>>>> Thanks!!
>>>>>>>>>>>>
>>>>>>>>>>>>> CC: user@hbase.apache.org
>>>>>>>>>>>>> From: michael_se...@hotmail.com
>>>>>>>>>>>>> Subject: Re: How to design a data warehouse in HBase?
>>>>>>>>>>>>> Date: Thu, 13 Dec 2012 08:43:31 +0000
>>>>>>>>>>>>> To: user@hbase.apache.org
>>>>>>>>>>>>>
>>>>>>>>>>>>> You need to spend a bit of time on Schema design.
>>>>>>>>>>>>> You need to flatten your Schema...
>>>>>>>>>>>>> Implement some secondary indexing to improve join performance...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Depends on what you want to do... There are other options too...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sent from a remote device. Please excuse any typos...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mike Segel
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Dec 13, 2012, at 7:09 AM, lars hofhansl <lhofha...@yahoo.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> For OLAP type queries you will generally be better off with
>>>>>>>>>>>>>> a truly column oriented database.
>>>>>>>>>>>>>> You can probably shoehorn HBase into this, but it wasn't
>>>>>>>>>>>>>> really designed with raw scan performance along single
>>>>>>>>>>>>>> columns in mind.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> ________________________________
>>>>>>>>>>>>>> From: bigdata <bigdatab...@outlook.com>
>>>>>>>>>>>>>> To: "user@hbase.apache.org" <user@hbase.apache.org>
>>>>>>>>>>>>>> Sent: Wednesday, December 12, 2012 9:57 PM
>>>>>>>>>>>>>> Subject: How to design a data warehouse in HBase?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Dear all,
>>>>>>>>>>>>>> We have a traditional star-model data warehouse in RDBMS,
>>>>>>>>>>>>>> now we want to transfer it to HBase. After study HBase, I
>>>>>>>>>>>>>> learn that HBase is normally can be query by rowkey.
>>>>>>>>>>>>>> 1. full rowkey (fastest)
>>>>>>>>>>>>>> 2. rowkey filter (fast)
>>>>>>>>>>>>>> 3. column family/qualifier filter (slow)
>>>>>>>>>>>>>> How can I design the HBase tables to implement the
>>>>>>>>>>>>>> warehouse functions, like:
>>>>>>>>>>>>>> 1. Query by DimensionA
>>>>>>>>>>>>>> 2. Query by DimensionA and DimensionB
>>>>>>>>>>>>>> 3. Sum, count, distinct ...
>>>>>>>>>>>>>> From my opinion, I should create several HBase tables with
>>>>>>>>>>>>>> all combinations of different dimensions as the rowkey.
>>>>>>>>>>>>>> This solution will lead to huge data duplication. Is there
>>>>>>>>>>>>>> any good suggestions to solve it?
>>>>>>>>>>>>>> Thanks a lot!
>>>>>>
>>>>>> --
>>>>>> Kevin O'Dell
>>>>>> Customer Operations Engineer, Cloudera
>>>>
>>>> --
>>>> Kevin O'Dell
>>>> Customer Operations Engineer, Cloudera
>
> --
> Kevin O'Dell
> Customer Operations Engineer, Cloudera
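
For the login-table question quoted in this thread, here is a rough sketch of how the proposed two-table layout could answer the three requirements. It is plain Python and does not use any HBase client: a sorted list stands in for HBase's lexicographically sorted row keys, and a prefix scan over it stands in for a PrefixFilter-style scan. All names (table_a, table_b, record_login, and so on) are hypothetical. Storing the first login date as the value in table B is an assumption; the thread leaves that column open. "Accumulated" is read here as "devices first seen on or before the day", which is only one possible interpretation.

```python
from bisect import bisect_left, insort

# Hypothetical in-memory stand-ins for the two tables; this is not HBase code.
table_a = []        # sorted row keys "<date>-<deviceid>" (HBase keeps row keys sorted)
table_a_cells = {}  # row key -> login times (the column qualifiers in the thread)
table_b = {}        # deviceid -> first login date (assumed value, not from the thread)


def record_login(login_time, device_id):
    """Write one login event into both tables."""
    date = login_time[:10]
    row_key = f"{date}-{device_id}"
    if row_key not in table_a_cells:
        insort(table_a, row_key)  # keep the key list sorted
        table_a_cells[row_key] = []
    table_a_cells[row_key].append(login_time)
    # Remember the earliest date on which each device was seen.
    if device_id not in table_b or date < table_b[device_id]:
        table_b[device_id] = date


def prefix_scan(prefix):
    """Return the row keys of table A that start with the prefix."""
    rows = []
    for key in table_a[bisect_left(table_a, prefix):]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        rows.append(key)
    return rows


def devices_logged_in(date):
    """Req 1: distinct devices that logged in on the given day."""
    return len(prefix_scan(date + "-"))


def new_devices(date):
    """Req 2: devices whose first login was on the given day."""
    return sum(1 for first in table_b.values() if first == date)


def accumulated_devices(date):
    """Req 3: devices first seen on or before the given day."""
    return sum(1 for first in table_b.values() if first <= date)


# The three sample rows from the thread.
for login_time, device_id in [
    ("2012-12-12 12:12:12", "abcdef"),
    ("2012-12-12 19:12:12", "abcdef"),
    ("2012-12-13 10:10:10", "defdaf"),
]:
    record_login(login_time, device_id)

print(devices_logged_in("2012-12-12"))    # 1
print(new_devices("2012-12-13"))          # 1
print(accumulated_devices("2012-12-13"))  # 2
```

Note that the last two functions walk all of table B, which in HBase would mean a full scan. On real data volumes, a per-day counter or a batch job (the Hive route suggested above) would probably be the more realistic choice.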