Re: Dimensional Data Model on Hive

Justin Coffey Thu, 10 May 2012 12:30:05 -0700

Hello,
   My thoughts are rather straightforward: it is best not to think of Hive
as a data warehouse at all.  Period.


It is better to think of it as a SQL-to-MapReduce translation layer with some
metadata to help guide the process.

With this in mind, and if you really have lots of data, what you want to do
is denormalize everything to avoid any and all joins (even map-side joins
are costly if they have to happen on each record).
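
As a rough illustration of the denormalization idea, here is a minimal sketch
that joins a fact table to a dimension once, at load time, and stores the wide
result so that later queries need no join. SQLite is used only as a stand-in
so the example runs anywhere; it is not Hive, and the table and column names
(fact_click, dim_page, clicks_flat) are hypothetical.

```python
import sqlite3

# SQLite stands in for Hive only so this sketch runs anywhere;
# all table and column names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_page (page_id INTEGER PRIMARY KEY, url TEXT, section TEXT);
    CREATE TABLE fact_click (click_ts TEXT, user_id INTEGER, page_id INTEGER);

    INSERT INTO dim_page VALUES (1, '/home', 'landing'), (2, '/cart', 'checkout');
    INSERT INTO fact_click VALUES
        ('2012-05-10 12:00:00', 42, 1),
        ('2012-05-10 12:00:05', 42, 2),
        ('2012-05-10 12:01:00', 7, 1);

    -- Pay for the join once, at load time, and store the wide result.
    CREATE TABLE clicks_flat AS
    SELECT f.click_ts, f.user_id, p.url, p.section
    FROM fact_click f
    JOIN dim_page p ON p.page_id = f.page_id;
""")

# Reporting queries now read a single table and need no join.
clicks_per_section = dict(conn.execute(
    "SELECT section, COUNT(*) FROM clicks_flat GROUP BY section"
).fetchall())
print(clicks_per_section)
```

The trade-off is the usual one: the dimension attributes are copied onto every
row, so storage grows, but each report reads one table in a single pass.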

Remember, you're not (really) querying indexed data (kinda not true, but
mostly valid).  You're querying distributed log files.
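
To make the "distributed log files" picture a bit more concrete, here is a
small sketch, in plain Python rather than Hive, of what a query amounts to:
a scan over files, where partitioning only limits which directories get read.
The dt=YYYY-MM-DD directory naming follows the key=value layout Hive uses for
partitions; the file name, the tab-separated record format, and the
count_hits helper are assumptions made up for the example.

```python
import os
import tempfile

# Hypothetical layout: one directory per day, named key=value the way Hive
# lays out partitions, each holding plain tab-separated log files.
root = tempfile.mkdtemp()
sample = {
    "2012-05-09": ["42\t/home", "7\t/cart"],
    "2012-05-10": ["42\t/cart", "9\t/home", "9\t/cart"],
}
for day, lines in sample.items():
    part_dir = os.path.join(root, "dt=" + day)
    os.makedirs(part_dir)
    with open(os.path.join(part_dir, "part-00000"), "w") as fh:
        fh.write("\n".join(lines) + "\n")


def count_hits(root, day, url):
    """Count hits on `url` for one day by scanning only that day's files."""
    part_dir = os.path.join(root, "dt=" + day)  # other days are never opened
    hits = 0
    for name in sorted(os.listdir(part_dir)):
        with open(os.path.join(part_dir, name)) as fh:
            for line in fh:  # no index: every record in the partition is read
                user_id, page = line.rstrip("\n").split("\t")
                if page == url:
                    hits += 1
    return hits


cart_hits = count_hits(root, "2012-05-10", "/cart")
print(cart_hits)
```

Within the chosen partition every record is still read, which is why extra
per-record work such as a join gets expensive at scale.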

-Justin

On Thu, May 10, 2012 at 5:17 PM, Jagat <[email protected]> wrote:

> Hello
>
> Try to keep the set of records you need for a particular analysis in the
> same table. Generally we use Pig to feed data to Hive tables, and we have
> arranged our tables so that all the data required for a particular report
> is present in that one table. This helps to improve Hive performance.
> While designing your schema, partition and index your tables depending on
> your queries.
>
> The fact and dimension concept should not be taken too seriously here.
>
>
>
> On Thu, May 10, 2012 at 6:56 PM, Kuldeep Chitrakar <
> [email protected]> wrote:
>
>>  Hi
>>
>> I have a data warehouse implementation for clickstream data analysis on
>> an RDBMS. It's a star schema (dimensions and facts).
>>
>> Now if I want to move to Hive, do I need to create the same data model of
>> dimensions and facts and join them?
>>
>> Or should I create a big denormalized table which contains all textual
>> attributes from all dimensions? If so, how do we handle SCD type 2
>> dimensions in Hive?
>>
>> It's a very basic question, but I am just confused about this.
>>
>> Thanks,
>> Kuldeep
>>
>
>


-- 
[email protected]