[
https://issues.apache.org/jira/browse/KYLIN-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
fengYu updated KYLIN-1172:
--------------------------
Attachment: 0001-support-more-hives-depend-on-different-hadoop.patch
> kylin support multi-hive on different hadoop cluster
> ----------------------------------------------------
>
> Key: KYLIN-1172
> URL: https://issues.apache.org/jira/browse/KYLIN-1172
> Project: Kylin
> Issue Type: Improvement
> Affects Versions: v1.0
> Reporter: fengYu
> Attachments: 0001-support-more-hives-depend-on-different-hadoop.patch
>
>
> Hi, I recently modified Kylin to support multiple Hives on different Hadoop
> clusters and take them as input sources to Kylin. We did this for the
> following reasons:
> 1. We have more than one Hadoop cluster and many Hives that depend on them
> (each product may have its own Hive). We cannot migrate those Hives into
> one, and we do not want to deploy a separate Kylin for every Hive source.
> 2. Our Hadoop clusters are deployed in different DCs, and we need to support
> them in one Kylin instance.
> 3. The source data in Hive is much smaller than the HFiles, so copying those
> files across DCs is more efficient (the fact distinct columns job and the
> base cuboid job take Hive data as input). We therefore deploy HBase and
> Hadoop in one DC (separated into different HDFS instances).
> So, we divide the data flow into 3 parts: Hive is the input source, Hadoop
> does the computing (which generates many temporary files), and HBase is the
> output. After cube building, queries on Kylin only interact with HBase.
> Therefore, what we need to do is build cubes based on different Hives and
> Hadoops.
> Our method is summarized below:
> 1. Deploy the Hives and Hadoops. Before starting Kylin, the user should
> deploy all Hives and Hadoops, ensure Hive SQL can be run via ./hive, and
> ensure every HDFS is accessible with the 'hadoop fs' command (add the extra
> nameservices to hdfs-site.xml).
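> For example, making a second cluster's HDFS reachable from one client
> typically means listing both nameservices in hdfs-site.xml; the cluster and
> host names below are illustrative:

```xml
<!-- hdfs-site.xml: expose two nameservices to one client (illustrative names) -->
<property>
  <name>dfs.nameservices</name>
  <value>cluster1,cluster2</value>
</property>
<!-- HA NameNodes of the remote cluster -->
<property>
  <name>dfs.ha.namenodes.cluster2</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.cluster2.nn1</name>
  <value>nn1.cluster2.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.cluster2.nn2</name>
  <value>nn2.cluster2.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.cluster2</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

> With this in place, 'hadoop fs -ls hdfs://cluster2/' works from the same
> client that normally talks to cluster1.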
> 2. Divide the Hives into two parts: the Hive used when Kylin starts (we call
> it the default one) and the additional ones. We allocate a name for every
> Hive (the default one's name is null). For simplicity, we add a config
> property that points to the root directory of all Hive clients, and every
> Hive client is a subdirectory whose name is the Hive's name (the default one
> does not need to be located there).
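> As a sketch, the layout this step assumes looks like the following; the
> property name and paths are hypothetical, not necessarily the ones in the
> patch:

```
# kylin.properties (hypothetical property name)
kylin.hive.client.root=/opt/hive-clients

# one subdirectory per additional hive, named after the hive:
/opt/hive-clients/
├── dc2-hive/        # hive named "dc2-hive"
│   ├── bin/hive
│   └── conf/hive-site.xml
└── product-a/       # hive named "product-a"
    ├── bin/hive
    └── conf/hive-site.xml

# the default hive (name null) is the one Kylin itself was started with,
# so it does not need a subdirectory here
```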
> 3. Attach only one Hive to each project, so when creating a project you
> should specify a Hive name; from it we can find the Hive client (including
> the hive command and config files).
> 4. When loading tables in a project, find that Hive's hive-site.xml and
> create a HiveClient using this config file.
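> A minimal sketch of the lookup behind steps 2-4: given the configured root
> directory and a project's hive name, resolve the hive-site.xml to build the
> HiveClient from. Class, property, and path names here are illustrative, not
> the ones in the patch:

```java
import java.io.File;

// Sketch: map a hive name to its client's hive-site.xml under a configured
// root directory (one subdirectory per additional hive).
public class HiveClientResolver {
    private final File clientRoot; // e.g. value of a kylin config property

    public HiveClientResolver(String clientRoot) {
        this.clientRoot = new File(clientRoot);
    }

    /** A null/empty hive name means the default hive Kylin started with. */
    public File hiveSiteFor(String hiveName) {
        if (hiveName == null || hiveName.isEmpty()) {
            // default hive: use the hive-site.xml already on the classpath
            return null;
        }
        // additional hive: <root>/<hiveName>/conf/hive-site.xml
        return new File(new File(clientRoot, hiveName), "conf/hive-site.xml");
    }
}
```

> For an additional hive, the returned file would then be added as a resource
> to a fresh Hive configuration when constructing the HiveClient, while the
> default hive keeps the configuration Kylin was started with.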
> 5. HCatInputFormat cannot be used as the input format in
> FactDistinctColumnsJob, so we change the job to take the intermediate Hive
> table's location as the input path and change FactDistinctColumnsMapper
> accordingly. HiveColumnCardinalityJob will fail if an additional Hive is
> used.
> 6. Because we need to run MR in one Hadoop cluster while the input or output
> is located on another HDFS, we set the input location to the real NameNode
> address instead of the nameservice (this is a config property too).
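> Concretely, the same intermediate table location could be addressed in two
> ways; the second form is usable even from a cluster whose hdfs-site.xml does
> not define the other cluster's nameservice (host names are illustrative):

```
# HA nameservice URI: resolvable only where hdfs-site.xml defines "cluster2"
hdfs://cluster2/kylin/intermediate_table

# real NameNode address: works without the remote nameservice config
hdfs://nn1.cluster2.example.com:8020/kylin/intermediate_table
```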
> That is all we do. I think it makes it easier to manage more than one Hive
> and Hadoop. We have applied it in our environment and it works well. I hope
> it can help other people.
> I will upload my patch later.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)