[
https://issues.apache.org/jira/browse/KYLIN-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
fengYu updated KYLIN-1172:
--------------------------
Attachment: 0001-support-more-hives-depend-on-different-hadoop.patch
> kylin support multi-hive on different hadoop cluster
> ----------------------------------------------------
>
> Key: KYLIN-1172
> URL: https://issues.apache.org/jira/browse/KYLIN-1172
> Project: Kylin
> Issue Type: Improvement
> Affects Versions: v1.0
> Reporter: fengYu
> Attachments: 0001-support-more-hives-depend-on-different-hadoop.patch
>
>
> Hi, I recently modified Kylin to support multiple Hives on different Hadoop
> clusters and take them as input sources to Kylin. We did this for the
> following reasons:
> 1. We have more than one Hadoop cluster and many Hives that depend on them
> (each product may have its own Hive). We cannot migrate those Hives into
> one, and we do not want to deploy a separate Kylin for every Hive source.
> 2. Our Hadoop clusters are deployed in different DCs, and we need to support
> them in one Kylin instance.
> 3. The source data in Hive is much smaller than the HFiles, so copying those
> files across DCs is more efficient (the fact distinct columns job and the
> base cuboid job take Hive data as input). We therefore deploy HBase and
> Hadoop in one DC (separated into different HDFS instances).
> So, we divide the data flow into 3 parts: Hive is the input source, Hadoop
> does the computing (which generates many temporary files), and HBase is the
> output. After cube building, queries on Kylin only interact with HBase.
> Therefore, what we need to do is build cubes based on different Hives and
> Hadoops.
> Our method is summarized below:
> 1. Deploy the Hives and Hadoops. Before starting Kylin, the user should
> deploy all Hives and Hadoops, ensure Hive SQL can be run via ./hive, and
> ensure every HDFS is accessible with the 'hadoop fs' command (add the extra
> nameservices to hdfs-site.xml).
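> For example, making a second cluster's HDFS reachable from one client
> typically means listing both nameservices in hdfs-site.xml; the cluster and
> host names below are illustrative:

```xml
<!-- hdfs-site.xml: expose two nameservices to one client (illustrative names) -->
<property>
  <name>dfs.nameservices</name>
  <value>cluster1,cluster2</value>
</property>
<!-- HA NameNodes of the remote cluster -->
<property>
  <name>dfs.ha.namenodes.cluster2</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.cluster2.nn1</name>
  <value>nn1.cluster2.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.cluster2.nn2</name>
  <value>nn2.cluster2.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.cluster2</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

> With this in place, 'hadoop fs -ls hdfs://cluster2/' works from the same
> client that normally talks to cluster1.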
> 2. Divide the Hives into two parts: the Hive used when Kylin starts (we call
> it the default one) and the additional ones. We allocate a name for every
> Hive (the default one's name is null). For simplicity, we add a config
> property that points to the root directory of all Hive clients, and every
> Hive client is a subdirectory whose name is the Hive's name (the default one
> does not need to be located there).
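> As a sketch, the layout this step assumes looks like the following; the
> property name and paths are hypothetical, not necessarily the ones in the
> patch:

```
# kylin.properties (hypothetical property name)
kylin.hive.client.root=/opt/hive-clients

# one subdirectory per additional hive, named after the hive:
/opt/hive-clients/
├── dc2-hive/        # hive named "dc2-hive"
│   ├── bin/hive
│   └── conf/hive-site.xml
└── product-a/       # hive named "product-a"
    ├── bin/hive
    └── conf/hive-site.xml

# the default hive (name null) is the one Kylin itself was started with,
# so it does not need a subdirectory here
```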
> 3. Attach only one Hive to each project, so when creating a project you
> should specify a Hive name; from it we can find the Hive client (including
> the hive command and config files).
> 4. When loading tables in a project, find that Hive's hive-site.xml and
> create a HiveClient using this config file.
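> A minimal sketch of the lookup behind steps 2-4: given the configured root
> directory and a project's hive name, resolve the hive-site.xml to build the
> HiveClient from. Class, property, and path names here are illustrative, not
> the ones in the patch:

```java
import java.io.File;

// Sketch: map a hive name to its client's hive-site.xml under a configured
// root directory (one subdirectory per additional hive).
public class HiveClientResolver {
    private final File clientRoot; // e.g. value of a kylin config property

    public HiveClientResolver(String clientRoot) {
        this.clientRoot = new File(clientRoot);
    }

    /** A null/empty hive name means the default hive Kylin started with. */
    public File hiveSiteFor(String hiveName) {
        if (hiveName == null || hiveName.isEmpty()) {
            // default hive: use the hive-site.xml already on the classpath
            return null;
        }
        // additional hive: <root>/<hiveName>/conf/hive-site.xml
        return new File(new File(clientRoot, hiveName), "conf/hive-site.xml");
    }
}
```

> For an additional hive, the returned file would then be added as a resource
> to a fresh Hive configuration when constructing the HiveClient, while the
> default hive keeps the configuration Kylin was started with.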
> 5. HCatInputFormat cannot be used as the input format in
> FactDistinctColumnsJob, so we change the job to take the intermediate Hive
> table's location as the input path and change FactDistinctColumnsMapper
> accordingly. HiveColumnCardinalityJob will fail if an additional Hive is
> used.
> 6. Because we need to run MR in one Hadoop cluster while the input or output
> is located on another HDFS, we set the input location to the real NameNode
> address instead of the nameservice (this is a config property too).
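> Concretely, the same intermediate table location could be addressed in two
> ways; the second form is usable even from a cluster whose hdfs-site.xml does
> not define the other cluster's nameservice (host names are illustrative):

```
# HA nameservice URI: resolvable only where hdfs-site.xml defines "cluster2"
hdfs://cluster2/kylin/intermediate_table

# real NameNode address: works without the remote nameservice config
hdfs://nn1.cluster2.example.com:8020/kylin/intermediate_table
```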
> That is all we do. I think it makes it easier to manage more than one Hive
> and Hadoop. We have applied it in our environment and it works well. I hope
> it can help other people.
> I will upload my patch later.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)