[
https://issues.apache.org/jira/browse/KYLIN-1172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15029537#comment-15029537
]
fengYu commented on KYLIN-1172:
-------------------------------
In the "Create Intermediate Flat Hive Table" step, Hive runs in local client
mode and does not need to pull data across clusters. However, the "Extract Fact
Table Distinct Columns" and "Build Base Cuboid Data" steps pull data from one
Hadoop cluster while running the job on another. In our test (two Hadoop
clusters in the same city but not in the same DC), with a cube of 5 normal
dimensions, 80 million source rows, and an intermediate table of 1.5 GB,
submitted to the default queue, each of those two jobs took nearly 2 minutes
(no obvious difference between them). This is just a reference point; many
things affect the speed, such as cluster location, network, load on the Hadoop
cluster, etc.
In our environment we always build cubes at night, so this may not be the most
important issue. To be prepared for the worst, we could add a step after
creating the intermediate table that copies it from the source cluster to the
computing cluster, so the table is read across clusters only once. But for now,
I think it is not necessary.
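The copy step described above could be sketched with `hadoop distcp`; the
cluster addresses and paths below are illustrative placeholders, not from any
actual setup:

```sh
# Copy the intermediate flat table from the source (Hive) cluster to the
# computing cluster so the MR jobs read it locally.
# Hostnames and paths are hypothetical.
hadoop distcp \
  hdfs://source-nn.example.com:8020/user/hive/warehouse/kylin_intermediate_table \
  hdfs://compute-nn.example.com:8020/kylin/intermediate/kylin_intermediate_table
```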
> kylin support multi-hive on different hadoop cluster
> ----------------------------------------------------
>
> Key: KYLIN-1172
> URL: https://issues.apache.org/jira/browse/KYLIN-1172
> Project: Kylin
> Issue Type: Improvement
> Affects Versions: v1.0
> Reporter: fengYu
>
> Hi, I recently modified Kylin to support multiple Hives on different Hadoop
> clusters and take them as input sources to Kylin. We did this for the
> following reasons:
> 1. We have more than one Hadoop cluster and many Hives depending on them (a
> product may have its own Hive). We cannot migrate those Hives into one and do
> not want to deploy one Kylin instance for every Hive source.
> 2. Our Hadoop clusters are deployed in different DCs, and we need to support
> them in one Kylin instance.
> 3. The source data in Hive is much smaller than the HFiles, so copying those
> files across DCs is more efficient (the fact distinct columns job and the
> base cuboid job take the data in Hive as input). So we deploy HBase and
> Hadoop in one DC (separated into different HDFS instances).
> So we divide the data flow into 3 parts: Hive is the input source, Hadoop
> does the computing (which generates many temporary files), and HBase is the
> output. After cube building, queries on Kylin only interact with HBase.
> Therefore, what we need to work out is how to build cubes based on different
> Hives and Hadoop clusters.
> Our method is summarized below:
> 1. Deploy the Hives and Hadoop clusters. Before starting Kylin, the user
> should deploy all Hives and Hadoop clusters, ensure Hive SQL can be run via
> ./hive, and ensure every HDFS is accessible with the 'hadoop fs' command (add
> the extra nameservices to hdfs-site.xml).
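> A minimal hdfs-site.xml fragment for reaching a second cluster's nameservice
> might look like the following; the nameservice and host names are
> hypothetical:
>
> ```xml
> <property>
>   <name>dfs.nameservices</name>
>   <value>default-ns,extra-ns</value>
> </property>
> <property>
>   <name>dfs.ha.namenodes.extra-ns</name>
>   <value>nn1,nn2</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.extra-ns.nn1</name>
>   <value>extra-nn1.example.com:8020</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.extra-ns.nn2</name>
>   <value>extra-nn2.example.com:8020</value>
> </property>
> <property>
>   <name>dfs.client.failover.proxy.provider.extra-ns</name>
>   <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
> </property>
> ```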
> 2. Divide the Hives into two parts: the Hive that is used when Kylin starts
> (we call it the default one), and the additional ones. We allocate a name for
> every Hive (the default one is null). For simplicity, we just add a config
> property that points to the root directory of all Hive clients, and every
> Hive client is a directory whose name is the Hive's name (the default one
> does not need to be located there).
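> As a sketch, the layout implied by such a root-directory property could look
> like the following (the property name and paths are hypothetical, not taken
> from the actual patch):
>
> ```properties
> # kylin.properties (hypothetical property name)
> kylin.hive.client.root.dir=/opt/hive-clients
> ```
>
> ```text
> /opt/hive-clients/
>   hive_dc1/    # Hive named "hive_dc1": bin/hive plus conf/hive-site.xml
>   hive_dc2/    # Hive named "hive_dc2"
> ```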
> 3. Attach only one Hive to a project, so when creating a project you should
> specify a Hive name, and from it we can find the Hive client (including the
> hive command and config files).
> 4. When loading a table in a project, find that Hive's hive-site.xml and
> create a HiveClient using this config file.
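> A rough Java sketch of creating a metastore client from a per-Hive
> hive-site.xml (the path is hypothetical, and this is an assumption about the
> approach, not the actual patch code):
>
> ```java
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.hive.conf.HiveConf;
> import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
>
> // Load the config of the additional Hive chosen for this project.
> HiveConf conf = new HiveConf();
> conf.addResource(new Path("/opt/hive-clients/hive_dc1/conf/hive-site.xml"));
>
> // Metastore client bound to that Hive instance.
> HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
> ```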
> 5. We cannot use HCatInputFormat as the input format in
> FactDistinctColumnsJob, so we change the job to take the intermediate Hive
> table's location as the input file and change FactDistinctColumnsMapper
> accordingly. HiveColumnCardinalityJob will fail if we use an additional Hive.
> 6. Because we need to run MR in one Hadoop cluster while the input or output
> is located on another HDFS, we set the input location to the real name node
> address instead of the nameservice (this is a config property too).
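> For example (hypothetical addresses), the input path changes from the logical
> nameservice form to a concrete namenode address:
>
> ```text
> hdfs://extra-ns/kylin/intermediate/table                     # nameservice (logical)
> hdfs://extra-nn1.example.com:8020/kylin/intermediate/table   # real namenode address
> ```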
> That is all we did. I think it makes it easier to manage more than one Hive
> and Hadoop cluster. We have applied it in our environment and it works well.
> I hope it can help other people...
> I will upload my patch later.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)