Re: [Request for comments] A client library to help automatic cube building/refreshing

hongbin ma Thu, 10 Dec 2015 19:06:07 -0800

Hi chunen

It would be great if you could open source your related work! I'm not sure
how you implemented kylin-tools (which sounds very capable), but I can
offer some "rules" that I think our client library should follow:

1. No direct access to the metadata store. The client library should only
communicate with the REST server for all requests. The reasons are
twofold. First, metadata store implementations might vary between Kylin
versions (whereas the REST APIs hardly change), and such changes should be
transparent to client libraries. Second, the REST server checks whether a
request is valid (in terms of input validity, operation permissions,
conflicts, etc.). Bypassing the REST server could leave Kylin's state
inconsistent.

2. Better not to rely on external dependencies like MySQL. The client
library is used by ordinary Kylin users, and they may not be able to ensure
that such dependencies are available. Code/library-level dependencies are
acceptable, however (if a client is shipped as a JAR, its dependency on
libraries like log4j is fine).

3. The client library does not have to be a JAR. As in your case, Python
scripts might be a good choice because they integrate easily with common
scheduling frameworks. In any case, the goal is a solution that is
easy to use and easy to maintain.
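
To make rule 1 concrete, here is a minimal Python sketch of a client that
prepares a cube build request for the REST server and never touches the
metadata store. The rebuild path, the JSON field names, and Basic
authentication follow my reading of the REST how-to and should be checked
against your Kylin version. The host, cube name, and credentials are
placeholders, and build_rebuild_request is a hypothetical helper name. It
uses only the Python standard library, in line with rule 2.

```python
import base64
import json
import urllib.request


def build_rebuild_request(base_url, cube_name, start_ms, end_ms, user, password):
    """Prepare (but do not send) a cube build request for the REST server."""
    body = json.dumps({
        "startTime": start_ms,   # segment start, epoch milliseconds
        "endTime": end_ms,       # segment end, epoch milliseconds
        "buildType": "BUILD",    # assumed value; verify for your Kylin version
    }).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/cubes/{cube_name}/rebuild",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Basic {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder host, cube name, and credentials.
# The time range is 2015-12-01 to 2016-01-01 (UTC) in epoch milliseconds.
req = build_rebuild_request(
    "http://localhost:7070/kylin/api", "my_cube",
    1448928000000, 1451606400000, "user", "secret")
```

Sending the request is a single urllib.request.urlopen(req) call. It is left
out so that the sketch runs without a live server.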

As for the feature list, the client library could cover a lot of
functions (basically everything you can do with the REST API). Based on
your description, we can categorize them into:

1. Job management, including job creation, job status check, job kill, *job
scheduling, job failover.*
2. Cube/Project management, including project create/delete, cube
create/delete/enable/disable, batch create, etc.
3. Metadata management, including hive import, cache flush, etc.
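
As a rough illustration of the "job status check" and "job failover" items
in function 1, the following Python sketch polls a job and resumes it a
limited number of times when it reports an error. The status names
(RUNNING, FINISHED, ERROR, DISCARDED) are assumptions about what the REST
server returns, and get_status and resume are hypothetical callables that a
real client would implement on top of the REST API.

```python
import time


def watch_job(get_status, resume, max_resumes=2, poll_seconds=60, sleep=time.sleep):
    """Poll a job until it ends; resume it at most max_resumes times on error.

    get_status() -> str and resume() -> None are hypothetical callables that
    would wrap the REST calls for reading a job's status and resuming it.
    Returns True if the job finished, False if it was discarded or kept failing.
    """
    resumes = 0
    while True:
        status = get_status()
        if status == "FINISHED":       # assumed status name
            return True
        if status == "DISCARDED":      # assumed status name
            return False
        if status == "ERROR":          # assumed status name
            if resumes >= max_resumes:
                return False           # give up; a human needs to look at it
            resume()
            resumes += 1
        sleep(poll_seconds)


# Dry run with canned statuses instead of a live server.
statuses = iter(["RUNNING", "ERROR", "RUNNING", "FINISHED"])
resumed = []
ok = watch_job(lambda: next(statuses), lambda: resumed.append(1),
               sleep=lambda seconds: None)
```

Passing the two callables in keeps the retry policy separate from the HTTP
details, so the same loop can be driven from crontab, oozie, or a test.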

Function 1 is clearly suitable for integration with existing scheduling
frameworks like crontab, but should we put functions 2 and 3 into the
client library? That seems like overkill to me for a lightweight client
library (or should we call it kylin-job-client to make the scope clear?).

My opinion is that job scheduling might be the most popular function of
the client library, because it can encapsulate complex logic that cannot be
replaced by one simple REST call, and it really reduces manual
intervention by users. The other functions, however, seem to be replaceable
by a simple REST call or a simple click in the web UI.
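
As an example of the kind of logic meant here, the monthly-segment idea from
the original proposal quoted below can be sketched in Python as follows.
This is only a sketch under simplifying assumptions: segments are aligned to
calendar months, the refresh window is the last N days including today, and
segments_to_refresh is a hypothetical function name.

```python
from datetime import date, timedelta


def month_start(d):
    """First day of the month containing d."""
    return d.replace(day=1)


def next_month_start(d):
    """First day of the month after the one containing d."""
    # Day 28 exists in every month; adding 4 days always lands in the next month.
    return (d.replace(day=28) + timedelta(days=4)).replace(day=1)


def segments_to_refresh(today, n_days):
    """Return [start, end) date ranges of the monthly segments that overlap
    the last n_days days, ending with today (inclusive)."""
    window_start = today - timedelta(days=n_days - 1)
    segments = []
    cursor = month_start(window_start)
    while cursor <= today:
        nxt = next_month_start(cursor)
        segments.append((cursor, nxt))
        cursor = nxt
    return segments


# With N=30 on 2015-12-10 the window starts on 2015-11-11, so the plan is
# the November and December 2015 segments.
plan = segments_to_refresh(date(2015, 12, 10), 30)
```

A real scheduler would also cap the newest segment at the latest available
data and decide between building and refreshing each segment. Both are left
out here.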

I'm looking forward to your input as well as others'.

On Thu, Dec 10, 2015 at 3:58 PM, nichunen <nichu...@mininglamp.com> wrote:

> Hi community,
>
> Indeed, we have also found that there is strong demand for such a library.
>
>
> Actually, we have already done much of the work Hongbin mentioned in some
> of our clients' Kylin environments. We call the project kylin-tools, and
> it is fully based on Kylin's REST APIs.
>
> We created Python scripts capable of doing most of the work that Kylin's
> webapp can do.
>
>
> The scripts define many command line options and can be integrated
> with crontab, oozie, or other scheduling systems. The functionality is
> listed below:
> 1. Simple cube definition. It reads a simple cube definition from a CSV
> file and converts it into JSON data, so users can easily design their
> cubes and store them in a file.
> 2. Project create/delete, hive table synchronization, cache wipe
> 3. Cube batch create, build, enable/disable, delete
> 4. Automatic job status checks, with support for simple job failover
> 5. Cube information stored in mysql
> 6. Command lines to run kylin tasks
>
>
> But this tool was developed for the custom requirements of our clients,
> and it definitely needs cleanup and further development.
>
> Again, we'd like to contribute as much as we can to the community, but
> first we think we should discuss the features and design further.
>
> Can you give us some advice and tell us your requirements? If we can
> implement them, we would be pleased to make it open source.
>
> ------------------------------
>
> Best Regards,
>
>
>
> George/倪春恩
>
> Software Engineer/软件工程师
>
> Mobile:+86-13501723787| Fax:+8610-56842040
>
> 北京明略软件系统有限公司（www.mininglamp.com）
>
> 北京市昌平区东小口镇中东路398号中煤建设集团大厦1号楼4层
>
> F4,1#,Zhongmei Construction Group Plaza,398# Zhongdong Road,Changping
> District,Beijing,102218
>
>
> ----------------------------------------------------------------------------------------------------------------------------
>
>
>
> *From:* hongbin ma <mahong...@apache.org>
> *Date:* 2015-12-10 13:55
> *To:* dev <dev@kylin.apache.org>
> *Subject:* [Request for comments] A client library to help automatic cube
> building/refreshing
> Currently most users create/build/refresh cubes via our website by manual
> clicks. Some advanced users might know that Kylin provides REST APIs to
> operate on cubes
> (http://kylin.apache.org/docs/howto/howto_build_cube_with_restapi.html). It
> is our design intent to leave cubing job scheduling to the client
> side. We don't like the idea of integrating complex cubing job scheduling
> because it might complicate the server side and frontend a lot.
>
> Yet we've seen a lot of users who need to refresh the cube every day.
> Some even need to update the latest N days' data every day. Let's put
> aside how troublesome it is to click the build/refresh button every day,
> or how much effort users need to learn to use the Kylin REST API
> programmatically. Without experienced guidance, users tend to add a new
> segment as well as refresh the last N segments EVERY DAY, which is
> extremely inefficient and hurts query performance.
>
> A more sophisticated solution for such cases would be to organize the cube
> segments by week/month/quarter (depending on how big N is; if N is less
> than 30, organizing by month is usually optimal). Let's say N=30: then
> each time only the latest 2 segments need refreshing, which is much
> cheaper than the naive solution.
>
> However, it's not trivial for every Kylin user to implement such a
> scheduling algorithm on the client side, not to mention the error handling
> logic, failover, etc. A more practical solution is for us to provide a
> scheduler library (which can be treated as a child project of Kylin), so
> that the user only needs to configure basic information like cube name,
> user credentials, refresh frequency, N days to back refresh, etc. The
> client library will take over the remaining dirty work.
>
> I'm posting the idea here to get the community's opinion on it. Since
> it's a really independent task (it's actually an independent project),
> volunteers are very welcome to fully take charge of it.
>
> 
>
> --
> Regards,
>
> *Bin Mahone | 马洪宾*
> Apache Kylin: http://kylin.io
> Github: https://github.com/binmahone
>
>

-- 
Regards,

*Bin Mahone | 马洪宾*
Apache Kylin: http://kylin.io
Github: https://github.com/binmahone
