[DISCUSS] Not sending Github PR notifications to dev@kylin

2018-10-01 Thread ShaoFeng Shi
Hello, Kylin dev subscribers,

Recently I received several complaints that many emails are being sent to
dev@kylin.apache.org from github.com pull requests, since we enabled the
Gitbox service for Kylin.

Today most patches and code reviews are performed on GitHub. Each pull
request action (even adding a comment) emits an email to dev@kylin,
instead of notifying the individual contributor or reviewer. This generates a
lot of noise and buries the emails written by people.

Now I plan to change the Gitbox email notification rules: remove dev@kylin
and notify the author and reviewer instead, as follows:


For GitHub issues, please notify iss...@kylin.apache.org;
For GitHub PRs, please notify the author, the reviewer, and
iss...@kylin.apache.org.

The related JIRA to Apache Infra is
https://issues.apache.org/jira/browse/INFRA-17073

Please +1 if you agree with the new rule, or -1 if you want to keep things as
they are today. If there is no objection, we will move forward with the new
rule.

-- 
Best regards,

Shaofeng Shi 史少锋


Re: [DISCUSS] Columnar storage engine for Apache Kylin

2018-10-01 Thread ShaoFeng Shi
Hi Billy,

Yes, cloud storage should be considered. The traditional file layouts on
HDFS may not work well on cloud storage, so Kylin needs to allow for
extension here. I will add this to the requirements.

Billy Liu wrote on Saturday, September 29, 2018 at 3:22 PM:

> Hi Shaofeng,
>
> I'd like to add one more characteristic: cloud-native storage support.
> Quite a few users are using S3 on AWS or Azure Data Lake Storage on
> Azure. If the new storage engine could be more cloud-friendly, more users
> could benefit from it.
>
> With Warm regards
>
> Billy Liu
> ShaoFeng Shi wrote on Friday, September 28, 2018 at 2:15 PM:
> >
> > Hi Kylin developers.
> >
> > HBase has been Kylin's storage engine since the first day; Kylin on HBase
> > has proven to be a success, supporting low-latency and high-concurrency
> > queries at a very large data scale. Thanks to HBase, most Kylin users get
> > query responses of less than one second on average.
> >
> > But we also see some limitations when putting Cubes into HBase; I shared
> > some of them at HBaseCon Asia 2018[1] this August. The typical
> > limitations include:
> >
> >- Rowkey is the primary index, no secondary index so far;
> >
> > Filtering by the row key's prefix versus its suffix can yield very
> > different performance. So the user needs to design the row key well;
> > otherwise, queries can be slow. This is sometimes difficult because the
> > user might not predict the filtering patterns ahead of cube design (a
> > sketch follows this list).
> >
> >    - HBase is a key-value store, not a columnar storage
> >
> > Kylin combines multiple measures (columns) into fewer column families to
> > reduce data size (the row key overhead is significant). This causes HBase
> > to often read more data than requested.
> >
> >    - HBase cannot run on YARN
> >
> > This makes the deployment and auto-scaling a little complicated,
> > especially in the cloud.
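> >
> > To make the first limitation concrete, here is a minimal, hypothetical
> > sketch against the HBase client API (the table name and row-key values
> > are invented for illustration, not Kylin's real schema): a prefix
> > condition maps to a narrow key-range scan, while a suffix condition
> > degenerates into a full table scan with a server-side filter.
> >
> >   import java.io.IOException;
> >   import org.apache.hadoop.hbase.TableName;
> >   import org.apache.hadoop.hbase.client.Connection;
> >   import org.apache.hadoop.hbase.client.ConnectionFactory;
> >   import org.apache.hadoop.hbase.client.Scan;
> >   import org.apache.hadoop.hbase.client.Table;
> >   import org.apache.hadoop.hbase.filter.CompareFilter;
> >   import org.apache.hadoop.hbase.filter.RegexStringComparator;
> >   import org.apache.hadoop.hbase.filter.RowFilter;
> >   import org.apache.hadoop.hbase.util.Bytes;
> >
> >   public class RowKeyScanSketch {
> >     public static void main(String[] args) throws IOException {
> >       try (Connection conn = ConnectionFactory.createConnection();
> >            Table cube = conn.getTable(TableName.valueOf("kylin_cube_demo"))) {
> >         // Prefix condition: HBase seeks straight to the first matching
> >         // key and stops at the end of the range -- cheap.
> >         Scan byPrefix = new Scan().setRowPrefixFilter(Bytes.toBytes("20181001"));
> >         cube.getScanner(byPrefix).close();
> >
> >         // Suffix condition: no index helps; every row is read and the
> >         // regex is evaluated server-side -- a full table scan.
> >         Scan bySuffix = new Scan().setFilter(new RowFilter(
> >             CompareFilter.CompareOp.EQUAL, new RegexStringComparator(".*_CN$")));
> >         cube.getScanner(bySuffix).close();
> >       }
> >     }
> >   }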
> >
> > In short, HBase is a complicated storage layer for Kylin; its maintenance
> > and debugging are also hard for ordinary developers. Now we're planning to
> > seek a simple, lightweight, read-only storage engine for Kylin. The new
> > solution should have the following characteristics:
> >
> >- Columnar layout with compression for efficient I/O;
> >- Index by each column for quick filtering and seeking;
> >- MapReduce / Spark API for parallel processing;
> >- HDFS compliant for scalability and availability;
> >- Mature, stable and extensible;
> >
> > With the plugin architecture[2] introduced in Kylin 1.5, adding multiple
> > storages to Kylin is possible. Some companies, like Kyligence Inc. and
> > Meituan.com, have developed customized storage engines for Kylin in their
> > products or platforms. In their experience, columnar storage is a good
> > supplement to the HBase engine. Kaisen Kang from Meituan.com shared their
> > KOD (Kylin on Druid) solution[3] at this August's Kylin meetup in
> > Beijing.
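> >
> > As a toy illustration of the idea (every type name below is invented for
> > this sketch; Kylin's real plugin interfaces live in
> > org.apache.kylin.storage and differ in detail), storage is just a
> > contract that the query engine binds at runtime, so an HBase engine and
> > a columnar engine can be swapped via configuration:
> >
> >   // A minimal, made-up storage contract: one factory interface plus
> >   // one query interface that scans a cuboid under a filter.
> >   interface StorageQuery {
> >     Iterable<Object[]> scan(long cuboidId, String filterExpr);
> >   }
> >
> >   interface StorageEngine {
> >     StorageQuery createQuery(String cubeName);
> >   }
> >
> >   class ColumnarStorageEngine implements StorageEngine {
> >     public StorageQuery createQuery(String cubeName) {
> >       // A real engine would plan scans over columnar files here.
> >       return (cuboidId, filterExpr) -> java.util.Collections.emptyList();
> >     }
> >   }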
> >
> > We plan to do a PoC with Apache Parquet + Apache Spark in the next phase.
> > Parquet is a standard columnar file format and has been widely supported
> > by many projects like Hive, Impala, Drill, etc. Parquet is adding a
> > page-level column index to support fine-grained filtering. Apache Spark
> > can provide the parallel computing over Parquet and can be deployed on
> > YARN, Mesos, and Kubernetes. With this combination, data persistence and
> > computation are separated, which makes scaling in/out much easier than
> > before. Benefiting from Spark's flexibility, we can push down more
> > computation from Kylin to the Hadoop cluster. Besides Parquet, Apache ORC
> > is also a candidate.
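> >
> > A minimal sketch (file path and column names are made-up placeholders) of
> > the kind of scan the PoC relies on: Spark reads only the referenced
> > columns from Parquet, and the filter is pushed down so that row groups
> > whose min/max statistics exclude it are skipped entirely.
> >
> >   import org.apache.spark.sql.Dataset;
> >   import org.apache.spark.sql.Row;
> >   import org.apache.spark.sql.SparkSession;
> >
> >   public class ParquetScanSketch {
> >     public static void main(String[] args) {
> >       SparkSession spark = SparkSession.builder()
> >           .appName("kylin-parquet-poc")   // submit with --master yarn, etc.
> >           .getOrCreate();
> >
> >       Dataset<Row> cuboid =
> >           spark.read().parquet("hdfs:///kylin/demo_cube/cuboid");
> >
> >       // Column pruning: only the SELLER_ID, GMV and PART_DT columns are
> >       // read from disk; predicate pushdown evaluates the PART_DT filter
> >       // against Parquet statistics before rows are materialized.
> >       cuboid.filter("PART_DT = '2018-10-01'")
> >             .groupBy("SELLER_ID")
> >             .sum("GMV")
> >             .show();
> >
> >       spark.stop();
> >     }
> >   }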
> >
> > Now I raise this discussion to collect your ideas about Kylin's
> > next-generation storage engine. If you have good ideas or any related
> > data, you are welcome to discuss them in the community.
> >
> > Thank you!
> >
> > [1] Apache Kylin on HBase
> > https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
> > [2] Apache Kylin Plugin Architecture
> > https://kylin.apache.org/development/plugin_arch.html
> > [3] Kylin storage engine practice based on Druid (基于Druid的Kylin存储引擎实践)
> > https://blog.bcmeng.com/post/kylin-on-druid.html
> > --
> > Best regards,
> >
> > Shaofeng Shi 史少锋
>


-- 
Best regards,

Shaofeng Shi 史少锋


[jira] [Created] (KYLIN-3607) can't build cube with spark in v2.5.0

2018-10-01 Thread ANIL KUMAR (JIRA)
ANIL KUMAR created KYLIN-3607:
-

 Summary: can't build cube with spark in v2.5.0
 Key: KYLIN-3607
 URL: https://issues.apache.org/jira/browse/KYLIN-3607
 Project: Kylin
  Issue Type: Bug
Reporter: ANIL KUMAR


In Kylin v2.5.0, the cube cannot be built at step 8, "Convert Cuboid Data to
HFile"; the following is the related exception:

 

ERROR yarn.ApplicationMaster: User class threw exception: 
java.lang.RuntimeException: error execute 
org.apache.kylin.storage.hbase.steps.SparkCubeHFile. Root cause: Job aborted 
due to stage failure: Task 0 in stage 1.0 failed 4 times, 
java.lang.ExceptionInInitializerError
 at 
org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.getNewWriter(HFileOutputFormat2.java:247)
 at 
org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:194)
 at 
org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:152)
 at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
 at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
 at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
 at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1353)
 at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
 at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:99)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:325)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.RuntimeException: Could not create interface 
org.apache.hadoop.hbase.regionserver.MetricsRegionServerSourceFactory Is the 
hadoop compatibility jar on the classpath?
 at 
org.apache.hadoop.hbase.CompatibilitySingletonFactory.getInstance(CompatibilitySingletonFactory.java:73)
 at org.apache.hadoop.hbase.io.MetricsIO.<init>(MetricsIO.java:31)
 at org.apache.hadoop.hbase.io.hfile.HFile.<clinit>(HFile.java:192)
 ... 15 more
Caused by: java.util.NoSuchElementException
 at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:365)
 at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
 at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
 at 
org.apache.hadoop.hbase.CompatibilitySingletonFactory.getInstance(CompatibilitySingletonFactory.java:59)
 ... 17 more
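
The root cause ("Is the hadoop compatibility jar on the classpath?") suggests
the HBase hadoop-compat jars are missing from the Spark executor classpath. A
possible workaround (untested here; the jar paths below are placeholders) is
to append them through the Spark properties that Kylin forwards to
spark-submit, e.g. in kylin.properties:

kylin.engine.spark-conf.spark.executor.extraClassPath=/path/to/hbase-hadoop-compat.jar:/path/to/hbase-hadoop2-compat.jar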



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)