Hi guys, I have uploaded the initial design document to JIRA. Please feel free to comment:
https://issues.apache.org/jira/browse/KYLIN-3621

On Fri, Oct 12, 2018 at 9:44 AM ShaoFeng Shi <shaofeng...@apache.org> wrote:

> JIRA and sub-tasks are created for this. Welcome to comment there:
> https://issues.apache.org/jira/browse/KYLIN-3621
>
> On Mon, Oct 8, 2018 at 2:45 PM ShaoFeng Shi <shaofeng...@apache.org> wrote:
>
>> I agree; the new storage should be Hadoop/HDFS compliant, and also needs
>> to be cloud storage (like S3, blob storage) friendly, as more and more
>> users are running big data analytics in the cloud.
>>
>> On Sun, Oct 7, 2018 at 7:44 PM Luke Han <luke...@gmail.com> wrote:
>>
>>> It makes sense to bring a better storage option for Kylin.
>>>
>>> The option should be open, and people could have different ways to
>>> create an adaptor for the underlying storage.
>>> Considering that the large existing Kylin deployments all run on
>>> Hadoop/HDFS, I prefer Parquet, ORC, or another HDFS-compatible option
>>> at this time. It will be easy for people to upgrade to the next
>>> generation and keep consistency.
>>>
>>> Looking forward to this feature being rolled out soon.
>>>
>>> Thanks.
>>>
>>> Best Regards!
>>> ---------------------
>>>
>>> Luke Han
>>>
>>> On Wed, Oct 3, 2018 at 2:37 PM Li Yang <liy...@apache.org> wrote:
>>>
>>> > Love this discussion. Like to highlight 3 major roles HBase is playing
>>> > currently, so we don't miss any of them when looking for a replacement.
>>> >
>>> > 1) Storage: A high speed big data storage
>>> > 2) Cache: A distributed storage cache layer (was BlockCache)
>>> > 3) MPP: A distributed computation framework (was Coprocessor)
>>> >
>>> > The "Storage" seems to be at the center of the discussion. Be it
>>> > Parquet, ORC, or a new file format, to me the standard interface is
>>> > most important. As long as we have consensus on the access interface,
>>> > like MapReduce / Spark Dataset, then the rest of the debate can be
>>> > easily resolved by a fair benchmark.
>>> > Also it allows people with different preferences to keep their own
>>> > implementation under the standard interface, without impacting the
>>> > rest of Kylin.
>>> >
>>> > The "Cache" and the "MPP" were more or less overlooked. I suggest we
>>> > pay more attention to them. Apart from Spark and Alluxio, any other
>>> > alternatives? Actually Druid is a well-rounded choice as, like HBase,
>>> > it covers all the 3 roles pretty well.
>>> >
>>> > In general, I prefer to choose from the state of the art instead of
>>> > re-inventing. Indeed, Kylin is not a storage project. A new storage
>>> > format is not Kylin's mission. Any storage innovations we come across
>>> > here would be more beneficial if contributed to the Parquet or ORC
>>> > community.
>>> >
>>> > Regards
>>> > Yang
>>> >
>>> > On Tue, Oct 2, 2018 at 11:20 AM ShaoFeng Shi <shaofeng...@apache.org>
>>> > wrote:
>>> >
>>> > > Hi Billy,
>>> > >
>>> > > Yes, the cloud storage should be considered. The traditional file
>>> > > layouts on HDFS may not work well on cloud storage. Kylin needs to
>>> > > allow extension here. I will add this to the requirements.
>>> > >
>>> > > On Sat, Sep 29, 2018 at 3:22 PM Billy Liu <billy...@apache.org> wrote:
>>> > >
>>> > > > Hi Shaofeng,
>>> > > >
>>> > > > I'd like to add one more characteristic: cloud-native storage
>>> > > > support. Quite a few users are using S3 on AWS, or Azure Data Lake
>>> > > > Storage on Azure. If the new storage engine could be more cloud
>>> > > > friendly, more users could benefit from it.
>>> > > >
>>> > > > With warm regards
>>> > > >
>>> > > > Billy Liu
>>> > > >
>>> > > > On Fri, Sep 28, 2018 at 2:15 PM ShaoFeng Shi <shaofeng...@apache.org> wrote:
>>> > > > >
>>> > > > > Hi Kylin developers.
>>> > > > >
>>> > > > > HBase has been Kylin’s storage engine since the first day; Kylin
>>> > > > > on HBase has been verified as a success which can support low
>>> > > > > latency & high concurrency queries on a very large data scale.
>>> > > > > Thanks to HBase, most Kylin users get query responses of less
>>> > > > > than 1 second on average.
>>> > > > >
>>> > > > > But we also see some limitations when putting Cubes into HBase; I
>>> > > > > shared some of them at HBaseConf Asia 2018[1] this August. The
>>> > > > > typical limitations include:
>>> > > > >
>>> > > > > - Rowkey is the primary index, no secondary index so far;
>>> > > > >
>>> > > > > Filtering by the row key’s prefix and by its suffix can give very
>>> > > > > different performance results. So the user needs to design the row
>>> > > > > key well; otherwise, the query would be slow. This is difficult
>>> > > > > sometimes because the user might not be able to predict the
>>> > > > > filtering patterns ahead of cube design.
>>> > > > >
>>> > > > > - HBase is a key-value store rather than a columnar storage
>>> > > > >
>>> > > > > Kylin combines multiple measures (columns) into fewer column
>>> > > > > families for smaller data size (the row key size is significant).
>>> > > > > This often causes HBase to read more data than requested.
>>> > > > >
>>> > > > > - HBase couldn't run on YARN
>>> > > > >
>>> > > > > This makes the deployment and auto-scaling a little complicated,
>>> > > > > especially in the cloud.
>>> > > > >
>>> > > > > In short, HBase is complicated as Kylin’s storage. Maintenance and
>>> > > > > debugging are also hard for normal developers. Now we’re planning
>>> > > > > to seek a simple, lightweight, read-only storage engine for Kylin.
>>> > > > > The new solution should have the following characteristics:
>>> > > > >
>>> > > > > - Columnar layout with compression for efficient I/O;
>>> > > > > - Index by each column for quick filtering and seeking;
>>> > > > > - MapReduce / Spark API for parallel processing;
>>> > > > > - HDFS compliant for scalability and availability;
>>> > > > > - Mature, stable and extensible;
>>> > > > >
>>> > > > > With the plugin architecture[2] introduced in Kylin 1.5, adding
>>> > > > > multiple storages to Kylin is possible. Some companies, like
>>> > > > > Kyligence Inc and Meituan.com, have developed their customized
>>> > > > > storage engines for Kylin in their product or platform. In their
>>> > > > > experience, columnar storage is a good supplement for the HBase
>>> > > > > engine. Kaisen Kang from Meituan.com has shared their KOD (Kylin
>>> > > > > on Druid) solution[3] in this August’s Kylin meetup in Beijing.
>>> > > > >
>>> > > > > We plan to do a PoC with Apache Parquet + Apache Spark in the next
>>> > > > > phase. Parquet is a standard columnar file format and has been
>>> > > > > widely supported by many projects like Hive, Impala, Drill, etc.
>>> > > > > Parquet is adding the page level column index to support
>>> > > > > fine-grained filtering. Apache Spark can provide the parallel
>>> > > > > computing over Parquet and can be deployed on YARN/Mesos and
>>> > > > > Kubernetes. With this combination, the data persistence and
>>> > > > > computation are separated, which makes scaling in/out much easier
>>> > > > > than before. Benefiting from Spark's flexibility, we can also push
>>> > > > > down more computation from Kylin to the Hadoop cluster. Besides
>>> > > > > Parquet, Apache ORC is also a candidate.
>>> > > > >
>>> > > > > Now I raise this discussion to get your ideas about Kylin’s
>>> > > > > next-generation storage engine. If you have good ideas or any
>>> > > > > related data, you are welcome to discuss them in the community.
>>> > > > >
>>> > > > > Thank you!
>>> > > > >
>>> > > > > [1] Apache Kylin on HBase
>>> > > > > https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
>>> > > > > [2] Apache Kylin Plugin Architecture
>>> > > > > https://kylin.apache.org/development/plugin_arch.html
>>> > > > > [3] Druid-based Kylin storage engine in practice (基于Druid的Kylin存储引擎实践)
>>> > > > > https://blog.bcmeng.com/post/kylin-on-druid.html
>>> > > > >
>>> > > > > --
>>> > > > > Best regards,
>>> > > > >
>>> > > > > Shaofeng Shi 史少锋
>>> > >
>>> > > --
>>> > > Best regards,
>>> > >
>>> > > Shaofeng Shi 史少锋
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋

--
Best regards,

Shaofeng Shi 史少锋
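
P.S. For anyone new to the row key limitation mentioned in the original proposal quoted above, here is a minimal sketch of why filtering on the leading part of a sorted composite key is cheap while filtering on a trailing part is not. This is plain Python, not HBase or Kylin code. The "date|country" key layout, the function names, and the sample data are all assumptions made up for illustration.

```python
from bisect import bisect_left


def build_rows(leading_values, trailing_values):
    """Return sorted composite keys "leading|trailing", like a sorted key-value store."""
    return sorted(f"{a}|{b}" for a in leading_values for b in trailing_values)


def scan_by_prefix(rows, prefix):
    """Filter on the leading dimension: binary-search to one contiguous key range.

    Returns (matching keys, number of rows read from the range).
    Assumes keys contain no character at or above U+FFFF.
    """
    start = bisect_left(rows, prefix)
    stop = bisect_left(rows, prefix + "\uffff")
    return rows[start:stop], stop - start


def scan_by_suffix(rows, suffix):
    """Filter on the trailing dimension: key order does not help, so read every row.

    Returns (matching keys, number of rows read).
    """
    matches = [key for key in rows if key.endswith(suffix)]
    return matches, len(rows)


if __name__ == "__main__":
    # Hypothetical cube with 30 dates x 50 countries = 1500 rows.
    dates = [f"2018-09-{day:02d}" for day in range(1, 31)]
    countries = [f"C{n:02d}" for n in range(50)]
    rows = build_rows(dates, countries)

    _, read_prefix = scan_by_prefix(rows, "2018-09-28|")
    _, read_suffix = scan_by_suffix(rows, "|C07")
    print("rows read, filter on date (prefix):", read_prefix)
    print("rows read, filter on country (suffix):", read_suffix)
```

With this layout a filter on date reads only the rows for that date, while a filter on country has to read the whole table. That is the reason the row key order must match the expected filter patterns when there is no secondary index.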