Re: [DISCUSS] Columnar storage engine for Apache Kylin

ShaoFeng Shi Tue, 16 Oct 2018 00:59:10 -0700

Hi guys,

I uploaded the initial design document to JIRA; please feel free to comment:

https://issues.apache.org/jira/browse/KYLIN-3621

ShaoFeng Shi <shaofeng...@apache.org> wrote on Fri, Oct 12, 2018 at 9:44 AM:

> A JIRA issue and sub-tasks have been created for this. You are welcome to
> comment there:
> https://issues.apache.org/jira/browse/KYLIN-3621
>
> ShaoFeng Shi <shaofeng...@apache.org> wrote on Mon, Oct 8, 2018 at 2:45 PM:
>
>> I agree; the new storage should be Hadoop/HDFS compliant, and it also needs
>> to be friendly to cloud storage (like S3 or blob storage), as more and more
>> users are running big data analytics in the cloud.
>>
>> Luke Han <luke...@gmail.com> wrote on Sun, Oct 7, 2018 at 7:44 PM:
>>
>>> It makes sense to bring a better storage option for Kylin.
>>>
>>> The option should be open, so that people can create adaptors for the
>>> underlying storage in different ways.
>>> Considering that Kylin's many adopters today all run on Hadoop/HDFS, I
>>> prefer Parquet, ORC, or another HDFS-compatible option at this time. It
>>> will be easy for people to upgrade to the next generation and keep
>>> consistency.
>>>
>>> Looking forward to this feature being rolled out soon.
>>>
>>> Thanks.
>>>
>>>
>>>
>>> Best Regards!
>>> ---------------------
>>>
>>> Luke Han
>>>
>>>
>>> On Wed, Oct 3, 2018 at 2:37 PM Li Yang <liy...@apache.org> wrote:
>>>
>>> > Love this discussion. I'd like to highlight the 3 major roles HBase is
>>> > currently playing, so we don't miss any of them when looking for a
>>> > replacement.
>>> >
>>> > 1) Storage: A high speed big data storage
>>> > 2) Cache: A distributed storage cache layer (was BlockCache)
>>> > 3) MPP: A distributed computation framework (was Coprocessor)
>>> >
>>> > "Storage" seems to be at the center of the discussion. Be it Parquet,
>>> > ORC, or a new file format, to me the standard interface is most
>>> > important. As long as we have consensus on the access interface, like
>>> > MapReduce / Spark Dataset, the rest of the debate can be easily resolved
>>> > by a fair benchmark. It also allows people with different preferences to
>>> > keep their own implementation under the standard interface without
>>> > impacting the rest of Kylin.
>>> >
>>> > The "Cache" and the "MPP" were more or less overlooked. I suggest we pay
>>> > more attention to them. Apart from Spark and Alluxio, are there any other
>>> > alternatives? Actually, Druid is a well-rounded choice because, like
>>> > HBase, it covers all 3 roles pretty well.
>>> >
>>> > In general, I prefer to choose from the state of the art instead of
>>> > re-inventing. Indeed, Kylin is not a storage project. A new storage
>>> > format is not Kylin's mission. Any storage innovations we come across
>>> > here would be more beneficial if contributed to the Parquet or ORC
>>> > community.
>>> >
>>> > Regards
>>> > Yang
>>> >
>>> >
>>> >
>>> > On Tue, Oct 2, 2018 at 11:20 AM ShaoFeng Shi <shaofeng...@apache.org>
>>> > wrote:
>>> >
>>> > > Hi Billy,
>>> > >
>>> > > Yes, cloud storage should be considered. The traditional file layouts
>>> > > on HDFS may not work well on cloud storage. Kylin needs to allow
>>> > > extension here. I will add this to the requirements.
>>> > >
>>> > > Billy Liu <billy...@apache.org> wrote on Sat, Sep 29, 2018 at 3:22 PM:
>>> > >
>>> > > > Hi Shaofeng,
>>> > > >
>>> > > > I'd like to add one more characteristic: cloud-native storage support.
>>> > > > Quite a few users are using S3 on AWS, or Azure Data Lake Storage on
>>> > > > Azure. If the new storage engine were more cloud friendly, more users
>>> > > > could benefit from it.
>>> > > >
>>> > > > With Warm regards
>>> > > >
>>> > > > Billy Liu
>>> > > > ShaoFeng Shi <shaofeng...@apache.org> wrote on Fri, Sep 28, 2018 at 2:15 PM:
>>> > > > >
>>> > > > > Hi Kylin developers.
>>> > > > >
>>> > > > > HBase has been Kylin’s storage engine since the first day; Kylin on
>>> > > > > HBase has proven successful at supporting low-latency,
>>> > > > > high-concurrency queries at very large data scale. Thanks to HBase,
>>> > > > > most Kylin users get query responses in less than 1 second on
>>> > > > > average.
>>> > > > >
>>> > > > > But we also see some limitations when putting Cubes into HBase; I
>>> > > > > shared some of them at HBaseCon Asia 2018[1] this August. The
>>> > > > > typical limitations include:
>>> > > > >
>>> > > > >    - Rowkey is the primary index, with no secondary index so far;
>>> > > > >
>>> > > > > Filtering by a row key’s prefix and filtering by its suffix can
>>> > > > > perform very differently. So the user needs to design the row key
>>> > > > > well; otherwise, queries will be slow. This is sometimes difficult
>>> > > > > because the user may not be able to predict the filtering patterns
>>> > > > > ahead of cube design.
>>> > > > >
>>> > > > >    - HBase is a key-value store rather than columnar storage
>>> > > > >
>>> > > > > Kylin combines multiple measures (columns) into fewer column
>>> > > > > families for smaller data size (the row key size is significant).
>>> > > > > As a result, HBase often needs to read more data than requested.
>>> > > > >
>>> > > > >    - HBase can't run on YARN
>>> > > > >
>>> > > > > This makes deployment and auto-scaling a little complicated,
>>> > > > > especially in the cloud.
>>> > > > >
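The row key limitation above can be illustrated with a minimal Python sketch. It assumes a sorted in-memory list standing in for HBase's sorted row keys; the `<country>|<date>` key layout and the helper names are hypothetical and are not Kylin or HBase APIs.

```python
from bisect import bisect_left

# Hypothetical composite row keys "<country>|<date>", kept sorted like HBase row keys.
ROW_KEYS = sorted(
    f"{country}|2018-10-{day:02d}"
    for country in ("CN", "DE", "US")
    for day in range(1, 11)
)

def scan_by_prefix(keys, prefix):
    """Range scan: binary-search to the first candidate, stop at the first non-match."""
    start = bisect_left(keys, prefix)
    matches, touched = [], 0
    for key in keys[start:]:
        touched += 1
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches, touched

def scan_by_suffix(keys, suffix):
    """The sort order does not help with a suffix, so every key is examined."""
    matches = [key for key in keys if key.endswith(suffix)]
    return matches, len(keys)

prefix_hits, prefix_touched = scan_by_prefix(ROW_KEYS, "DE|")
suffix_hits, suffix_touched = scan_by_suffix(ROW_KEYS, "|2018-10-05")
print(len(prefix_hits), "hits,", prefix_touched, "keys touched")  # 10 hits, 11 keys touched
print(len(suffix_hits), "hits,", suffix_touched, "keys touched")  # 3 hits, 30 keys touched
```

In this toy data set the prefix filter touches only the matching range plus one key, while the suffix filter has to touch all 30 keys, which is why the dimension order in the row key matters so much.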
>>> > > > > In a word, HBase is complicated to use as Kylin’s storage.
>>> > > > > Maintenance and debugging are also hard for ordinary developers. We
>>> > > > > are now planning to look for a simple, lightweight, read-only
>>> > > > > storage engine for Kylin. The new solution should have the
>>> > > > > following characteristics:
>>> > > > >
>>> > > > >    - Columnar layout with compression for efficient I/O;
>>> > > > >    - Index by each column for quick filtering and seeking;
>>> > > > >    - MapReduce / Spark API for parallel processing;
>>> > > > >    - HDFS compliant for scalability and availability;
>>> > > > >    - Mature, stable and extensible;
>>> > > > >
>>> > > > > With the plugin architecture[2] introduced in Kylin 1.5, adding
>>> > > > > multiple storages to Kylin is possible. Some companies, like
>>> > > > > Kyligence Inc and Meituan.com, have developed customized storage
>>> > > > > engines for Kylin in their products or platforms. In their
>>> > > > > experience, columnar storage is a good supplement to the HBase
>>> > > > > engine. Kaisen Kang from Meituan.com shared their KOD (Kylin on
>>> > > > > Druid) solution[3] at this August’s Kylin meetup in Beijing.
>>> > > > >
>>> > > > > We plan to do a PoC with Apache Parquet + Apache Spark in the next
>>> > > > > phase. Parquet is a standard columnar file format and is widely
>>> > > > > supported by many projects like Hive, Impala, Drill, etc. Parquet
>>> > > > > is adding a page-level column index to support fine-grained
>>> > > > > filtering. Apache Spark can provide parallel computing over Parquet
>>> > > > > and can be deployed on YARN, Mesos, and Kubernetes. With this
>>> > > > > combination, data persistence and computation are separated, which
>>> > > > > makes scaling in/out much easier than before. Benefiting from
>>> > > > > Spark's flexibility, we can also push down more computation from
>>> > > > > Kylin to the Hadoop cluster. Besides Parquet, Apache ORC is also a
>>> > > > > candidate.
>>> > > > >
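As a rough illustration of what a page-level column index buys, here is a minimal Python sketch. It is not the Parquet API or file layout; the `Page` class, `build_pages`, and `find_rows` are hypothetical names, and the only idea shown is that per-page min/max statistics let a reader skip pages that cannot contain the filter value.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Page:
    first_row: int      # row number of the first value in this page
    values: List[int]   # the column values stored in this page
    min_value: int      # per-page statistics, kept alongside the data
    max_value: int

def build_pages(column: List[int], page_size: int) -> List[Page]:
    """Split one column into fixed-size pages and record min/max per page."""
    pages = []
    for start in range(0, len(column), page_size):
        chunk = column[start:start + page_size]
        pages.append(Page(start, chunk, min(chunk), max(chunk)))
    return pages

def find_rows(pages: List[Page], target: int) -> Tuple[List[int], int]:
    """Return matching row numbers and how many pages actually had to be read."""
    rows, pages_read = [], 0
    for page in pages:
        if target < page.min_value or target > page.max_value:
            continue  # the statistics prove this page cannot match, so skip it
        pages_read += 1
        rows.extend(
            page.first_row + offset
            for offset, value in enumerate(page.values)
            if value == target
        )
    return rows, pages_read

# A sorted dimension column: each value 0..19 appears three times (60 rows, 6 pages).
column = [v // 3 for v in range(60)]
pages = build_pages(column, page_size=10)
rows, pages_read = find_rows(pages, target=3)
print(rows, pages_read, len(pages))  # [9, 10, 11] 2 6
```

The skipping only pays off when values are clustered, as in the sorted column here; on unsorted data most pages would overlap the filter value and still be read.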
>>> > > > > Now I raise this discussion to get your ideas about Kylin’s
>>> > > > > next-generation storage engine. If you have good ideas or any
>>> > > > > related data, you are welcome to discuss them in the community.
>>> > > > >
>>> > > > > Thank you!
>>> > > > >
>>> > > > > [1] Apache Kylin on HBase
>>> > > > > https://www.slideshare.net/ShiShaoFeng1/apache-kylin-on-hbase-extreme-olap-engine-for-big-data
>>> > > > > [2] Apache Kylin Plugin Architecture
>>> > > > > https://kylin.apache.org/development/plugin_arch.html
>>> > > > > [3] A Druid-based Kylin storage engine in practice (基于Druid的Kylin存储引擎实践)
>>> > > > > https://blog.bcmeng.com/post/kylin-on-druid.html
>>> > > > >
>>> > > > > --
>>> > > > > Best regards,
>>> > > > >
>>> > > > > Shaofeng Shi 史少锋
>>> > > >
>>> > >
>>> > >
>>> > > --
>>> > > Best regards,
>>> > >
>>> > > Shaofeng Shi 史少锋
>>> > >
>>> >
>>>
>>
>>
>> --
>> Best regards,
>>
>> Shaofeng Shi 史少锋
>>
>>
>
> --
> Best regards,
>
> Shaofeng Shi 史少锋
>
>

-- 
Best regards,

Shaofeng Shi 史少锋
