Re: [Proposal] lxdb - proposal for Apache Incubation

Liang Chen Sun, 28 Feb 2021 02:02:25 -0800

Hi,

It would be better if you could find an experienced IPMC member to help you
prepare the proposal.
Based on Sheng Wu's input, I have one more comment: can you please explain
how LXDB differs from other similar data analysis databases? You could
consider explaining this from a use-case perspective.


Regards
Liang


fp wrote
> Dear Apache Incubator Community,
> 
> 
> Please accept the following proposal for presentation and discussion:
> https://github.com/lucene-cn/lxdb/wiki
> 
> 
> LXDB is a high-performance OLAP and full-text search database. It is based
> on HBase, but replaces HFile with a Lucene index to support more effective
> secondary indexes. It is also based on Spark SQL, so you can use the SQL
> API to access data and run OLAP computations. The Lucene index is stored
> on HDFS (not on local disk).
> 
> 
> In our production systems, LXDB supports 200+ clusters, some of which have
> 1000+ nodes. It ingests 200 billion rows per day (20,000 billion rows in
> total), and one of the largest single tables has 200 million Lucene
> indexes on LXDB.
> 
> 
> Hadoop's father, Doug Cutting, split Nutch into HBase, MapReduce (Hive),
> HDFS, and Lucene. We have merged these separate projects again: LXDB
> equals Spark SQL + HBase + Lucene + Parquet + HDFS; it is a super
> database. It took me 10 years to complete this merging. The purpose is no
> longer a search engine, but a database.
> 
> 
> 
> 
> 
> Best regards
>   yannian mu
> 
> 
> 
> 
> LXDB Proposal
> == Abstract ==
> LXDB is a high-performance OLAP and full-text search database.
> 
> 
> === It is based on HBase, but replaces HFile with a Lucene index to support more effective secondary indexes ===
> We modified the HBase region server and replaced HFile with Lucene: on a
> put, the data is written as a document to Lucene instead of to an HFile.
> The Lucene index is stored on the region server (it is not stored in a
> separate cluster as with Elasticsearch + HBase, which takes two copies of
> the data).
> 
> 
> === It is based on Spark SQL for OLAP ===
> We integrated Spark and HBase together. Usage is as follows:
> 1. Unpack lxdb.tar.gz.
> 2. Configure the hadoop_config path.
> 3. Run start-all.sh to start the cluster.
> LXDB starts Spark through Hadoop YARN, and each Spark executor process
> then starts an embedded HBase region server service.
> 
> 
> You can operate the LXDB database through the Spark SQL API (Hive) or the
> MySQL API.
> 1. The SQL layer uses Spark RDDs plus the HBase scanner to access HBase.
> 2. The SQL conditions (filters or group-by aggregations) are pushed down
> to HBase.
> 3. HBase uses the Lucene index to filter data in the region server.
> Spark, HBase, and Lucene are all embedded and integrated together rather
> than running as separate clusters; that is the difference from a
> Solr/ES + HBase + Spark solution.
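
The three query steps quoted above can be illustrated with a toy model. This is only a sketch under stated assumptions, not LXDB's code: the `Region` class, its dictionary-based inverted index, and the `query` function are hypothetical names, and each row key is assumed to be written once. It shows a filter and a group-by count being evaluated inside each region through an index lookup, with the engine merging only the partial results.

```python
from collections import defaultdict


class Region:
    """Toy stand-in for a region server holding rows plus an inverted index."""

    def __init__(self):
        self.rows = {}                  # row key -> {column: value}
        self.index = defaultdict(set)   # (column, value) -> row keys

    def put(self, key, row):
        # Assumes each key is written once, so no stale postings to clean up.
        self.rows[key] = row
        for column, value in row.items():
            self.index[(column, value)].add(key)

    def count_by(self, filter_column, filter_value, group_column):
        """Evaluate the pushed-down filter and GROUP BY inside the region."""
        partial = defaultdict(int)
        # Index lookup instead of scanning every row in the region.
        for key in self.index.get((filter_column, filter_value), ()):
            partial[self.rows[key][group_column]] += 1
        return dict(partial)


def query(regions, filter_column, filter_value, group_column):
    """Engine side: push the predicate down, then merge partial counts."""
    total = defaultdict(int)
    for region in regions:
        partial = region.count_by(filter_column, filter_value, group_column)
        for group, n in partial.items():
            total[group] += n
    return dict(total)
```

Only the per-group counts travel from the regions to the engine; the rows that fail the filter are never read.
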
> 
> 
> == Background ==
> === Multiple copies of data ===
> Apache HBase + Elasticsearch is the most popular solution for full-text
> search, but it is weak at online analytical processing (OLAP).
> As a result, production systems usually run Spark (or Hive, Impala, or
> Presto), HBase, and Solr/ES at the same time. Multiple copies of the data
> are stored in multiple systems, and each system has a different API, so
> data consistency is difficult to guarantee. For these reasons we merged
> Spark, HBase, and Elasticsearch into one project. Its goal is to use one
> copy of the data, one cluster, and one API to cover OLAP, KV, full-text,
> and other database scenarios.
> 
> 
> === Merging and splitting of Lucene indexes (HStore) across different machines on HDFS ===
> As we all know, Solr/ES store index files on the local file system, so the
> number of shards must be fixed. If we store the index on HDFS, the index
> becomes splittable like an HBase HStore: it can be split or merged across
> machine nodes. This is very useful for a distributed database, because it
> determines how many resources are allocated to a table. The number of
> records in a table usually changes over time, so the number of shards
> always needs adjusting. An index stored on local disk cannot be split
> across different machines, but a Lucene index stored on HDFS can.
> Whether the number of shards can be adjusted flexibly, that is, whether
> the system can scale elastically, is particularly important in a
> distributed database.
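
To make the shard-splitting idea above concrete, here is a minimal sketch. It assumes a shard is just a list of (row key, document) pairs sorted by row key; `split_shard`, `rebalance`, and the `max_rows` threshold are hypothetical names for illustration and say nothing about how LXDB or HBase actually choose split points.

```python
def split_shard(shard):
    """Split a sorted list of (row_key, doc) pairs at its middle row key."""
    mid = len(shard) // 2
    return shard[:mid], shard[mid:]


def rebalance(shards, max_rows):
    """Split every shard that has grown past max_rows until all fit."""
    result = []
    pending = list(shards)
    while pending:
        shard = pending.pop(0)
        if len(shard) > max_rows and len(shard) > 1:
            left, right = split_shard(shard)
            # Put both halves back at the front so key order is preserved.
            pending[:0] = [left, right]
        else:
            result.append(shard)
    return result
```

Because the halves are plain key ranges, each one could be assigned to a different node once the underlying files live on a shared file system.
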
> 
> 
> 
> === Solving the insufficiency of secondary indexes ===
> Some people use HBase secondary indexes, as in the Phoenix project. But
> schemes based on the HBase rowkey have a lot of redundancy: you cannot
> create too many indexes, and the data inflation rate is too high. Using a
> Lucene index instead of a secondary index is therefore the best choice.
> 
> 
> === We add a Lucene index for Spark OLAP ===
> Most OLAP systems, such as Hive, Spark SQL, Impala, and some MPP
> databases, suffer from brute-force scanning and poor data timeliness.
> 1. They use brute-force scans to compute results. Another choice is to add
> an index to the big data. In some cases an index can greatly improve on
> the performance of brute-force scanning. I think that, just as in a
> traditional database, indexing technology can greatly improve database
> performance.
> 2. Most of these databases or systems are offline or batch systems.
> LXDB's goal is real-time append and real-time KV update, just like HBase.
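
The second point, real-time key-value updates on indexed data, can be sketched as follows. This is a toy illustration, not LXDB's implementation: `IndexedTable` is a hypothetical class, and the whitespace tokenizer is an assumption. The pattern is delete-then-add, so a search never returns a key for terms its latest version no longer contains.

```python
from collections import defaultdict


class IndexedTable:
    """Toy key-value table whose inverted index stays in sync on every write."""

    def __init__(self):
        self.rows = {}                  # row key -> latest text
        self.index = defaultdict(set)   # term -> row keys containing the term

    def upsert(self, key, text):
        # Remove postings of the previous version, if any.
        old = self.rows.get(key)
        if old is not None:
            for term in set(old.split()):
                self.index[term].discard(key)
        # Store and index the new version.
        self.rows[key] = text
        for term in set(text.split()):
            self.index[term].add(key)

    def search(self, term):
        return sorted(self.index.get(term, ()))
```

An append is simply an upsert of a key that does not exist yet, so the same code path covers both cases.
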
> 
> 
> == Future ==
> === Lucene on Parquet ===
> In the near future I will change Lucene's tim and tip (inverted index),
> dvd, and dvm files to a format like Parquet or ORC. The goals are:
> * To solve the performance problem of traversing the Lucene index.
> * To solve the problem that opening a Lucene file requires loading files
> such as tip into memory, which makes opening a Lucene index slow.
> * To enable Lucene to store multi-column joint indexes by column, which is
> used to handle logic such as multi-table joins, materialized views, and
> multi-field group-by on the inverted index.
> The current Lucene index has many problems because of too many file
> pointers and its single-column design. We want to modify Lucene to make it
> more suitable for HDFS: not only for full-text retrieval, but also better
> at statistical analysis, as a true database-level index. We also want
> Lucene to be splittable, so that storage can be separated from
> computation.
> 
> 
> 
> 
> === Supporting all kinds of predicate pushdown ===
> We find that combining the computation closely with the data lets us get
> more performance out of the database. An index is only one form of
> pushdown. Another example is storage pushdown: we can store the index on
> SSD devices and the data on SATA devices. We can store data that is often
> grouped together in advance, instead of computing it row by row. We can
> give important tables or columns dedicated devices and resources. HBase
> still lacks these capabilities, and we need to improve them further.
> 
> 
> === Intervening in data distribution ===
> We can use the row key to steer data to different nodes, which makes many
> interesting things possible.
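
One way to picture row-key-based placement is range partitioning. The sketch below is an assumption for illustration only: the `route` function, the split points, and the node names are hypothetical and are not taken from LXDB.

```python
import bisect


def route(row_key, split_points, nodes):
    """Return the node owning row_key under range partitioning.

    split_points must be sorted and have exactly len(nodes) - 1 entries;
    node i owns keys from split_points[i - 1] up to, not including,
    split_points[i].
    """
    return nodes[bisect.bisect_right(split_points, row_key)]
```

With a scheme like this, choosing the row key prefix decides placement: rows that share a prefix such as a user ID sort next to each other and therefore land on the same node, which keeps related data together.
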
> 
> 
> === Resource control, resource isolation ===
> Lucene currently does not support resource isolation, but on HDFS we can
> do it: I can control the priority of SQL queries so that Lucene reads for
> higher-priority queries get faster IO resources.
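
Priority-based IO can be sketched with a simple priority queue. This is a hypothetical illustration, not how LXDB or HDFS schedule reads: `IoScheduler` and its methods are invented names, and a lower number is assumed to mean higher priority.

```python
import heapq
import itertools


class IoScheduler:
    """Serve queued read requests in priority order (lower number first)."""

    def __init__(self):
        self._heap = []
        # Tie-breaker that keeps first-in, first-out order within a priority.
        self._order = itertools.count()

    def submit(self, priority, request):
        heapq.heappush(self._heap, (priority, next(self._order), request))

    def next_request(self):
        priority, _, request = heapq.heappop(self._heap)
        return request
```

A read issued by a high-priority SQL query is submitted with a small number and is served before queued batch reads, while requests of equal priority keep their arrival order.
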
> 
> 
> == Status ==
> In 2011 I released the first open source version at Alibaba. At that
> time, mdrill used 10 nodes with 48 GB machines to support 400 billion
> records. The first index on HDFS comes from this version; it was one year
> ahead of the community. https://github.com/alibaba/mdrill
> 
> 
> In 2014 I stopped updating the mdrill project because I joined Tencent.
> In our team we developed the Hermes project, where we also built Lucene on
> HDFS. Hermes now imports 1000 billion rows of data per day in real time.
> It is the largest database I have ever developed:
> https://plus.tencent.com/bigdata/hermes
> 
> 
> In 2018 I set up my own company called Luxin; "Lu Xin" is the Chinese
> pronunciation of Lucene. As fans of Lucene, we chose lucene.xin as the
> company's web domain and lucene.cn as its mail domain.
> Luxin's first version of LXDB is called LSQL, which means Lucene SQL.
> It uses Lucene (2.5.3) + HDFS + Spark (1.6.3) and is stable; about 200+
> clusters use LSQL. It processes about 200 billion rows per day, with a
> total of 20,000 billion rows in one single cluster (1000 nodes).
> 
> 
> In 2020, during COVID-19, our team decided to develop the next generation
> of LSQL, called LXDB (lx = the pronunciation of Lucene). We added HBase to
> LSQL to solve the update problem. We have now finished the first version
> of LXDB. https://github.com/lucene-cn/lxdb/wiki
> 
> 
> 
> 
> 
> 
> 
> == Known Risks ==
> === Meritocracy ===
> 
> 
> LXDB has been deployed in production and serves more than 200 lines of
> business. It has demonstrated great performance benefits and has proved
> to be a better way to do reporting and analysis on big data. Still, we
> look forward to growing a rich user and developer community.
> 
> 
> === Orphaned products ===
> 
> 
> The core developers currently work full-time for Luxin.
> LXDB is widely adopted by many companies and individuals, so there is no
> realistic chance of it becoming orphaned. We also have a Tencent QQ
> instant messaging group with about 1000 members.
> 
> 
> 
> === Inexperience with Open Source ===
> 
> The core developers are all active users and followers of open source.
> They are already committers and contributors to the LXDB project.
> Developer yannian mu has ten years of experience with open source
> projects such as jstorm https://github.com/alibaba/jstorm and
> mdrill https://github.com/alibaba/mdrill
> 
> 
> 
> 
> === Homogenous Developers ===
> 
> 
> Most of the core developers are from Luxin, because the product was
> previously closed source. Now that LXDB is open sourced, we expect it to
> receive many bug fixes and enhancements from developers not working at
> Luxin. What we learned from the community, we return to the community.
> 
> 
> 
> 
> 
> === Reliance on Salaried Developers ===
> 
> 
> Luxin invested in LXDB as its solution, and some of its key engineers
> are working full time on the project. In addition, since there is a
> growing big data need for scalable solutions, we look forward to other
> Apache developers and researchers contributing to the project. Key to
> addressing the risk of relying on salaried developers from a single
> entity is to increase the diversity of contributors and to actively lobby
> for new ones; Apache LXDB intends to do this.
> 
> 
> === An Excessive Fascination with the Apache Brand ===
> 
> 
> LXDB is proposing to enter incubation at Apache in order to help efforts
> to diversify the committer base, not so much to capitalize on the Apache
> brand. The LXDB project is already in production use inside Luxin, but is
> not expected to be a Luxin product for external customers. As such, the
> LXDB project is not seeking to use the Apache brand as a marketing tool.
> 
> 
> 
> 
> 
> === Documentation ===
> 
> 
> Information about LXDB can be found at https://github.com/lucene-cn/lxdb.
> The following links provide more information about LXDB in open source:
> 
> 
> * Wiki site: https://github.com/lucene-cn/lxdb/wiki
> * Issue tracking: https://github.com/lucene-cn/lxdb/issues
> * Overview: https://github.com/lucene-cn/lxdb/wiki/intro
> * Luxin home page: http://www.lucene.xin
> * LSQL documentation: http://docs.lucene.xin/lsql/v21/
> 
> 
> 
> == Initial Source ==
> 
> 
> LXDB will develop its source code under an Apache license at
> https://github.com/lucene-cn/lxdb.
> 
> 
> 
> 
> 
> 
> === Core Developers ===
> 
> 
> 
> Currently most of the core developers of LXDB work in the research team
> of Luxin.
> 
> 
> - yannian mu (dev)
> - yu chen (dev)
> - guangshi hao (dev)
> - wei sun (dev)
> - qihua zheng (dev)
> - xin wang (dev)
> - qingsong liu (dev)
> - anxing zhou (Tester)
> - jiajun duan (Tester)
> 
> 
> 
> == External Dependencies ==
> 
> All dependencies are managed using Apache Maven.
> Dependency    License               Optional?
> lucene        Apache License 2.0    true
> zookeeper     Apache License 2.0    true
> hbase         Apache License 2.0    true
> spark         Apache License 2.0    true
> hadoop        Apache License 2.0    true
> hive          Apache License 2.0    true
> 
> 
> 
> 
> == Required Resources ==
> 
> 
> === Mailing lists ===
> 
> 
> * lxdb-private (PMC discussion)
> * lxdb-dev (developer discussion)
> * lxdb-user (user discussion)
> * lxdb-commits (SCM commits)
> * lxdb-issues (JIRA issue feed)
> 
> 
> === Subversion Directory ===
> 
> 
> Instead of Subversion, LXDB prefers Git as its source control
> management system: git://git.apache.org/lxdb





