Hi,

It would be better if you could find an experienced IPMC member to help you prepare the proposal. Building on Sheng Wu's input, I have one more comment: could you please explain how LXDB differs from other similar data analysis databases? You could consider explaining this from a use-case perspective.

Regards
Liang

fp wrote
> Dear Apache Incubator Community,
>
> Please accept the following proposal for presentation and discussion:
> https://github.com/lucene-cn/lxdb/wiki
>
> LXDB is a high-performance OLAP and full-text search database. It is based
> on HBase, but replaces HFile with a Lucene index to support more efficient
> secondary indexes. It is also based on Spark SQL, so you can use the SQL
> API to access data and run OLAP calculations. The Lucene index is stored
> on HDFS (not on local disk).
>
> In our production systems, LXDB supports 200+ clusters, some of which have
> 1000+ nodes in a single cluster, and inserts 200 billion rows per day
> (20000 billion rows in total). One of the biggest single tables has 200
> million Lucene indexes on LXDB.
>
> Doug Cutting, the father of Hadoop, split Nutch into HBase, MapReduce
> (Hive), HDFS, and Lucene. We have merged these separated projects again:
> LXDB equals Spark SQL + HBase + Lucene + Parquet + HDFS; it is a super
> database. It took me 10 years to complete this merging work. The purpose,
> however, is no longer a search engine but a database.
>
> Best regards
> yannian mu
>
> LXDB Proposal
> == Abstract ==
> LXDB is a high-performance OLAP and full-text search database.
>
> === It is based on HBase, but replaces HFile with a Lucene index to support more efficient secondary indexes ===
> We modified the HBase region server and changed HFile to Lucene: on a put,
> we write a document to Lucene instead of writing data to an HFile. The
> Lucene index is stored on the region server (it is not stored in a
> different cluster as with Elasticsearch + HBase, which takes two copies of
> the data).
>
> === It is based on Spark SQL for OLAP ===
> We integrated Spark and HBase together. Usage looks like this:
> 1. Unpack lxdb.tar.gz.
> 2. Configure the hadoop_config path.
> 3. Run start-all.sh to start the cluster.
> LXDB starts Spark through Hadoop YARN, and each Spark executor process then
> starts an embedded HBase region server service.
>
> You can operate the LXDB database through the Spark SQL API (Hive) or the
> MySQL API.
> 1. The SQL uses a Spark RDD + HBase scanner to access HBase.
> 2. The SQL conditions (filter or group-by aggregation) are pushed down to
> HBase.
> 3. HBase uses the Lucene index to filter data in the region server.
> Spark, HBase, and Lucene are all embedded and integrated together rather
> than running as separate clusters; that is the difference from a
> Solr/ES + HBase + Spark solution.
>
> == Background ==
> === Multiple copies of data ===
> Apache HBase + Elasticsearch is the most popular solution for full-text
> search, but it is weak at online analytical processing. So most of the
> time a production system uses Spark (or Hive, Impala, or Presto), HBase,
> and Solr/ES at the same time. Multiple copies of the data are stored in
> multiple systems, the systems have different APIs, and data consistency is
> difficult to guarantee. For these reasons we merged Spark, HBase, and
> Elastic into one project. Its target is to use one copy of the data, one
> cluster, and one API to cover OLAP, KV, full-text, and other database
> scenarios.
>
> === Merging and splitting of Lucene indexes (HStore) across different machines on HDFS ===
> As we all know, Solr/ES stores files in the local file system, so the
> number of shards must be fixed. If we store the index on HDFS, the index
> becomes splittable like an HBase HStore: it can split or merge across
> machine nodes. This is very useful for a distributed database, because it
> determines how many resources are allocated to a table. Most of the time
> the number of records in a table changes over time, so the number of
> shards always needs adjusting. An index stored locally cannot be split
> across machines, but a Lucene index stored on HDFS can.
> Whether the number of shards can be adjusted flexibly, and whether the
> system can scale elastically, is particularly important in a distributed
> database.
>
> === Solving the insufficiency of secondary indexes ===
> Some people use HBase secondary indexes, as in the Phoenix project.
> However, those schemes are based on the HBase rowkey and carry a lot of
> redundancy: they cannot create too many indexes, and the data inflation
> rate is too high. Using a Lucene index instead of a secondary index is
> therefore the best choice.
>
> === We add a Lucene index for Spark OLAP ===
> Most OLAP systems, such as Hive, Spark SQL, Impala, and some MPP
> databases, suffer from brute-force scanning and poor data timeliness.
> 1. They use brute-force scans to compute over the data. Another choice is
> to add an index to the big data; in some cases using an index can greatly
> improve on the performance of the original brute-force scan. I think that,
> just as in a traditional database, indexing technology can greatly improve
> the speed of the database.
> 2. Another problem with those databases or systems is that most of them
> are offline or batch systems. LXDB's target is real-time append and
> real-time KV update, just like HBase.
>
> == Future ==
> === Lucene on Parquet ===
> In the near future I will change the Lucene tim and tip (inverted index),
> dvd, and dvm files to a Parquet- or ORC-like format.
> This is meant to solve the performance problem of traversing the Lucene
> index, and the problem that opening a Lucene file requires loading files
> such as tip into memory, which makes opening a Lucene index slow. It
> should also enable Lucene to store multi-column joint indexes by column,
> which can be used for logic such as multi-table joins, materialized views,
> and multi-field group-by via the inverted index. The current Lucene index
> has many problems because of too many file pointers and its single-column
> design. We want to modify Lucene to make it more suitable for HDFS, good
> not only at full-text retrieval but also at statistical analysis, so that
> it becomes a real database-level index. We also want Lucene to be
> splittable, so that storage can be separated from computation.
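
As a rough illustration of the index-versus-brute-force-scan argument quoted above, here is a minimal Python sketch. It is not LXDB or Lucene code: the toy rows, the field names, and the helper functions (`brute_force_scan`, `build_inverted_index`, `index_lookup`) are all hypothetical and only show the general idea of answering an equality filter from posting lists instead of checking every row.

```python
from collections import defaultdict

# Hypothetical toy rows; the field names are illustrative, not LXDB's schema.
ROWS = [
    {"id": 1, "city": "beijing", "status": "ok"},
    {"id": 2, "city": "shanghai", "status": "error"},
    {"id": 3, "city": "beijing", "status": "error"},
    {"id": 4, "city": "shenzhen", "status": "ok"},
]


def brute_force_scan(rows, field, value):
    """Check every row; the cost grows with the size of the table."""
    return [row["id"] for row in rows if row[field] == value]


def build_inverted_index(rows, fields):
    """Map (field, value) -> list of row ids, similar in spirit to a posting list."""
    index = defaultdict(list)
    for row in rows:
        for field in fields:
            index[(field, row[field])].append(row["id"])
    return index


def index_lookup(index, conditions):
    """Answer an AND of equality conditions by intersecting posting lists."""
    postings = [set(index.get(cond, [])) for cond in conditions]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


INDEX = build_inverted_index(ROWS, ["city", "status"])

print(brute_force_scan(ROWS, "city", "beijing"))
print(index_lookup(INDEX, [("city", "beijing"), ("status", "error")]))
```

The scan touches all rows for every query, while the lookup only touches the posting lists of the requested values; the trade-off is the extra storage and write cost of maintaining the index.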
>
> === Supporting all kinds of predicate pushdown calculation ===
> We find that if we can combine the calculation method closely with the
> data, we can get more performance out of the database. An index is only
> one way of pushing calculation down. Take storage pushdown as an example:
> we can store the index on SSD devices and the data on SATA devices. We can
> store data that is often grouped together in advance, instead of
> calculating row by row. We can give important tables or columns dedicated
> devices and resources. HBase still lacks these capabilities, and we need
> to improve them further.
>
> === Intervening in data distribution ===
> We can use the row key to direct data to different nodes, which makes many
> interesting things possible.
>
> === Resource control, resource isolation ===
> Lucene currently does not support resource isolation, but on HDFS we can
> do it: I can control the priority of SQL so that Lucene queries with
> higher priority get faster IO resources.
>
> == Status ==
> In 2011 I released the first open source version at Alibaba. At that time,
> mdrill used 10 nodes with 48 GB machines to support 400 billion records.
> The first index on HDFS comes from this version; it was one year ahead of
> the community. https://github.com/alibaba/mdrill
>
> In 2014 I stopped updating the mdrill project because I joined Tencent. In
> our team we developed the hermes project, where we also built Lucene on
> HDFS. hermes now imports 1000 billion rows of data per day in real time.
> It is the largest database I have ever developed.
> https://plus.tencent.com/bigdata/hermes
>
> In 2018 I set up my own company, called Luxin; "Lu Xin" is the Chinese
> pronunciation of Lucene. As fans of Lucene, we chose lucene.xin as the
> company's domain and lucene.cn as its mail domain.
> The first version of Luxin's lxdb is called lsql, which means "Lucene SQL".
> It uses lucene(2.5.3)+hdfs+spark(1.6.3), it is stable, and about 200+
> clusters use lsql.
> It processes about 200 billion rows per day, amounting to 20000 billion
> rows in one single cluster (1000 nodes).
>
> In 2020, during COVID-19, our team decided to develop the next generation
> of lsql, called lxdb (lx = the pronunciation of Lucene). We added HBase to
> lsql to solve the update problem. We have now finished the first version
> of lxdb. https://github.com/lucene-cn/lxdb/wiki
>
> == Known Risks ==
> === Meritocracy ===
>
> lxdb has been deployed in production and serves more than 200 lines of
> business. It has demonstrated great performance benefits and has proved to
> be a better way to do reporting and analysis on big data. Still, we look
> forward to growing a rich user and developer community.
>
> === Orphaned products ===
>
> The core developers currently work full-time for Luxin.
> lxdb is widely adopted by many companies and individuals, so there is no
> realistic chance of it becoming orphaned. We also have a number of
> 1000-member Tencent QQ instant messaging groups.
>
> === Inexperience with Open Source ===
>
> The core developers are all active users and followers of open source.
> They are already committers and contributors to the lxdb project.
> The lead developer, yannian mu, has about ten years of experience with
> open source projects, including jstorm https://github.com/alibaba/jstorm
> and mdrill https://github.com/alibaba/mdrill
>
> === Homogenous Developers ===
>
> Most of the core developers are from Luxin, because the product was
> previously closed source. Once lxdb is open sourced, we expect it to
> receive many bug fixes and enhancements from developers who do not work at
> Luxin. What we learned from the community, we return to the community.
>
> === Reliance on Salaried Developers ===
>
> Luxin has invested in lxdb as the solution, and some of its key engineers
> are working full time on the project.
> In addition, since there is a growing Big Data need for scalable
> solutions, we look forward to other Apache developers and researchers
> contributing to the project. The key to addressing the risk of relying on
> salaried developers from a single entity is to increase the diversity of
> the contributors and to lobby actively for participation; Apache lxdb
> intends to do this.
>
> === An Excessive Fascination with the Apache Brand ===
>
> lxdb is proposing to enter incubation at Apache in order to help efforts
> to diversify the committer base, not so much to capitalize on the Apache
> brand. The lxdb project is already in production use inside Luxin, but is
> not expected to be a Luxin product for external customers. As such, the
> lxdb project is not seeking to use the Apache brand as a marketing tool.
>
> === Documentation ===
>
> Information about lxdb can be found at https://github.com/lucene-cn/lxdb.
> The following links provide more information about lxdb in open source:
>
> * Wiki site: https://github.com/lucene-cn/lxdb/wiki
> * Issue tracking: https://github.com/lucene-cn/lxdb/issues
> * Overview: https://github.com/lucene-cn/lxdb/wiki/intro
> * Luxin home page: http://www.lucene.xin
> * lsql documentation: http://docs.lucene.xin/lsql/v21/
>
> == Initial Source ==
>
> lxdb will develop its source code under an Apache license at
> https://github.com/lucene-cn/lxdb.
>
> === Core Developers ===
>
> Currently most of the core developers of LXDB work in the research team of
> Luxin.
>
> - yannian mu (dev)
> - yu chen (dev)
> - guangshi hao (dev)
> - wei sun (dev)
> - qihua zheng (dev)
> - xin wang (dev)
> - qingsong liu (dev)
> - anxing zhou (Tester)
> - jiajun duan (Tester)
>
> == External Dependencies ==
>
> All dependencies are managed using Apache Maven.
>
> Dependency    License               Optional?
> lucene        Apache License 2.0    true
> zookeeper     Apache License 2.0    true
> hbase         Apache License 2.0    true
> spark         Apache License 2.0    true
> hadoop        Apache License 2.0    true
> hive          Apache License 2.0    true
>
> == Required Resources ==
>
> === Mailing lists ===
>
> * lxdb-private (PMC discussion)
> * lxdb-dev (developer discussion)
> * lxdb-user (user discussion)
> * lxdb-commits (SCM commits)
> * lxdb-issues (JIRA issue feed)
>
> === Subversion Directory ===
>
> Instead of Subversion, LXDB prefers Git as its source control management
> system: git://git.apache.org/lxdb