Re: [Proposal] lxdb - proposal for Apache Incubation

Juan Pan Sun, 28 Feb 2021 20:49:58 -0800

Hi,


My +1 for the suggestions and summary from Furkan KAMACI.
These are indeed concerns shared by many IPMC members, I guess.
Some of the items will take you plenty of time to handle,
so I am unsure whether now is the best time for you to propose.
But at least I suppose you now have a direction for improvement.


Sincerely,
Trista



-------------------------------------------------------
Email：panj...@apache.org
Juan Pan(Trista) Apache ShardingSphere


On 02/28/2021 18:51, Furkan KAMACI <furkankam...@gmail.com> wrote:
Hi,

Actually, you have detailed documentation that explains your approach
compared to similar systems and gives the resulting performance metrics,
e.g. reducing storage by 10 to 100 times or having low-latency
queries.

My advice is (some points are the same as Sheng's and Liang's):

1) Find an experienced mentor to guide you.

2) Start to translate your documentation to English.

3) Open source your project. How can we comment on your project if
we cannot see anything about it?

4) Gain contributors to your project. At least you should show your
intention to have committers/contributors from outside your company. Eliminate
the risk of non-meritocratic management of the project.

5) Structure your proposal. Explain why people need this project, which
problems current projects have, and how you managed to handle them. We
should understand whether it is a bundle of other projects, a completely new
project, or a wrapper around other projects that eliminates their
shortcomings.

6) Find a suitable name for your project so that you do not have to solve
trademark problems, which may cost you time once you enter incubation.

Kind Regards,
Furkan KAMACI


On Sun, Feb 28, 2021 at 1:02 PM Liang Chen <chenliang6...@gmail.com> wrote:

Hi

It would be better if you could find an experienced IPMC member to help you
prepare the proposal.
Based on Sheng Wu's input, I have one more comment: can you please explain
how it differs from other similar data analysis DBs? You could
consider explaining from a use case perspective.

Regards
Liang


fp wrote
Dear Apache Incubator Community,


Please accept the following proposal for presentation and discussion:
https://github.com/lucene-cn/lxdb/wiki


LXDB is a high-performance OLAP and full-text search database. It is based on
HBase, but replaces HFile with a Lucene index to support more effective
secondary indexes. It is also based on Spark SQL, so you can use the SQL API
to access data and run OLAP calculations. In addition, the Lucene index is
stored on HDFS (not on local disk).


In our production systems, LXDB supports 200+ clusters, and some single
clusters have 1000+ nodes, inserting 200 billion rows per day (20000
billion rows in total). One of the biggest single tables has 200 million
Lucene indexes on LXDB.


Doug Cutting, the father of Hadoop, split Nutch into HBase, MapReduce (Hive),
HDFS, and Lucene. We have merged these separate projects again: LXDB equals
Spark SQL + HBase + Lucene + Parquet + HDFS; it is a super database. It took me 10
years to complete this merging work. But the goal is no longer a
search engine, but a database.





Best regards
  yannian mu




LXDB Proposal
== Abstract ==
LXDB is a high-performance OLAP and full-text search database.


=== It is based on HBase, but replaces HFile with a Lucene index to support more effective secondary indexes ===
We modified the HBase region server and changed HFile to Lucene: on a put,
we write a document to Lucene instead of writing data to an HFile.
The Lucene index is stored on the region server (it is not stored in a
separate cluster as with Elasticsearch + HBase, which requires two copies of the data).


=== It is based on Spark SQL for OLAP ===
We integrated Spark and HBase together. Usage looks like this:
1. Unpack lxdb.tar.gz.
2. Configure the hadoop_config path.
3. Run start-all.sh to start the cluster.
LXDB starts Spark through Hadoop YARN, and then each Spark executor
process starts an embedded HBase region server service.


You can operate the LXDB database through the Spark SQL API (Hive) or the MySQL API.
1. The SQL uses a Spark RDD + HBase scanner to access HBase.
2. The SQL conditions (filter or group-by aggregation) are pushed down to HBase.
3. HBase uses the Lucene index to filter data in the region server.
Spark, HBase, and Lucene are all embedded and integrated together; they are
not separate clusters. That is the difference from a Solr/ES +
HBase + Spark solution.
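
The three query steps above can be illustrated with a toy model. This is only a sketch under stated assumptions, not LXDB code: the `Region` class, its `scan` method, and the `query` helper are hypothetical names, and a Python dictionary stands in for the Lucene index. It shows why pushing a filter down to the region reduces the number of rows shipped to the compute layer.

```python
from collections import defaultdict


class Region:
    """Toy stand-in for one region: rows plus a per-column inverted index."""

    def __init__(self, rows):
        self.rows = dict(rows)            # row key -> {column: value}
        self.index = defaultdict(set)     # (column, value) -> set of row keys
        for key, row in self.rows.items():
            for column, value in row.items():
                self.index[(column, value)].add(key)

    def scan(self, column=None, value=None):
        """Return rows; if a predicate was pushed down, filter via the index."""
        if column is None:
            return list(self.rows.items())          # full scan, no pushdown
        keys = sorted(self.index.get((column, value), ()))
        return [(k, self.rows[k]) for k in keys]


def query(regions, column, value, pushdown=True):
    """Return (matching rows, number of rows shipped to the compute layer)."""
    shipped = 0
    matches = []
    for region in regions:
        rows = region.scan(column, value) if pushdown else region.scan()
        shipped += len(rows)
        # The compute layer re-checks the predicate in both modes.
        matches.extend((k, r) for k, r in rows if r.get(column) == value)
    return matches, shipped
```

Both modes return the same rows; the difference is only in how many rows leave the region, which is the point of evaluating the filter next to the data.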


== Background ==
=== Multiple copies of data ===
Apache HBase + Elasticsearch is the most popular solution for full-text
search, but it is weak at online analytical processing (OLAP).
So most of the time a production system uses Spark (or Hive, Impala, or
Presto), HBase, and Solr/ES at the same time. Multiple copies of the data are
stored in multiple systems, and those systems have different APIs. Data consistency
is difficult to guarantee. For these reasons we merged
Spark, HBase, and Elasticsearch into one project. Its goal is to use one copy of the
data, one cluster, and one API to serve OLAP, KV, full-text, and other database scenarios.


=== Merging and splitting of Lucene indexes (HStore) across different machines on HDFS ===
As we all know, Solr/ES store files in the local file system, so the shard count must
be fixed. But if we store the index on HDFS, the index becomes splittable like
an HBase HStore: it can be split or merged across machine nodes. This is very
useful for a distributed database, because it determines how many resources are
allocated to a table. Most of the time the number of records in a table changes over time, so
the number of shards always needs adjusting. If the index is stored locally it cannot be split
across different machines, but a Lucene index stored on HDFS can be.
Whether the number of shards can be flexibly adjusted, that is, whether the system has the
ability to scale elastically, is particularly important in a distributed database.
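
As a rough illustration of adjustable shard counts, here is a minimal sketch of key-range sharding with split and merge. It is not LXDB's implementation: the `RangeShards` class, the `max_size` threshold, and the median split point are assumptions chosen for the example, loosely modeled on how range-partitioned stores split a shard that has grown too large.

```python
from bisect import bisect_right, insort


class RangeShards:
    """Toy key-range sharding: shard i holds keys >= bounds[i] and < bounds[i + 1]."""

    def __init__(self):
        self.bounds = [""]    # lower bound per shard; "" sorts before any other key
        self.shards = [[]]    # sorted, unique keys per shard

    def _locate(self, key):
        """Index of the shard whose key range contains `key`."""
        return bisect_right(self.bounds, key) - 1

    def put(self, key, max_size=4):
        """Insert a key and split the shard at its median if it grows too large."""
        i = self._locate(key)
        shard = self.shards[i]
        if key in shard:
            return
        insort(shard, key)
        if len(shard) > max_size:
            mid = len(shard) // 2
            left, right = shard[:mid], shard[mid:]
            self.shards[i:i + 1] = [left, right]
            self.bounds.insert(i + 1, right[0])

    def merge(self, i):
        """Merge shard i with its right-hand neighbour."""
        self.shards[i:i + 2] = [self.shards[i] + self.shards[i + 1]]
        del self.bounds[i + 1]
```

Because a split or merge only changes the boundary list and which key ranges belong together, the number of shards can follow the table's size. With a shared file system underneath, a new shard could in principle be served by any node, which is the property the paragraph above argues for.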



=== Solving the insufficiency of secondary indexes ===
Some people use HBase secondary indexes, as in the Phoenix project. But those
schemes are based on the HBase rowkey and carry a lot of redundancy: you cannot
create too many indexes, and the data inflation rate is too high. So using a Lucene index
instead of a secondary index is the best choice.


=== We add a Lucene index for Spark OLAP ===
Most OLAP systems, such as Hive, Spark SQL, Impala, or some MPP databases,
suffer from brute-force scanning and poor data timeliness.
1. They use brute-force scans to compute over the data. Another choice is to add
an index to the big data. In some cases using an index can greatly improve on
the performance of the original brute-force scan. I think that, just
as in a traditional database, indexing technology can greatly improve
the speed of the database.
2. Another problem of those databases or systems is that most of them are
offline or batch systems. LXDB's goal is real-time append and real-time
KV update, just like HBase.
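
To make the contrast between index lookup and brute-force scanning concrete, here is a minimal sketch of an inverted index with immediate (append-time) indexing. It is an illustration only: the `InvertedIndex` class and `brute_force` function are hypothetical, whitespace tokenization stands in for a real analyzer, and re-adding an existing document id is not handled.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy inverted index: term -> set of document ids (posting list)."""

    def __init__(self):
        self.docs = {}                       # doc id -> original text
        self.postings = defaultdict(set)     # term -> doc ids containing it

    def add(self, doc_id, text):
        """Append a document; it is searchable as soon as this returns."""
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, *terms):
        """Ids of documents containing all terms, via posting-list intersection."""
        if not terms:
            return []
        lists = sorted((self.postings.get(t.lower(), set()) for t in terms), key=len)
        result = set(lists[0])               # start from the shortest posting list
        for posting in lists[1:]:
            result &= posting
        return sorted(result)


def brute_force(docs, *terms):
    """Same query answered by scanning and tokenizing every document."""
    if not terms:
        return []
    wanted = {t.lower() for t in terms}
    return sorted(d for d, text in docs.items() if wanted <= set(text.lower().split()))
```

The scan touches every document on every query, while the index only touches the posting lists of the query terms, so the cost of a selective query depends on the number of matches rather than on table size.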


== Future ==
=== Lucene on Parquet ===
In the near future I will change the Lucene tim, tip (inverted index), dvd, and dvm files
to a Parquet- or ORC-like format. The goals are:
* To solve the performance problem of traversing the Lucene index.
* To solve the problem that opening a Lucene file requires loading files such as tip into memory, which makes opening a Lucene index file slow.
* To enable Lucene to store multi-column joint indexes by column, which is used to handle logic such as multi-table joins, materialized views, and multi-field group-by via the inverted index.
The current Lucene index has many problems because of too many file pointers
and its single-column design. We want to modify Lucene
to make it more suitable for HDFS, not only for full-text retrieval but
also for statistical analysis, making it a real database-level
index. We want Lucene to be splittable, so that storage can be separated from
computation.




=== Supporting all kinds of predicate pushdown calculation ===
We find that if we combine the calculation method closely with the data,
we can get more performance out of the database. An index
is only one way of pushing calculation down. Another example is storage pushdown:
we can store the index on SSD devices and the data on SATA
devices. We can store data that is often grouped together in advance,
instead of calculating row by row. We can give important tables or
columns dedicated devices and resources. HBase still lacks these
capabilities, which we need to improve further.


=== Intervening in data distribution ===
We can use the row key to steer data to different nodes, which makes
many interesting things possible.


=== Resource control, resource isolation ===
Lucene currently does not support resource isolation, but on HDFS
we can do it. I can control the priority of SQL so that Lucene queries with higher
priority get faster IO resources.


== Status ==
In 2011 I released the first open source version at Alibaba. At
that time, mdrill used 10 nodes of 48 GB machines to support 400 billion data records.
The first index on HDFS comes from this version. It was one year ahead of the
community. https://github.com/alibaba/mdrill


In 2014 I stopped updating the mdrill project because I joined
Tencent. In our team we developed the Hermes project, where we also built
Lucene on HDFS. Hermes now imports 1000 billion rows of data per
day in real time. It's the largest database I've ever developed.
https://plus.tencent.com/bigdata/hermes


In 2018 I set up my own company, called Luxin. Lu Xin is the Chinese
pronunciation of Lucene. As a fan of Lucene, I chose lucene.xin as the company's
domain and lucene.cn as its mail domain.
The first version of Luxin's lxdb is called lsql, which means Lucene SQL.
It uses lucene(2.5.3)+hdfs+spark(1.6.3) and is stable; about 200+
clusters use lsql. It processes about 200 billion rows per day, for a total of
20000 billion rows in one single cluster (1000 nodes).


In 2020, during COVID-19, our team decided to develop the next
generation of lsql, called lxdb (lx = the pronunciation of Lucene). We added
HBase to lsql to solve the update problem. We have now finished the
first version of lxdb. https://github.com/lucene-cn/lxdb/wiki







== Known Risks ==
=== Meritocracy ===


lxdb has been deployed in production and serves more than 200 lines
of business. It has demonstrated great performance benefits and has proved
to be a better way to do reporting and analysis on big data. Still, we
look forward to growing a rich user and developer community.


=== Orphaned products ===


The core developers currently work full-time for Luxin.
lxdb is widely adopted by many companies and individuals. There's no
realistic chance of it becoming orphaned. We also have a number of 1000-person
Tencent QQ instant messaging groups.



=== Inexperience with Open Source ===

The core developers are all active users and followers of open source.
They are already committers and contributors to the lxdb project.
Developer yannian mu has ten years of experience on open source projects: jstorm
https://github.com/alibaba/jstorm and
mdrill https://github.com/alibaba/mdrill




=== Homogenous Developers ===


Most of the core developers are from Luxin because the product has been closed
source. But once lxdb is open sourced, we expect it to receive many bug
fixes and enhancements from developers not working at Luxin. What we
learned from open source, we will return to open source.





=== Reliance on Salaried Developers ===


Luxin has invested in lxdb as its solution, and some of its key engineers
are working full time on the project. In addition, since there is a
growing Big Data need for scalable solutions, we look forward to other
Apache developers and researchers contributing to the project. Key
to addressing the risk associated with relying on salaried developers
from a single entity is to increase the diversity of the contributors and
actively lobby for it; Apache lxdb intends to do this.


=== An Excessive Fascination with the Apache Brand ===


Lxdb is proposing to enter incubation at Apache in order to help efforts
to diversify the committer-base, not so much to capitalize on the Apache
brand. The Lxdb project is in production use already inside lxdb, but is
not expected to be an lxdb product for external customers. As such, the
lxdb project is not seeking to use the Apache brand as a marketing tool.





=== Documentation ===


Information about lxdb can be found at https://github.com/lucene-cn/lxdb.
The following links provide more information about lxdb in open source:


* wiki site: https://github.com/lucene-cn/lxdb/wiki
* Issue Tracking: https://github.com/lucene-cn/lxdb/issues
* Overview: https://github.com/lucene-cn/lxdb/wiki/intro
* Luxin home page: http://www.lucene.xin

* lsql document: http://docs.lucene.xin/lsql/v21/



== Initial Source ==


lxdb source code will be developed under an Apache license at
https://github.com/lucene-cn/lxdb.






=== Core Developers ===



Currently most of the core developers of LXDB work in the research
team of Luxin.


- yannian mu (dev)
- yu chen (dev)
- guangshi hao (dev)
- wei sun (dev)
- qihua zheng (dev)
- xin wang (dev)
- qingsong liu (dev)
- anxing zhou (Tester)
- jiajun duan (Tester)



== External Dependencies ==

All dependencies are managed using Apache Maven.

Dependency    License               Optional?
lucene        Apache License 2.0    true
zookeeper     Apache License 2.0    true
hbase         Apache License 2.0    true
spark         Apache License 2.0    true
hadoop        Apache License 2.0    true
hive          Apache License 2.0    true




== Required Resources ==


=== Mailing lists ===


 * lxdb-private (PMC discussion)
 * lxdb-dev (developer discussion)
 * lxdb-user (user discussion)
 * lxdb-commits (SCM commits)
 * lxdb-issues (JIRA issue feed)


=== Subversion Directory ===


Instead of Subversion, LXDB prefers Git as its source control
management system: git://git.apache.org/lxdb





