Re: Re: [Proposal] lxdb - proposal for Apache Incubation

2021-02-28 Thread f...@lucene.cn
... didn't choose this scheme in the end. This is the address of my 
improvement project: https://github.com/lucene-cn/lxhadoop
One idea that came to me later is to use the Parquet format for Lucene's 
inverted and forward indexes, so that I can still run multi-condition 
full-text retrieval. Parquet's multi-column layout would also let me avoid the 
random-read performance problem by traversing the inverted lists efficiently.

Differences from Alibaba AnalyticDB
I'm not particularly familiar with AnalyticDB, so I just looked up some 
information through a search engine. If I have misunderstood anything, please 
correct me.
In most respects the two are really similar. AnalyticDB is an excellent 
database, but its technical principles can hardly be found on the Internet. 
From my personal point of view, they may differ in the following ways:
#1) AnalyticDB is a cloud-native data warehouse in the full sense. This is a 
feature they added in the new edition: it supports the separation of storage 
and computing and elastic, time-shared allocation of resources on demand. The 
same piece of data can be served by different computing resources on different 
computing nodes, depending on the computation.
However, lxdb is not a truly cloud-native database. Although we store the 
Lucene index on HDFS, that only separates storage from computing. At present, 
when a Lucene index is opened for the first time, index information such as 
the tip (term index) data must be preloaded into memory, so the index has to 
stay open in a resident process. Therefore lxdb has not been able to separate 
computing from computing, that is, to distribute computing resources to 
different processes according to different queries. This has always been a 
regret for lxdb, and I have been working on it these years.
Cloud-native databases currently have great market potential, and we are 
willing to try. I also believe that changing Lucene in this way is not 
difficult, or at least less difficult than integrating Spark, HBase and Lucene.
#2) AnalyticDB cannot be self-hosted. It runs only on the cloud platform of its 
provider and must be purchased together with the underlying cloud environment, 
which sometimes places more restrictions on users. lxdb is based on the Hadoop 
platform. As long as users have a Hadoop environment, lxdb can start its 
services directly through YARN, which suits both private deployment and 
deployment on the cloud, and it is not tied to any vendor. It is relatively 
open.
#3) AnalyticDB feels more like a batch engine, suited to centralized import and 
batch query. At least its cloud-native mode appears to work this way, or 
perhaps I simply did not find the user manual for real-time import. lxdb, by 
contrast, is a real-time engine with low data latency. Relatively speaking, it 
is easier for a batch engine to become cloud native, while it is harder for a 
real-time engine with millisecond latency to separate storage from computing. 
It needs a snapshot mechanism that records the state of the data at a given 
time, so that computation can be split across different nodes.
#4) According to the official documentation (see "specifications and 
restrictions"), the best configuration is C32. C32 supports fewer than 128 
nodes and a storage capacity of 1 PB. In our production environment, lxdb has 
904 nodes and 50 PB of disk capacity at 70% storage utilization. Of course, 
this comparison may be inaccurate and unfair to ADB.
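
The snapshot mechanism mentioned in #3 can be pictured with a minimal Python sketch. This is only an illustration under assumed names (SegmentStore, commit and read_snapshot are hypothetical and are not lxdb, Lucene or HDFS APIs): writers keep adding immutable segments to shared storage, a commit records which segments exist at that moment, and any compute node can later read exactly that committed view.

```python
from dataclasses import dataclass, field


@dataclass
class SegmentStore:
    """Stands in for shared storage (for example HDFS). Hypothetical."""
    segments: dict = field(default_factory=dict)   # name -> tuple of rows
    snapshots: list = field(default_factory=list)  # each entry: tuple of names

    def write_segment(self, name, rows):
        # Segments are immutable once written; new data means a new segment.
        self.segments[name] = tuple(rows)

    def commit(self):
        # Record which segments are visible right now and return the id.
        self.snapshots.append(tuple(sorted(self.segments)))
        return len(self.snapshots) - 1


def read_snapshot(store, snapshot_id):
    """Could run on any compute node: it sees only the committed segments."""
    rows = []
    for name in store.snapshots[snapshot_id]:
        rows.extend(store.segments[name])
    return rows


store = SegmentStore()
store.write_segment("seg_0", [1, 2])
first = store.commit()
store.write_segment("seg_1", [3])       # written after the first snapshot
print(read_snapshot(store, first))      # does not include seg_1
second = store.commit()
print(read_snapshot(store, second))     # now includes seg_1
```

Because a snapshot is just a list of immutable segment names, readers on different nodes need no resident process that holds the index open, which is the property the paragraph above is asking for.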



f...@lucene.cn  yannian mu
 
From: Ming Wen
Date: 2021-02-28 21:18
To: general
Subject: Re: Re: [Proposal] lxdb - proposal for Apache Incubation
Hi, fp,
Your email is hard to read.
Please change to a normal mail client first.
Back to your proposal, the key concern is not technology, but that the IPMC
cannot evaluate a project when we cannot see anything.
 
Thanks,
Ming Wen, Apache APISIX PMC Chair
Twitter: _WenMing
 
 
On Sun, Feb 28, 2021 at 9:02 PM f...@lucene.cn  wrote:
 
> Hi Furkan Kamaci
>
> Thank you for your proposal, I will start to improve and prepare
>
> 1.Find an experienced mentor to guide you.
>
>  todo
>
> 2.Start to translate your documentation to English.
>
> 3.Open source your project. How can we have a comment on your project if
> we cannot see anything about it?
>
>  give me some time,I discussed with my team, my English is too poor.
>
> 4) Gain contributors to your project. At least you should show your
> intention to have committers/contributors out of your company. Eliminate
> the risk of being non-meritocratic management of the project.
>
> That's what I have to do
>
> 5) 

Re: Re: [Proposal] lxdb - proposal for Apache Incubation

2021-02-28 Thread f...@lucene.cn
Hi Furkan Kamaci,

Thank you for your suggestions. I will start to improve and prepare.

1) Find an experienced mentor to guide you.

     Todo.

2) Start to translate your documentation to English.

3) Open source your project. How can we have a comment on your project if
we cannot see anything about it?

     Please give me some time. I have discussed this with my team; my English is too poor.

4) Gain contributors to your project. At least you should show your
intention to have committers/contributors out of your company. Eliminate
the risk of being non-meritocratic management of the project.

     That's what I have to do.

5) Structure your proposal. Explain why people need this project, which
problems do current projects have and how you managed to handle them. We
should understand is it a bundle of other projects, a completely new
project, or a wrapper of other projects which eliminates the shortcomings
of them.

6) Find a suitable name for your project in order to not try to solve
trademark problems that may lose your time if you enter the incubation.

     OK, I will think of a new name, for example something like hydrogen sql.

f...@lucene.cn  yannian mu
 
From: Furkan KAMACI
Date: 2021-02-28 18:51
To: general
Subject: Re: [Proposal] lxdb - proposal for Apache Incubation
Hi,
 
Actually you have a detailed documentation which explains which approach
you have compared to similar systems and performance metrics of following
them i.e. reducing storage 10 to the 100 times or having low latency
queries.
 
My advices are (some of them are same with Sheng's and Liang's ):
 
1) Find an experienced mentor to guide you.
 
2) Start to translate your documentation to English.
 
3) Open source your project. How can we have a comment on your project if
we cannot see anything about it?
 
4) Gain contributors to your project. At least you should show your
intention to have committers/contributors out of your company. Eliminate
the risk of being non-meritocratic management of the project.
 
5) Structure your proposal. Explain why people need this project, which
problems do current projects have and how you managed to handle them. We
should understand is it a bundle of other projects, a completely new
project, or a wrapper of other projects which eliminates the shortcomings
of them.
 
6) Find a suitable name for your project in order to not try to solve
trademark problems that may lose your time if you enter the incubation.
 
Kind Regards,
Furkan KAMACI
 
On Sun, Feb 28, 2021 at 1:02 PM Liang Chen  wrote:
 
> Hi
>
> It would be better if you could find an experienced IPMC member to help you
> for preparing the proposal.
> Based on Sheng Wu input, i have one more comment : can you please explain
> what are the different with other similar data analysis DB?  you can
> consider explaining from use cases perspective.
>
> Regards
> Liang
>
>
> fp wrote
> > Dear Apache Incubator Community,
> >
> > Please accept the following proposal for presentation and discussion:
> > https://github.com/lucene-cn/lxdb/wiki
> >
> > LXDB is a high-performance,OLAP,full text search database.it`s base on
> > hbase,but replaced hfile with lucene index to support more effective
> > secondary indexes,it`s also base on spark sql,so that you can used sql api
> > to visit data and do olap calculate. and also the lucene index is store on
> > hdfs (not local disk).
> >
> > In our Production System, LXDB supported 200+ clusters,some of the single
> > cluster is 1000+ nodes,insert 200 billion rows  per day ( 2
> > billion rows for total), one of the biggest single table has 200million
> > lucene index on LXDB.
> >
> > Hadoop`s father Doug Cutting cut nutch into HBase, MapReduce (hive), HDFS,
> > Lucene.We have merged these separated projects again,LXDB equals
> > spark sql+hbase+lucene+parquet+hdfs,it is a super database.It took me 10
> > years to complete these merging operations.But the purpose is no longer a
> > search engine, but a database.
> >
> > Best regards
> >   yannian mu
> >
> > LXDB Proposal
> > == Abstract ==
> > LXDB is a high-performance,OLAP,full text search database.
> >
> > === it`s base on hbase,but replaced hfile with lucene index to support
> > more effective secondary indexes.=== 
> > we modify hbase region server ,we  change 

Re: Re: [Proposal] lxdb - proposal for Apache Incubation

2021-02-28 Thread f...@lucene.cn
... based on Spark, and at the core both projects change Spark's underlying 
data structures. We speed up Spark through a distinctive data format such as 
an index. Whether the data has an index, and whether that index is stored on 
local disk or on HDFS, is a significant feature that distinguishes us from 
other analytical databases such as Hive, Spark SQL, Impala and some MPP 
databases. On this point we are consistent with CarbonData.

Our team later spent some effort running a test against CarbonData, and the 
positioning in some directions is still very different.

As for ClickHouse, I had not come across it in many projects before. Then one 
day, when I was recruiting in a chat group, someone asked me: is your product 
as fast as ClickHouse? That is how I learned that there was such a good 
product in the industry.

#1 Coarse-grained index vs. fine-grained index, or index stored by block vs. 
index not stored by block.
In our tests, the write speed of CarbonData and ClickHouse was very fast. We 
tested lxdb and Elasticsearch at the same time; both are based on Lucene, and 
both were an order of magnitude slower than the former two.

#2 Later we found that the main difference lies in how the index is organized. 
One is a per-block index, and the other is an overall global index. The former 
is very fast at ingestion, and it makes it easier to separate index and 
computation; CarbonData is even a truly cloud-native database (ClickHouse 
stores its data locally, so it is not cloud native). A global index has to 
keep merging segments, which costs ingestion performance. But its benefit is 
not only faster single-column filtering, but also faster multi-condition 
combined filtering and more convenient updates. If the former is not handled 
properly, it easily leads to a full scan, and implementing updates on it is 
costly. The latter can be combined with a BitSet or Bloom filter to evaluate 
combined conditions on multiple columns, and a global index is better suited 
to updates. Therefore both lxdb and ES support real-time updates. This is 
where we differ from CarbonData. By comparison we additionally build on HBase, 
mainly to achieve real-time updates at the KV level. In the future, if lxdb 
wants to take a step down the cloud-native road, it will have to make some 
innovations and changes in Lucene's index format.

#3 Because lxdb is bound to HBase, KV-level OLTP is also a future direction.
#4 For statistical analysis, the heavy random reads of the docvalues used by 
Lucene do not perform as well as CarbonData and ClickHouse. For this reason, I 
spent some effort improving the performance of random reads on HDFS, and the 
speed can be increased by 100-200 times. But this requires modifying the HDFS 
code, which would hurt the compatibility of our product on customer platforms 
in the future and would force customers to replace Hadoop with our version. I 
did not choose this scheme in the end. This is the address of my improvement 
project: https://github.com/lucene-cn/lxhadoop
One idea that came to me later is to use the Parquet format for Lucene's 
inverted and forward indexes, so that I can still run multi-condition 
full-text retrieval. Parquet's multi-column layout would also let me avoid the 
random-read performance problem by traversing the inverted lists efficiently.
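
To make the contrast in #2 concrete, here is a minimal Python sketch of the two index styles on a toy table. It is an illustration only, not how lxdb, Lucene, CarbonData or ClickHouse are implemented; the table, the column names and all function names are made up for the example. The per-block index keeps min/max statistics for each block and scans every block it cannot rule out. The global index keeps one bitset per (column, value) pair and answers a multi-column condition by ANDing bitsets.

```python
from collections import defaultdict

# Toy table of (city, status) rows; the row id is the position in the list.
ROWS = [
    ("beijing", "ok"), ("shanghai", "ok"), ("beijing", "error"),
    ("hangzhou", "ok"), ("beijing", "ok"), ("shanghai", "error"),
]
COLUMN_POS = {"city": 0, "status": 1}


def build_block_index(rows, block_size):
    """Per-block index: min/max of every column for each block of rows."""
    blocks = []
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        stats = {
            col: (min(r[pos] for r in chunk), max(r[pos] for r in chunk))
            for col, pos in COLUMN_POS.items()
        }
        blocks.append((start, chunk, stats))
    return blocks


def query_blocks(blocks, conditions):
    """Skip blocks ruled out by min/max, scan the rest row by row."""
    hits, scanned = [], 0
    for start, chunk, stats in blocks:
        if any(not (stats[c][0] <= v <= stats[c][1]) for c, v in conditions):
            continue  # the block cannot contain a match
        for offset, row in enumerate(chunk):
            scanned += 1
            if all(row[COLUMN_POS[c]] == v for c, v in conditions):
                hits.append(start + offset)
    return hits, scanned


def build_global_index(rows):
    """Global index: (column, value) -> bitset of row ids, as a Python int."""
    index = defaultdict(int)
    for row_id, row in enumerate(rows):
        for col, pos in COLUMN_POS.items():
            index[(col, row[pos])] |= 1 << row_id
    return index


def query_global(index, n_rows, conditions):
    """AND the bitsets of all conditions and list the surviving row ids."""
    bits = (1 << n_rows) - 1  # start with every row selected
    for cond in conditions:
        bits &= index.get(cond, 0)
    return [row_id for row_id in range(n_rows) if bits >> row_id & 1]


conditions = [("city", "beijing"), ("status", "ok")]
print(query_blocks(build_block_index(ROWS, 2), conditions))
print(query_global(build_global_index(ROWS), len(ROWS), conditions))
```

On this unsorted toy data the min/max statistics cannot rule out any block, so the block query scans all six rows to find rows 0 and 4, while the global index finds the same rows with one bitset AND per condition. The trade-off described above still applies: the global index must be maintained for every inserted row, which is where the ingestion cost comes from.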



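(When CarbonData appeared in 2015, it was a product that really astonished me. Adding a layer of indexing to big data is what I have been doing all these years, and I did not expect another team in the world to have the same idea: both are based on Hadoop, and both are even launched through Spark on YARN.)
(Both are based on Spark, and at the core both change Spark's underlying data structures, using a distinctive data format such as an index to speed up Spark. So whether there is an index, and whether the index is stored on local disk or on HDFS, is a notable feature that distinguishes us from other analytical databases such as Hive, Spark SQL, Impala and some MPP databases. On this point we are consistent with CarbonData.)

Our team later spent some effort running a test against CarbonData; in some directions the positioning is still very different.

As for ClickHouse, I had rarely come across it in my projects. Then one day, when I was recruiting in a chat group, someone asked me whether my product is as fast as ClickHouse, and that is how I learned that the industry has such an impressive product.

#1 Coarse-grained index vs. fine-grained index; index stored by block vs. index not stored by block
(Our tests found that CarbonData and ClickHouse write extremely fast. We tested lxdb and Elasticsearch at the same time; both are based on Lucene, and both turned out to be an order of magnitude behind the former two.)
#2 (Later we found that the main difference is in how the index is organized: one is a per-block index and the other is an overall global index. The former ingests very fast and makes it easier to separate index and computation; CarbonData is even a truly cloud-native database (ClickHouse stores its data locally and cannot be cloud native). An overall global index has to keep merging segments, which costs ingestion performance. But the benefit is not only faster single-column filtering, but also faster multi-condition combined filtering and easier updates. The former, if not handled well, easily leads to a full scan, and implementing updates on it is expensive. The latter can combine a bitset or Bloom filter to evaluate combined conditions on multiple columns, and a global index is better suited to updates, so both lxdb and ES support real-time updates. This is also where we differ from CarbonData: by comparison we additionally build on HBase, mainly to achieve real-time updates at the KV level. If lxdb wants to take a step down the cloud-native road in the future, it will have to make some innovations and changes in Lucene's index format.)
#3 (Because lxdb is bound to HBase, KV-level OLTP is also a future direction.)
#4 For statistical analysis, the heavy random reads of the docvalues used by Lucene do not perform as well as CarbonData. For this reason I spent some effort improving random-read performance on HDFS, and the speed can be increased by 100 to 200 times. But this requires modifying the HDFS code, which would hurt the compatibility of our product on customer platforms in the future and would force customers to replace Hadoop with our version, so I did not choose this scheme in the end. This is the address of my improvement project: https://github.com/lucene-cn/lxhadoop
One idea I came up with later is to use the Parquet format for Lucene's inverted and forward indexes. That way I can still run multi-condition full-text retrieval, and during retrieval Parquet's multi-column layout lets me avoid the random-read performance problem by traversing the inverted lists efficiently.
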
f...@lucene.cn  yannian mu
 
From: Liang Chen
Date: 2021-02-28 18:02
To: general
Subject: Re: [Proposal] lxdb - proposal for Apache Incubation
Hi
 
It would be better if you could find an experienced IPMC member to help you
fo