Now that lxdb has landed, I keep feeling that something is missing. Before lxdb got started, I took a pen and sketched out what the perfect database would look like in my mind. The main design goals have now been met, yet it still falls short of perfect. At the root of it, Lucene, the core lxdb is built on, is not perfect enough: it is not splittable, and it still has shortcomings in cloud-native environments.

In essence, lxdb merges Spark, HBase, and Lucene, three of the most hard-core big data products in the industry, into a single product, just like the Calabash Brothers cartoon I watched as a child: seven gourd brothers merge into one big diamond gourd baby that is far more powerful. It has Spark's powerful OLAP analysis, HBase's highly timely real-time updates, and fast multi-dimensional filtering through Lucene indexes. At first glance it is nearly a perfect product that can cover most scenarios in the big data field: solid distributed storage, distributed computing, high concurrency, and flexibility. Most products on the market cannot combine everything lxdb strives for technically: Kudu's timeliness, Spark's OLAP performance, ES's full-text retrieval, and HBase's high concurrency. But in real use there are some spots that are genuinely unpleasant. Let me give some examples one by one.

== Existing problems ==
1. Processes must stay resident and cannot be used on demand
A drawback Lucene and HBase share is that once the service starts, its processes must stay resident. Whether or not there is a query or a data import in flight, these processes hang around.
What I would much rather have is something like native Spark: when a SQL query comes in, processes are started for it, and once nobody is using them they are gradually reclaimed.
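
To make that concrete, here is a minimal Java sketch of the Spark behavior I mean, using Spark's real dynamic-allocation settings; the app name, executor caps, and table name are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.sql.SparkSession;

    public class OnDemandExecutors {
        public static void main(String[] args) {
            // Executors are requested only while queries run and are reclaimed
            // after sitting idle -- the on-demand behavior we want for lxdb.
            SparkConf conf = new SparkConf()
                .setAppName("on-demand-query")
                .set("spark.dynamicAllocation.enabled", "true")            // grow/shrink with load
                .set("spark.dynamicAllocation.minExecutors", "0")          // nothing resident when idle
                .set("spark.dynamicAllocation.maxExecutors", "50")
                .set("spark.dynamicAllocation.executorIdleTimeout", "60s") // reclaim idle executors
                .set("spark.shuffle.service.enabled", "true");             // required by dynamic allocation

            SparkSession spark = SparkSession.builder().config(conf).getOrCreate();
            spark.sql("SELECT count(*) FROM some_table").show();
            spark.stop();
        }
    }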

2. Different computations over the same data are not separated, so compute resources cannot be isolated
Another drawback of resident processes is that every computation must read through them, while most of the time tasks have priorities: an ad hoc query demands a far faster response than a batch job. We want to give ad hoc queries more and faster resources, and let batch tasks grind away slowly in the background.

Resident processes cause us a lot of trouble here. We would rather separate the computation, isolating different kinds of tasks into different processes, or even onto different compute nodes, so that they cannot affect one another.
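
For a rough picture of the isolation we want, Spark's fair scheduler pools already behave this way. The sketch below assumes a fairscheduler.xml defining hypothetical "adhoc" and "batch" pools with different weights:

    import org.apache.spark.sql.SparkSession;

    public class IsolatedQueues {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                .appName("pool-isolation")
                .config("spark.scheduler.mode", "FAIR")                          // fair scheduling across pools
                .config("spark.scheduler.allocation.file", "fairscheduler.xml") // hypothetical pool definitions
                .getOrCreate();

            // Ad hoc queries run in a high-priority pool...
            spark.sparkContext().setLocalProperty("spark.scheduler.pool", "adhoc");
            spark.sql("SELECT * FROM t WHERE id = 42").show();

            // ...while batch jobs are confined to a low-priority pool.
            spark.sparkContext().setLocalProperty("spark.scheduler.pool", "batch");
            spark.sql("SELECT dim, count(*) FROM t GROUP BY dim").write().parquet("/tmp/out");
        }
    }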

3. The data is not splittable, so the compute resources applied to the same data cannot be adjusted flexibly
For the same piece of data, we often want a very important query to produce its result quickly: I can assign it a great deal of compute so the answer comes back as soon as possible, while unimportant tasks get a handful of processes and run slowly. The problem with HBase and Lucene today is that the data is not splittable: shards cannot be dynamically re-cut or re-sized to match the available compute, and only the fixed processes bound to them can do the computing.

4. Multiple systems cannot interoperate
Much of the time I wish lxdb's index format were more open, able to run inside other systems with no changes at all. Hive works that way: I create a table and declare the Parquet format, and besides Hive itself, Impala can access the data directly, and so can Presto and Spark. A system like that is far more flexible.

The current way HBase and Lucene are bound to processes means that when other systems access the data in lxdb, it must be relayed once through lxdb's service and lxdb's resident processes. That hurts efficiency badly and adds complexity to interoperation between systems. We would much rather interoperate at the file layer, directly through a format like Parquet, with no relay service.
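
To illustrate the file-layer interop: once the data is plain Parquet on shared storage, any engine can open it without going through a server. A minimal Spark (Java) sketch, with a hypothetical lxdb path:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DirectParquetRead {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("direct-read").getOrCreate();

            // Plain Parquet on shared storage: no lxdb server sits in the middle,
            // so Hive, Impala, Presto, or Spark can each read it on their own.
            Dataset<Row> df = spark.read().parquet("hdfs:///lxdb/tables/logs");
            df.filter("level = 'ERROR'").groupBy("host").count().show();
        }
    }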

== How do we plan to solve this problem ==
1. We do not plan to drop Lucene

Lucene is still the king of full-text retrieval and multi-dimensional retrieval; no performance numbers come close to it. I have benchmarked all kinds of data formats and database systems, and in this field nothing surpasses Lucene. It is not merely "one of the best": for inverted indexes, Lucene is simply at the level of "king". The currently popular Solr and Elasticsearch also depend on Lucene, directly or indirectly.

2. We plan to transform Lucene

Lucene's core is the inverted index, and its storage formats cover both forward and inverted structures. We intend to keep these concepts and API interfaces, and the logic stays unchanged.

But the implementations of the inverted and forward structures, today the BlockTree terms dictionary, the block-compressed .fdt stored fields, and the column-oriented docvalues, will be replaced uniformly by Parquet's nested columnar format. We actually find that this nested columnar format looks very much like an inverted index; it is just that most people write Parquet with the data in random order. Once we move Parquet inside the Lucene framework, the sorted nature of the inverted table will make Parquet perform especially well. Moreover, Lucene's original inverted table can only be single-dimensional; after we swap in Parquet it becomes multi-dimensional, and columnar as well.
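
The seam this relies on already exists in Lucene: the Codec API. Below is a minimal sketch of that extension point; the Parquet-backed formats themselves are hypothetical and unwritten, so the overrides simply delegate to keep the sketch compilable:

    import org.apache.lucene.codecs.Codec;
    import org.apache.lucene.codecs.DocValuesFormat;
    import org.apache.lucene.codecs.FilterCodec;
    import org.apache.lucene.codecs.PostingsFormat;
    import org.apache.lucene.codecs.StoredFieldsFormat;

    // A codec can swap out how postings, doc values, and stored fields are
    // encoded while IndexWriter/IndexReader and the query APIs stay unchanged.
    public class ParquetBackedCodec extends FilterCodec {

        public ParquetBackedCodec() {
            // Anything not overridden falls through to the default codec.
            super("ParquetBackedCodec", Codec.getDefault());
        }

        @Override
        public PostingsFormat postingsFormat() {
            // Would return a Parquet-backed PostingsFormat (hypothetical,
            // replacing BlockTree); delegating keeps this sketch compilable.
            return delegate.postingsFormat();
        }

        @Override
        public DocValuesFormat docValuesFormat() {
            // Would return a Parquet-backed DocValuesFormat (hypothetical).
            return delegate.docValuesFormat();
        }

        @Override
        public StoredFieldsFormat storedFieldsFormat() {
            // Would return a Parquet-backed StoredFieldsFormat (hypothetical,
            // replacing the block-compressed .fdt files).
            return delegate.storedFieldsFormat();
        }
    }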

3. After Lucene is transformed, opening an index no longer preloads data into memory. The Parquet format involves no preloading, so without it an index opens much faster, and it can be loaded dynamically in different processes as the computation requires.

4. Reimplementing Lucene's inverted table on Parquet makes it a multi-dimensional inverted table, which in turn lets the inverted table be used for further statistical analysis. Lucene is not good at traversing the inverted table, because a traversal must coordinate several files (.tip, .tim, .doc, .pay, .pos) with roughly 6 to 9 different pointers jumping back and forth between them. That is why, on stock Lucene, a wildcard (*) or similar broad match can drive system resources very high or even crash the process. After switching to the Parquet implementation, these files are merged into one, and Parquet itself traverses very well. Moreover, on top of the Lucene index, Parquet carries a layer of coarse index and skips by block, which makes multi-condition merging combined with statistical analysis more efficient than the original skip list. And while payloads in stock Lucene have no filtering ability at all, filtering can now be done through Parquet. At the statistical analysis level, Parquet will outperform Lucene thanks to fewer random reads.
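
As a hedged illustration of the jump-by-block behavior, parquet-mr's reader already skips row groups whose min/max statistics cannot satisfy a predicate; the column names and path below are made up:

    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetReader;
    import org.apache.parquet.filter2.compat.FilterCompat;
    import org.apache.parquet.filter2.predicate.FilterPredicate;
    import org.apache.parquet.hadoop.ParquetReader;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    import static org.apache.parquet.filter2.predicate.FilterApi.*;
    import static org.apache.parquet.io.api.Binary.fromString;

    public class BlockSkippingRead {
        public static void main(String[] args) throws Exception {
            // Row groups whose statistics cannot match are skipped wholesale:
            // the coarse-index, jump-by-block behavior described above.
            FilterPredicate pred = and(
                eq(binaryColumn("term"), fromString("error")),   // hypothetical columns
                gtEq(longColumn("doc"), 1_000_000L));

            Path path = new Path("hdfs:///lxdb/index/part-0.parquet"); // hypothetical path
            Configuration conf = new Configuration();
            try (ParquetReader<GenericRecord> reader =
                     AvroParquetReader.<GenericRecord>builder(HadoopInputFile.fromPath(path, conf))
                         .withFilter(FilterCompat.get(pred))
                         .build()) {
                GenericRecord rec;
                while ((rec = reader.read()) != null) {
                    System.out.println(rec);
                }
            }
        }
    }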

5. Splitting a table on top of Parquet is easy. Where one index used to hold 10 million records, I can split it into 10 pieces of 1 million records each and hand them to 10 different processes to compute, which is impossible in Lucene.
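
A minimal sketch of why the split is easy: every Parquet row group is an independently readable unit, so assigning row groups to workers is mere bookkeeping (the path and worker count are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.hadoop.ParquetFileReader;
    import org.apache.parquet.hadoop.metadata.BlockMetaData;
    import org.apache.parquet.hadoop.util.HadoopInputFile;

    import java.util.List;

    public class RowGroupSplits {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path path = new Path("hdfs:///lxdb/index/part-0.parquet"); // hypothetical path

            // Each row group can be scanned by a different process in parallel;
            // no single resident process has to own the whole file.
            try (ParquetFileReader reader =
                     ParquetFileReader.open(HadoopInputFile.fromPath(path, conf))) {
                List<BlockMetaData> rowGroups = reader.getFooter().getBlocks();
                int workers = 10;
                for (int i = 0; i < rowGroups.size(); i++) {
                    BlockMetaData rg = rowGroups.get(i);
                    System.out.printf("row group %d (%d rows) -> worker %d%n",
                        i, rg.getRowCount(), i % workers);
                }
            }
        }
    }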

6. Achieve read/write separation in the true sense. After the transformation, the original HBase is used only to maintain data writes and to handle Lucene index merging. It will also create some lightweight real-time snapshots for queries, to keep old segments from being deleted while indexes are merging. All query requests are split off: they can run in lxdb's own query processes, or in Hive, Spark, or even Impala. Because the index is Parquet, wrapping it into an InputFormat or an RDD is enough to read it directly.
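
The snapshot part can lean on a mechanism stock Lucene already has: SnapshotDeletionPolicy pins a commit so the segments it references survive merges until the snapshot is released. A minimal sketch with a placeholder index path:

    import org.apache.lucene.index.IndexCommit;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
    import org.apache.lucene.index.SnapshotDeletionPolicy;
    import org.apache.lucene.store.FSDirectory;

    import java.nio.file.Paths;

    public class SnapshotForReaders {
        public static void main(String[] args) throws Exception {
            SnapshotDeletionPolicy snapshotter =
                new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy());

            IndexWriterConfig iwc = new IndexWriterConfig()
                .setIndexDeletionPolicy(snapshotter);

            try (IndexWriter writer =
                     new IndexWriter(FSDirectory.open(Paths.get("/data/lxdb/shard-0")), iwc)) {
                writer.commit();

                // Pin the current commit: its segments survive merges until the
                // snapshot is released, so external readers stay valid meanwhile.
                IndexCommit snapshot = snapshotter.snapshot();
                try {
                    System.out.println("query engines may read: " + snapshot.getFileNames());
                } finally {
                    snapshotter.release(snapshot); // allow normal segment cleanup again
                    writer.deleteUnusedFiles();
                }
            }
        }
    }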

7. A better fit for cloud-native environments

For real-time writes, we only need to maintain a certain number of HBase processes to keep the Lucene index continuously written in real time, while all queries start on demand in the cloud-native environment and release their resources when no query is running. However, while Lucene is generating data, some of it sits temporarily in memory, which raises some short-lived freshness issues. We are considering using NRT techniques to store this part of the index in a distributed in-memory file system such as Alluxio, or directly in a KV system, but that is not the near-term focus.
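
On the freshness issue, Lucene's NRT reader is the existing mechanism to build on: it makes uncommitted, in-memory segments searchable. A self-contained sketch in which an in-memory directory stands in for Alluxio or a KV store:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class NrtVisibility {
        public static void main(String[] args) throws Exception {
            // In-memory directory as a stand-in for Alluxio or a KV system.
            Directory dir = new ByteBuffersDirectory();
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
                Document doc = new Document();
                doc.add(new StringField("id", "1", Field.Store.YES));
                writer.addDocument(doc);

                // Near-real-time reader: sees the document before any commit.
                try (DirectoryReader reader = DirectoryReader.open(writer)) {
                    IndexSearcher searcher = new IndexSearcher(reader);
                    System.out.println("visible docs: " + searcher.getIndexReader().numDocs());
                }
            }
        }
    }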







f...@lucene.cn  yannian mu
www.lucene.xin
