ZhangYu0123 opened a new pull request, #19021:
URL: https://github.com/apache/doris/pull/19021
# Proposed changes
**Support a token_bf index for token search:**
1. The token_bf index is mainly intended to speed up exact-token search over
English text. It splits each value into tokens at non-alphanumeric characters
and builds a bloom filter from those tokens. It can accelerate queries that use
like, not like, startsWith, in, not in, and endsWith. This PR only supports
like.
2. Compared with the ngram_bf index, on English text:
(1) The token_bf index gives roughly a 100% speedup.
(2) It does not require an n-gram size parameter (`gram_size`).
3. Compared with the inverted index:
The token_bf index is case sensitive.
4. Limitation:
In a like '%xxx%' query the token_bf index is not used, because the bloom
filter records whole tokens and cannot match part of a token. Use
like '% xxx %' or the hastoken(xxx) function instead.
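To make the splitting rule and the limitation concrete, here is a minimal Python sketch. It is not the Doris implementation: the ASCII alphanumeric tokenizer, the SHA-256-based hashing, the filter parameters, and the helper names (`tokenize`, `TokenBloomFilter`, `like_pattern_tokens`) are all assumptions made for illustration, and LIKE escape sequences are ignored.

```python
import hashlib
import re

WILDCARDS = {"%", "_"}  # LIKE wildcards; escape handling is omitted (assumption)


def tokenize(text):
    """Split on runs of non-alphanumeric ASCII characters; case is preserved."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text) if t]


class TokenBloomFilter:
    """Toy bloom filter over whole tokens (hypothetical parameters)."""

    def __init__(self, size_bytes=1024, num_hashes=3):
        self.size_bits = size_bytes * 8
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bytes)

    def _positions(self, token):
        # Double hashing: derive num_hashes bit positions from one digest.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.size_bits for i in range(self.num_hashes)]

    def add_text(self, text):
        for token in tokenize(text):
            for pos in self._positions(token):
                self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, token):
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(token)
        )


def like_pattern_tokens(pattern):
    """Return tokens that a LIKE pattern guarantees to appear as whole tokens.

    A token touching a wildcard may be only part of a longer token,
    so it cannot be looked up in the filter.
    """
    tokens = []
    for m in re.finditer(r"[0-9A-Za-z]+", pattern):
        before = pattern[m.start() - 1] if m.start() > 0 else ""
        after = pattern[m.end()] if m.end() < len(pattern) else ""
        if before not in WILDCARDS and after not in WILDCARDS:
            tokens.append(m.group())
    return tokens


bf = TokenBloomFilter()
bf.add_text("https://www.Example.com/search?q=doris")

print(like_pattern_tokens("%doris%"))    # no usable token: the index is skipped
print(like_pattern_tokens("% doris %"))  # one whole token: the filter can be probed
print(bf.might_contain("doris"))
```

In this sketch a pattern such as '%doris%' yields no whole token, so the filter cannot be consulted, while '% doris %' yields the token `doris`. Because tokens are stored as-is, a lookup for `example` is not guaranteed to match a stored `Example`.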
**Test:**
20 million rows, BUCKETS 1
```
CREATE TABLE IF NOT EXISTS hits_url4 (
    UserID int,
    url text DEFAULT '',
    url_ngram3 text DEFAULT '',
    url_ngram6 text DEFAULT '',
    url_token text DEFAULT '',
    url_inverted text DEFAULT '',
    INDEX idx_ngrambf (`url_ngram3`) USING NGRAM_BF
        PROPERTIES("gram_size"="3", "bf_size"="1024") COMMENT 'url_ngram ngram_bf index',
    INDEX idx_ngrambf2 (`url_ngram6`) USING NGRAM_BF
        PROPERTIES("gram_size"="6", "bf_size"="1024") COMMENT 'url_ngram ngram_bf index',
    INDEX url_token (`url_token`) USING TOKEN_BF
        PROPERTIES("bf_size"="1024") COMMENT 'url_token_bf index',
    INDEX idx_inverted (`url_inverted`) USING INVERTED
        PROPERTIES("parser"="english") COMMENT 'url_inverted index'
)
DUPLICATE KEY(UserID)
DISTRIBUTED BY HASH(UserID) BUCKETS 1
PROPERTIES("replication_num" = "1")
```
| index type | query time | speedup |
|--------|--------|--------|
| none | 0.76s <img width="618" alt="image" src="https://user-images.githubusercontent.com/67053339/233016348-dca7b81d-1ff8-4fb2-811a-02c09d7f8ce3.png"> | - |
| ngram_bf gram=6 | 0.56s <img width="656" alt="image" src="https://user-images.githubusercontent.com/67053339/233034418-9b304548-b1c4-429d-8321-ef8c56fdc8f1.png"> | 36% |
| ngram_bf gram=3 | 0.17s <img width="666" alt="image" src="https://user-images.githubusercontent.com/67053339/233015812-a425c8b5-cfd2-48b1-9f32-0cbe0bc34409.png"> | 347% |
| token_bf | 0.08s <img width="667" alt="image" src="https://user-images.githubusercontent.com/67053339/233014026-8f969ecf-b2ba-4c8f-9c7e-381a434a5bc6.png"> | 850% |
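
The percentages in the table are consistent with reading the speedup as the no-index time divided by the indexed time, minus one. The short Python check below assumes that reading; the helper name `speedup_percent` is hypothetical, and the timings are the ones reported in the table.

```python
def speedup_percent(baseline_seconds, indexed_seconds):
    """Relative speedup over the baseline, as a rounded percentage."""
    return round((baseline_seconds / indexed_seconds - 1) * 100)


baseline = 0.76  # seconds, no index
timings = {
    "ngram_bf gram=6": 0.56,
    "ngram_bf gram=3": 0.17,
    "token_bf": 0.08,
}

# Compute the speedup of each index type relative to the unindexed query.
speedups = {name: speedup_percent(baseline, t) for name, t in timings.items()}
for name, pct in speedups.items():
    print(f"{name}: {pct}%")
```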
Issue Number: close #xxx
## Problem summary
Describe your changes.
## Checklist(Required)
* [ ] Does it affect the original behavior
* [ ] Have unit tests been added
* [ ] Has documentation been added or modified
* [ ] Does it need to update dependencies
* [ ] Does this PR support rollback (If NO, please explain WHY)
## Further comments
If this is a relatively large or complex change, kick off the discussion at
[[email protected]](mailto:[email protected]) by explaining why you
chose the solution you did and what alternatives you considered, etc...