[ https://issues.apache.org/jira/browse/CASSANDRA-12073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15347495#comment-15347495 ]
Pavel Yaskevich commented on CASSANDRA-12073: --------------------------------------------- [~doanduyhai] Can you please assign a proper patch so we can run it through CI and integrate? > [SASI] PREFIX search on CONTAINS/NonTokenizer mode returns only partial > results > ------------------------------------------------------------------------------- > > Key: CASSANDRA-12073 > URL: https://issues.apache.org/jira/browse/CASSANDRA-12073 > Project: Cassandra > Issue Type: Bug > Components: CQL > Environment: Cassandra 3.7 > Reporter: DOAN DuyHai > > {noformat} > cqlsh:music> CREATE TABLE music.albums ( > id uuid PRIMARY KEY, > artist text, > country text, > quality text, > status text, > title text, > year int > ); > cqlsh:music> CREATE CUSTOM INDEX albums_artist_idx ON music.albums (artist) > USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': > 'CONTAINS', 'analyzer_class': > 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer', > 'case_sensitive': 'false'}; > cqlsh:music> SELECT * FROM albums WHERE artist like 'lady%' LIMIT 100; > id | artist | country | quality > | status | title | year > --------------------------------------+-----------+----------------+---------+-----------+---------------------------+------ > 372bb0ab-3263-41bc-baad-bb520ddfa787 | Lady Gaga | USA | normal > | Official | Red and Blue EP | 2006 > 1a4abbcd-b5de-4c69-a578-31231e01ff09 | Lady Gaga | Unknown | normal > | Promotion | Poker Face | 2008 > 31f4a0dc-9efc-48bf-9f5e-bfc09af42b82 | Lady Gaga | USA | normal > | Official | The Cherrytree Sessions | 2009 > 8ebfaebd-28d0-477d-b735-469661ce6873 | Lady Gaga | Unknown | normal > | Official | Poker Face | 2009 > 98107d82-e0dd-46bc-a273-1577578984c7 | Lady Gaga | USA | normal > | Official | Just Dance: The Remixes | 2008 > a76af0f2-f5c5-4306-974a-e3c17158e6c6 | Lady Gaga | Italy | normal > | Official | The Fame | 2008 > 849ee019-8b15-4767-8660-537ab9710459 | Lady Gaga | USA | normal > | Official | Christmas Tree | 2008 > 4bad59ac-913f-43da-9d48-89adc65453d2 | Lady Gaga | Australia | normal > | Official | Eh Eh | 2009 > 80327731-c450-457f-bc12-0a8c21fd9c5d | Lady Gaga | USA | normal > | Official | Just Dance Remixes Part 2 | 2008 > 3ad33659-e932-4d31-a040-acab0e23c3d4 | Lady Gaga | Unknown | normal > | null | Just Dance | 2008 > 9adce7f6-6a1d-49fd-b8bd-8f6fac73558b | Lady Gaga | United Kingdom | normal > | Official | Just Dance | 2009 > (11 rows) > {noformat} > *SASI* says that there are only 11 artists whose name starts with {{lady}}. > However, in the data set, there are: > * Lady Pank > * Lady Saw > * Lady Saw > * Ladyhawke > * Ladytron > * Ladysmith Black Mambazo > * Lady Gaga > * Lady Sovereign > etc ... > By debugging the source code, the issue is in > {{OnDiskIndex.TermIterator::computeNext()}} > {code:java} > for (;;) > { > if (currentBlock == null) > return endOfData(); > if (offset >= 0 && offset < currentBlock.termCount()) > { > DataTerm currentTerm = currentBlock.getTerm(nextOffset()); > if (checkLower && !e.isLowerSatisfiedBy(currentTerm)) > continue; > // flip the flag right on the first bounds match > // to avoid expensive comparisons > checkLower = false; > if (checkUpper && !e.isUpperSatisfiedBy(currentTerm)) > return endOfData(); > return currentTerm; > } > nextBlock(); > } > {code} > So the {{endOfData()}} conditions are: > * currentBlock == null > * checkUpper && !e.isUpperSatisfiedBy(currentTerm) > The problem is that {{e::isUpperSatisfiedBy}} is checking not only whether > the term match but also returns *false* when it's a *partial term* ! > {code:java} > public boolean isUpperSatisfiedBy(OnDiskIndex.DataTerm term) > { > if (!hasUpper()) > return true; > if (nonMatchingPartial(term)) > return false; > int cmp = term.compareTo(validator, upper.value, false); > return cmp < 0 || cmp == 0 && upper.inclusive; > } > {code} > By debugging the OnDiskIndex data, I've found: > {noformat} > ... > Data Term (partial ? false) : lady gaga. 0x0, TokenTree offset : 21120 > Data Term (partial ? true) : lady of bells. 0x0, TokenTree offset : 21360 > Data Term (partial ? false) : lady pank. 0x0, TokenTree offset : 21440 > ... > {noformat} > *lady gaga* did match but then *lady of bells* fails because it is a partial > match. Then SASI returns {{endOfData()}} instead continuing to check *lady > pank* > One quick & dirty fix I've found is to add > {{!e.nonMatchingPartial(currentTerm)}} to the *if* condition > {code:java} > if (checkUpper && !e.isUpperSatisfiedBy(currentTerm) && > !e.nonMatchingPartial(currentTerm)) > return endOfData(); > {code} > I feel it kinda dirty and I suspect this bug to has more impact than the > *PREFIX* case > Wdyt [~xedin] [~beobal] [~jrwest] ? -- This message was sent by Atlassian JIRA (v6.3.4#6332)