[ 
https://issues.apache.org/jira/browse/CASSANDRA-11130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15136448#comment-15136448
 ] 

DOAN DuyHai commented on CASSANDRA-11130:
-----------------------------------------

I've thought of one possible way to provide strict {{=}} semantics when 
using StandardAnalyzer.

 On the SASI side, you still hit disk to fetch all matching terms, but then 
you post-process the results to return only exact matches.

 I don't know whether you store the source column value in the SASI index or 
not. If you do, it should be easy. If not, it will be expensive, because we'll 
have to hit the Cassandra SSTables before we can filter out the non-exact 
matches.
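
 To make the idea concrete, here is a minimal sketch of the post-filtering 
step in Python. It is not SASI code: {{analyze}}, {{ToyTokenizedIndex}}, 
{{candidates}} and {{equals}} are hypothetical names, the analyzer is a toy 
lowercase/whitespace tokenizer (no stemming or stop words), and it assumes the 
easy case where the source column value is available next to the index.

```python
from collections import defaultdict


def analyze(text):
    """Toy stand-in for a tokenizing analyzer: lowercase, split on whitespace."""
    return text.lower().split()


class ToyTokenizedIndex:
    """Hypothetical in-memory index; assumes the source value is stored with it."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of row ids
        self.source = {}                  # row id -> original column value

    def add(self, row_id, value):
        self.source[row_id] = value
        for term in analyze(value):
            self.postings[term].add(row_id)

    def candidates(self, literal):
        """Rows whose tokens include every token of the literal (LIKE-style)."""
        terms = analyze(literal)
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result

    def equals(self, literal):
        """Strict '=': fetch term matches, then keep only exact source matches."""
        return sorted(r for r in self.candidates(literal)
                      if self.source[r] == literal)


index = ToyTokenizedIndex()
index.add(1, "Yesterday")
index.add(2, "So Yesterday")
index.add(3, "Yesterday Rules")

print(index.candidates("Yesterday"))  # all three rows match on the term
print(index.equals("Yesterday"))      # only row 1 survives the post-filter
```

 If the source value is not stored with the index, the comparison inside 
{{equals}} would instead require reading each candidate row from the 
SSTables, which is the expensive case described above.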

> [SASI Pre-QA] = semantics not respected when using StandardAnalyzer
> -------------------------------------------------------------------
>
>                 Key: CASSANDRA-11130
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11130
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CQL
>         Environment: Tested from build 
> [CASSANDRA-11067|https://issues.apache.org/jira/browse/CASSANDRA-11067]
>            Reporter: DOAN DuyHai
>            Assignee: Pavel Yaskevich
>
> Tested from build 
> [CASSANDRA-11067|https://issues.apache.org/jira/browse/CASSANDRA-11067]
> {code:sql}
> CREATE KEYSPACE music WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': '1'}  AND durable_writes = true;
> CREATE TABLE music.albums (
>     id int PRIMARY KEY,
>     artist text,
>     title1 text,
>     title2 text
> );
> CREATE CUSTOM INDEX ON music.albums (title1) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = 
> {'tokenization_skip_stop_words': 'true', 'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer', 
> 'case_sensitive': 'false', 'mode': 'PREFIX', 'tokenization_enable_stemming': 
> 'true'};
> CREATE CUSTOM INDEX ON music.albums (title2) USING 
> 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = 
> {'tokenization_skip_stop_words': 'true', 'analyzer_class': 
> 'org.apache.cassandra.index.sasi.analyzer.StandardAnalyzer', 
> 'case_sensitive': 'false', 'mode': 'CONTAINS', 
> 'tokenization_enable_stemming': 'true'};
> INSERT INTO music.albums(id, artist, title1, title2) 
> VALUES(1, 'Superpitcher', 'Yesterday', 'Yesterday');
> INSERT INTO music.albums(id, artist, title1, title2) 
> VALUES(2, 'Hilary Duff', 'So Yesterday', 'So Yesterday');
> INSERT INTO music.albums(id, artist, title1, title2) 
> VALUES(3, 'The Mr. T Experience', 'Yesterday Rules', 'Yesterday Rules');
> SELECT artist,title1 FROM music.albums WHERE title1='Yesterday';
>  artist                 | title1
> ------------------------+----------------
>            Superpitcher |       Yesterday
>             Hilary Duff |    So Yesterday
>    The Mr. T Experience | Yesterday Rules
>  
> (3 rows)
> SELECT artist,title1 FROM music.albums WHERE title2='Yesterday';
>  artist                 | title1
> ------------------------+----------------
>            Superpitcher |       Yesterday
>             Hilary Duff |    So Yesterday
>    The Mr. T Experience | Yesterday Rules
>   
> (3 rows)
> {code}
> The semantics of *=* are not respected. SASI should return only 1 row, the 
> exact match. Using *LIKE* would return all 3 rows. This affects both 
> *PREFIX* and *CONTAINS* modes. Using *NonTokenizingAnalyzer* returns 1 row, 
> the exact match.
>  So the semantics of *=* depend on the chosen analyzer, which is 
> inconsistent. We should force *=* to be an exact match no matter which 
> analyzer is chosen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
