Thank you for the investigation. Will wait for a fix and news.

Probably it’s not a directly related question but what do you think about 
CASSANDRA-12573? Let me know if it’s better to create a separate thread for it.

Thanks,
Mikhail

> On 15 Sep 2016, at 16:02, DuyHai Doan <doanduy...@gmail.com> wrote:
> 
> Ok so I've found the source of the issue, it's pretty well hidden because it 
> is NOT in the SASI source code directly.
> 
> Here is the method where C* determines what kind of LIKE expression you're 
> using (LIKE_PREFIX , LIKE CONTAINS or LIKE_MATCHES)
> 
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778
>  
> <https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778>
> 
> As you can see, it's pretty simple, maybe too simple. Indeed, they forget to 
> remove escape character BEFORE doing the matching so if your search is LIKE 
> '%%esc%', the detected expression is LIKE_CONTAINS.
> 
> A possible fix would be:
> 
> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on the 
> column data type)
> 2) remove the escape character e.g. before parsing OR use some advanced regex 
> to exclude the %% from parsing e.g
> 
> Step 2) is dead easy but step 1) is harder because I don't know if converting 
> the bytebuffer into String at this stage of the CQL parser is expensive or 
> not (in term of computation)
> 
> Let me try a patch  
> 
> 
> 
> On Wed, Sep 14, 2016 at 9:42 AM, DuyHai Doan <doanduy...@gmail.com 
> <mailto:doanduy...@gmail.com>> wrote:
> Ok you're right, I get your point
> 
> LIKE '%%esc%' --> startWith('%esc')
> 
> LIKE 'escape%%' -->  = 'escape%'
> 
> What I strongly suspect is that in the source code of SASI, we parse the % 
> xxx % expression BEFORE applying escape. That will explain the observed 
> behavior. E.g:
> 
> LIKE '%%esc%'  parsed as %xxx% where xxx = %esc
> 
> LIKE 'escape%%' parsed as xxx% where xxx =escape%
> 
> Let me check in the source code and try to reproduce the issue
> 
> 
> 
> On Tue, Sep 13, 2016 at 7:24 PM, Mikhail Krupitskiy 
> <mikhail.krupits...@jetbrains.com <mailto:mikhail.krupits...@jetbrains.com>> 
> wrote:
> Looks like we have different understanding of what results are expected.
> I based my understanding on 
> http://docs.datastax.com/en/cql/3.3/cql/cql_using/useSASIIndex.html 
> <http://docs.datastax.com/en/cql/3.3/cql/cql_using/useSASIIndex.html>
> According to the doc ‘esc’ is a pattern for exact match and I guess that 
> there is no semantical difference between two LIKE patterns (both of patterns 
> should be treated as ‘exact match'): ‘%%esc’ and ‘esc’.
> 
>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results 
>> containing '%esc' so %escapeme is a possible match and also escape%esc
> Why ‘containing’? I expect that it should be ’starting’..
>> 
>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results starting 
>> with 'escape%' so escape%me is a valid result and also escape%esc
> Why ’starting’? I expect that it should be ‘exact matching’.
> 
> Also I expect that “ LIKE ‘%s%sc%’ ” will return ‘escape%esc’ but it returns 
> nothing (CASSANDRA-12573).
> 
> What I’m missing?
> 
> Thanks,
> Mikhail
> 
>> On 13 Sep 2016, at 19:31, DuyHai Doan <doanduy...@gmail.com 
>> <mailto:doanduy...@gmail.com>> wrote:
>> 
>> CREATE CUSTOM INDEX ON test.escape(val) USING 'org.apache.cassandra.index.sa 
>> <http://org.apache.cassandra.index.sa/>si.SASIIndex' WITH OPTIONS = {'mode': 
>> 'CONTAINS', 'analyzer_class': 'org.apache.cassandra.index.sa 
>> <http://org.apache.cassandra.index.sa/>si.analyzer.NonTokenizingAnalyzer', 
>> 'case_sensitive': 'false'};
>> 
>> I don't see any problem in the results you got
>> 
>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results 
>> containing '%esc' so %escapeme is a possible match and also escape%esc
> Why ‘containing’? I expect that it should be ’starting’..
>> 
>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results starting 
>> with 'escape%' so escape%me is a valid result and also escape%esc
> Why ’starting’? I expect that it should be ‘exact matching’.
> 
>> 
>> On Tue, Sep 13, 2016 at 5:58 PM, Mikhail Krupitskiy 
>> <mikhail.krupits...@jetbrains.com <mailto:mikhail.krupits...@jetbrains.com>> 
>> wrote:
>> Thanks for the reply.
>> Could you please provide what index definition did you use?
>> With the index from my script I get the following results:
>> 
>> cqlsh:test> select * from escape;
>> 
>>  id | val
>> ----+-----------
>>   1 | %escapeme
>>   2 | escape%me
>>   3 | escape%esc
>> 
>> Contains search
>> 
>> cqlsh:test> SELECT * FROM escape WHERE val LIKE '%%esc%';
>> 
>>  id | val
>> ----+-----------
>>   1 | %escapeme
>>   3 | escape%esc
>> (2 rows)
>> 
>> 
>> Prefix search
>> 
>> cqlsh:test> SELECT * FROM escape WHERE val LIKE 'escape%%';
>> 
>>  id | val
>> ----+-----------
>>   2 | escape%me
>>   3 | escape%esc
>> 
>> Thanks,
>> Mikhail 
>> 
>>> On 13 Sep 2016, at 18:16, DuyHai Doan <doanduy...@gmail.com 
>>> <mailto:doanduy...@gmail.com>> wrote:
>>> 
>>> Use % to escape %
>>> 
>>> cqlsh:test> select * from escape;
>>> 
>>>  id | val
>>> ----+-----------
>>>   1 | %escapeme
>>>   2 | escape%me
>>> 
>>> 
>>> Contains search
>>> 
>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE '%%esc%';
>>> 
>>>  id | val
>>> ----+-----------
>>>   1 | %escapeme
>>> 
>>> (1 rows)
>>> 
>>> 
>>> Prefix search
>>> 
>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE 'escape%%';
>>> 
>>>  id | val
>>> ----+-----------
>>>   2 | escape%me
>>> 
>>> On Tue, Sep 13, 2016 at 5:06 PM, Mikhail Krupitskiy 
>>> <mikhail.krupits...@jetbrains.com 
>>> <mailto:mikhail.krupits...@jetbrains.com>> wrote:
>>> Hi Cassandra guys,
>>> 
>>> I use Cassandra 3.7 and wondering how to use ‘%’ as a simple char in a 
>>> search pattern.
>>> Here is my test script:
>>> 
>>> DROP keyspace if exists kmv;
>>> CREATE keyspace if not exists kmv WITH REPLICATION = { 'class' : 
>>> 'SimpleStrategy', 'replication_factor':'1'} ;
>>> USE kmv;
>>> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY KEY(id, 
>>> c1));
>>> CREATE CUSTOM INDEX ON kmv.kmv  ( c2 ) USING 'org.apache.cassandra.index.sa 
>>> <http://org.apache.cassandra.index.sa/>si.SASIIndex' WITH OPTIONS = {
>>> 'analyzed' : 'true',
>>> 'analyzer_class' : 'org.apache.cassandra.index.sa 
>>> <http://org.apache.cassandra.index.sa/>si.analyzer.NonTokenizingAnalyzer',
>>> 'case_sensitive' : 'false',
>>> 'mode' : 'CONTAINS'
>>> };
>>> 
>>> INSERT into kmv (id, c1, c2) values (1, 'f22', 'qwe%asd');
>>> INSERT into kmv (id, c1, c2) values (2, 'f22', '%asd');
>>> INSERT into kmv (id, c1, c2) values (3, 'f22', 'asd%');
>>> INSERT into kmv (id, c1, c2) values (4, 'f22', 'asd%1');
>>> INSERT into kmv (id, c1, c2) values (5, 'f22', 'qweasd');
>>> 
>>> SELECT c2 from kmv.kmv where c2 like ‘_pattern_';
>>> 
>>> _pattern_ '%%%' finds all columns that contain %.
>>> How to find columns that start form ‘%’ or ‘%a’?
>>> How to find columns that end with ‘%’?
>>> What about more complex patterns: '%qwe%a%sd%’? How to differentiate ‘%’ 
>>> char form % as a command symbol? (Also there is a related issue 
>>> CASSANDRA-12573).
>>> 
>>> 
>>> Thanks,
>>> Mikhail
>>> 
>> 
>> 
> 
> 
> 

Reply via email to