Please see my comments inline. Thanks, Mikhail > On 26 Sep 2016, at 17:07, DuyHai Doan <doanduy...@gmail.com> wrote: > > "In the current implementation (‘%’ could be a wildcard only at the start/end > of a term) I guess it should be ’ENDS with ‘%escape’ ‘." > > --> Yes in the current impl, it means ENDS WITH '%escape' but we want SASI to > understand the %% as an escape for % so the goal is that SASI understands > LIKE '%%escape' as EQUALS TO '%escape'. Am I correct ? I guess that the goal is to define a way to use ‘%’ as a simple char. LIKE '%escape' - ENDS WITH 'escape' LIKE '%%escape' - EQUALS TO '%escape’ LIKE '%%escape%' - STARTS WITH '%escape’
LIKE ‘%%%escape’ - undefined in general case LIKE ‘%%%escape’ - ENDS WITH “%escape” in a case when we know that a wildcard could be only at the start/end. > > "Moreover all terms that contains single ‘%’ somewhere in the middle should > cause an exception." > > --> Not necessarily, sometime people may want to search text pattern > containing the literal %. Imagine the text "this year the average income has > increase by 10%". People may want to search for "10%”. If someone wants to search for ’10%’ then he should escape the ‘%’ char: like “10%%”. > > > > "BUT may be it’s better to make escaping more universal to support a future > possible case where a wildcard could be placed in the middle of a term too?" > > --> I guess universal escaping for % is the cleaner and better solution. > However it may involve some complex regular expression. I'm not sure that > input.replaceAll("%%", "%") trick would work for any cases. As I wrote I don’t think that it’s possible to do universal escaping using ‘%’ as an escape char (a char to escape wildcard char to make it a simple char semantically) and as wildcard at the same time. I suggest to use “\” as an escape char. Also I don’t know enough about Cassandra’s internals to estimate how universal escaping will affect performance. It really looks like a better solution for Cassandra users. > > And we also need to define when we want to detect operation type > (LIKE_PREFIX, LIKE_SUFFIX, LIKE_CONTAINS, EQUAL) ? > > Should we detect operation type BEFORE escaping or AFTER escaping ? As I understand ‘escaping' will be done by users. So on DB level we get an already escaped string from a request and it’s possible to know which symbol is a wildcard and which is just a char. I guess that Cassandra should parse (unescape?) an incoming string to define wildcards positions and simple chars positions and then define an operation type. > > > > > > On Mon, Sep 26, 2016 at 3:54 PM, Mikhail Krupitskiy > <mikhail.krupits...@jetbrains.com <mailto:mikhail.krupits...@jetbrains.com>> > wrote: >> LIKE '%%%escape' --> EQUALS TO '%%escape' ??? > In the current implementation (‘%’ could be a wildcard only at the start/end > of a term) I guess it should be ’ENDS with ‘%escape’ ‘. > Moreover all terms that contains single ‘%’ somewhere in the middle should > cause an exception. > BUT may be it’s better to make escaping more universal to support a future > possible case where a wildcard could be placed in the middle of a term too? > > Thanks, > Mikhail >> On 24 Sep 2016, at 21:09, DuyHai Doan <doanduy...@gmail.com >> <mailto:doanduy...@gmail.com>> wrote: >> >> Reminder, right now, the % character is only interpreted as wildcard IF AND >> ONLY IF it is the first/last character of the searched term >> >> >> LIKE '%escape' --> ENDS WITH 'escape' >> >> If we use % to escape %, >> LIKE '%%escape' --> EQUALS TO '%escape' >> >> LIKE '%%%escape' --> EQUALS TO '%%escape' ??? >> >> >> >> >> On Fri, Sep 23, 2016 at 5:02 PM, Mikhail Krupitskiy >> <mikhail.krupits...@jetbrains.com <mailto:mikhail.krupits...@jetbrains.com>> >> wrote: >> Hi, Jim, >> >> What pattern should be used to search “ends with ‘%escape’ “ with your >> conception? >> >> Thanks, >> Mikhail >> >>> On 22 Sep 2016, at 18:51, Jim Ancona <j...@anconafamily.com >>> <mailto:j...@anconafamily.com>> wrote: >>> >>> To answer DuyHai's question without introducing new syntax, I'd suggest: >>>> LIKE '%%%escape' means STARTS WITH '%' AND ENDS WITH 'escape' >>> So the first two %'s are translated to a literal, non-wildcard % and the >>> third % is a wildcard because it's not doubled. >>> >>> Jim >>> >>> On Thu, Sep 22, 2016 at 11:40 AM, Mikhail Krupitskiy >>> <mikhail.krupits...@jetbrains.com >>> <mailto:mikhail.krupits...@jetbrains.com>> wrote: >>> I guess that it should be similar to how it is done in SQL for LIKE >>> patterns. >>> >>> You can introduce an escape character, e.g. ‘\’. >>> Examples: >>> ‘%’ - any string >>> ‘\%’ - equal to ‘%’ character >>> ‘\%foo%’ - starts from ‘%foo’ >>> ‘%%%escape’ - ends with ’escape’ >>> ‘\%%’ - starts from ‘%’ >>> ‘\\\%%’ - starts from ‘\%’ . >>> >>> What do you think? >>> >>> Thanks, >>> Mikhail >>>> On 22 Sep 2016, at 16:47, DuyHai Doan <doanduy...@gmail.com >>>> <mailto:doanduy...@gmail.com>> wrote: >>>> >>>> Hello Mikhail >>>> >>>> It's more complicated that it seems >>>> >>>> LIKE '%%escape' means EQUAL TO '%escape' >>>> >>>> LIKE '%escape' means ENDS WITH 'escape' >>>> >>>> What's about LIKE '%%%escape' ???? >>>> >>>> How should we treat this case ? Replace %% by % at the beginning of the >>>> searched term ?? >>>> >>>> >>>> >>>> On Thu, Sep 22, 2016 at 3:31 PM, Mikhail Krupitskiy >>>> <mikhail.krupits...@jetbrains.com >>>> <mailto:mikhail.krupits...@jetbrains.com>> wrote: >>>> Hi! >>>> >>>> We’ve talked about two items: >>>> 1) ‘%’ as a wildcard in the middle of LIKE pattern. >>>> 2) How to escape ‘%’ to be able to find strings with the ‘%’ char with >>>> help of LIKE. >>>> >>>> Item #1was resolved as CASSANDRA-12573. >>>> >>>> Regarding to item #2: you said the following: >>>>> A possible fix would be: >>>>> >>>>> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on >>>>> the column data type) >>>>> 2) remove the escape character e.g. before parsing OR use some advanced >>>>> regex to exclude the %% from parsing e.g >>>>> >>>>> Step 2) is dead easy but step 1) is harder because I don't know if >>>>> converting the bytebuffer into String at this stage of the CQL parser is >>>>> expensive or not (in term of computation) >>>>> >>>>> Let me try a patch >>>> >>>> So is there any update on this? >>>> >>>> Thanks, >>>> Mikhail >>>> >>>> >>>>> On 20 Sep 2016, at 18:38, Mikhail Krupitskiy >>>>> <mikhail.krupits...@jetbrains.com >>>>> <mailto:mikhail.krupits...@jetbrains.com>> wrote: >>>>> >>>>> Hi! >>>>> >>>>> Have you had a chance to try your patch or solve the issue in an other >>>>> way? >>>>> >>>>> Thanks, >>>>> Mikhail >>>>>> On 15 Sep 2016, at 16:02, DuyHai Doan <doanduy...@gmail.com >>>>>> <mailto:doanduy...@gmail.com>> wrote: >>>>>> >>>>>> Ok so I've found the source of the issue, it's pretty well hidden >>>>>> because it is NOT in the SASI source code directly. >>>>>> >>>>>> Here is the method where C* determines what kind of LIKE expression >>>>>> you're using (LIKE_PREFIX , LIKE CONTAINS or LIKE_MATCHES) >>>>>> >>>>>> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778 >>>>>> >>>>>> <https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/cql3/restrictions/SingleColumnRestriction.java#L733-L778> >>>>>> >>>>>> As you can see, it's pretty simple, maybe too simple. Indeed, they >>>>>> forget to remove escape character BEFORE doing the matching so if your >>>>>> search is LIKE '%%esc%', the detected expression is LIKE_CONTAINS. >>>>>> >>>>>> A possible fix would be: >>>>>> >>>>>> 1) convert the bytebuffer into plain String (UTF8 or ASCII, depending on >>>>>> the column data type) >>>>>> 2) remove the escape character e.g. before parsing OR use some advanced >>>>>> regex to exclude the %% from parsing e.g >>>>>> >>>>>> Step 2) is dead easy but step 1) is harder because I don't know if >>>>>> converting the bytebuffer into String at this stage of the CQL parser is >>>>>> expensive or not (in term of computation) >>>>>> >>>>>> Let me try a patch >>>>>> >>>>>> >>>>>> >>>>>> On Wed, Sep 14, 2016 at 9:42 AM, DuyHai Doan <doanduy...@gmail.com >>>>>> <mailto:doanduy...@gmail.com>> wrote: >>>>>> Ok you're right, I get your point >>>>>> >>>>>> LIKE '%%esc%' --> startWith('%esc') >>>>>> >>>>>> LIKE 'escape%%' --> = 'escape%' >>>>>> >>>>>> What I strongly suspect is that in the source code of SASI, we parse the >>>>>> % xxx % expression BEFORE applying escape. That will explain the >>>>>> observed behavior. E.g: >>>>>> >>>>>> LIKE '%%esc%' parsed as %xxx% where xxx = %esc >>>>>> >>>>>> LIKE 'escape%%' parsed as xxx% where xxx =escape% >>>>>> >>>>>> Let me check in the source code and try to reproduce the issue >>>>>> >>>>>> >>>>>> >>>>>> On Tue, Sep 13, 2016 at 7:24 PM, Mikhail Krupitskiy >>>>>> <mikhail.krupits...@jetbrains.com >>>>>> <mailto:mikhail.krupits...@jetbrains.com>> wrote: >>>>>> Looks like we have different understanding of what results are expected. >>>>>> I based my understanding on >>>>>> http://docs.datastax.com/en/cql/3.3/cql/cql_using/useSASIIndex.html >>>>>> <http://docs.datastax.com/en/cql/3.3/cql/cql_using/useSASIIndex.html> >>>>>> According to the doc ‘esc’ is a pattern for exact match and I guess that >>>>>> there is no semantical difference between two LIKE patterns (both of >>>>>> patterns should be treated as ‘exact match'): ‘%%esc’ and ‘esc’. >>>>>> >>>>>>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results >>>>>>> containing '%esc' so %escapeme is a possible match and also escape%esc >>>>>> Why ‘containing’? I expect that it should be ’starting’.. >>>>>>> >>>>>>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results >>>>>>> starting with 'escape%' so escape%me is a valid result and also >>>>>>> escape%esc >>>>>> Why ’starting’? I expect that it should be ‘exact matching’. >>>>>> >>>>>> Also I expect that “ LIKE ‘%s%sc%’ ” will return ‘escape%esc’ but it >>>>>> returns nothing (CASSANDRA-12573). >>>>>> >>>>>> What I’m missing? >>>>>> >>>>>> Thanks, >>>>>> Mikhail >>>>>> >>>>>>> On 13 Sep 2016, at 19:31, DuyHai Doan <doanduy...@gmail.com >>>>>>> <mailto:doanduy...@gmail.com>> wrote: >>>>>>> >>>>>>> CREATE CUSTOM INDEX ON test.escape(val) USING >>>>>>> 'org.apache.cassandra.index.sa >>>>>>> <http://org.apache.cassandra.index.sa/>si.SASIIndex' WITH OPTIONS = >>>>>>> {'mode': 'CONTAINS', 'analyzer_class': 'org.apache.cassandra.index.sa >>>>>>> <http://org.apache.cassandra.index.sa/>si.analyzer.NonTokenizingAnalyzer', >>>>>>> 'case_sensitive': 'false'}; >>>>>>> >>>>>>> I don't see any problem in the results you got >>>>>>> >>>>>>> SELECT * FROM escape WHERE val LIKE '%%esc%'; --> Give all results >>>>>>> containing '%esc' so %escapeme is a possible match and also escape%esc >>>>>> Why ‘containing’? I expect that it should be ’starting’.. >>>>>>> >>>>>>> SELECT * FROM escape WHERE val LIKE 'escape%%' --> Give all results >>>>>>> starting with 'escape%' so escape%me is a valid result and also >>>>>>> escape%esc >>>>>> Why ’starting’? I expect that it should be ‘exact matching’. >>>>>> >>>>>>> >>>>>>> On Tue, Sep 13, 2016 at 5:58 PM, Mikhail Krupitskiy >>>>>>> <mikhail.krupits...@jetbrains.com >>>>>>> <mailto:mikhail.krupits...@jetbrains.com>> wrote: >>>>>>> Thanks for the reply. >>>>>>> Could you please provide what index definition did you use? >>>>>>> With the index from my script I get the following results: >>>>>>> >>>>>>> cqlsh:test> select * from escape; >>>>>>> >>>>>>> id | val >>>>>>> ----+----------- >>>>>>> 1 | %escapeme >>>>>>> 2 | escape%me >>>>>>> 3 | escape%esc >>>>>>> >>>>>>> Contains search >>>>>>> >>>>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE '%%esc%'; >>>>>>> >>>>>>> id | val >>>>>>> ----+----------- >>>>>>> 1 | %escapeme >>>>>>> 3 | escape%esc >>>>>>> (2 rows) >>>>>>> >>>>>>> >>>>>>> Prefix search >>>>>>> >>>>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE 'escape%%'; >>>>>>> >>>>>>> id | val >>>>>>> ----+----------- >>>>>>> 2 | escape%me >>>>>>> 3 | escape%esc >>>>>>> >>>>>>> Thanks, >>>>>>> Mikhail >>>>>>> >>>>>>>> On 13 Sep 2016, at 18:16, DuyHai Doan <doanduy...@gmail.com >>>>>>>> <mailto:doanduy...@gmail.com>> wrote: >>>>>>>> >>>>>>>> Use % to escape % >>>>>>>> >>>>>>>> cqlsh:test> select * from escape; >>>>>>>> >>>>>>>> id | val >>>>>>>> ----+----------- >>>>>>>> 1 | %escapeme >>>>>>>> 2 | escape%me >>>>>>>> >>>>>>>> >>>>>>>> Contains search >>>>>>>> >>>>>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE '%%esc%'; >>>>>>>> >>>>>>>> id | val >>>>>>>> ----+----------- >>>>>>>> 1 | %escapeme >>>>>>>> >>>>>>>> (1 rows) >>>>>>>> >>>>>>>> >>>>>>>> Prefix search >>>>>>>> >>>>>>>> cqlsh:test> SELECT * FROM escape WHERE val LIKE 'escape%%'; >>>>>>>> >>>>>>>> id | val >>>>>>>> ----+----------- >>>>>>>> 2 | escape%me >>>>>>>> >>>>>>>> On Tue, Sep 13, 2016 at 5:06 PM, Mikhail Krupitskiy >>>>>>>> <mikhail.krupits...@jetbrains.com >>>>>>>> <mailto:mikhail.krupits...@jetbrains.com>> wrote: >>>>>>>> Hi Cassandra guys, >>>>>>>> >>>>>>>> I use Cassandra 3.7 and wondering how to use ‘%’ as a simple char in a >>>>>>>> search pattern. >>>>>>>> Here is my test script: >>>>>>>> >>>>>>>> DROP keyspace if exists kmv; >>>>>>>> CREATE keyspace if not exists kmv WITH REPLICATION = { 'class' : >>>>>>>> 'SimpleStrategy', 'replication_factor':'1'} ; >>>>>>>> USE kmv; >>>>>>>> CREATE TABLE if not exists kmv (id int, c1 text, c2 text, PRIMARY >>>>>>>> KEY(id, c1)); >>>>>>>> CREATE CUSTOM INDEX ON kmv.kmv ( c2 ) USING >>>>>>>> 'org.apache.cassandra.index.sa >>>>>>>> <http://org.apache.cassandra.index.sa/>si.SASIIndex' WITH OPTIONS = { >>>>>>>> 'analyzed' : 'true', >>>>>>>> 'analyzer_class' : 'org.apache.cassandra.index.sa >>>>>>>> <http://org.apache.cassandra.index.sa/>si.analyzer.NonTokenizingAnalyzer', >>>>>>>> 'case_sensitive' : 'false', >>>>>>>> 'mode' : 'CONTAINS' >>>>>>>> }; >>>>>>>> >>>>>>>> INSERT into kmv (id, c1, c2) values (1, 'f22', 'qwe%asd'); >>>>>>>> INSERT into kmv (id, c1, c2) values (2, 'f22', '%asd'); >>>>>>>> INSERT into kmv (id, c1, c2) values (3, 'f22', 'asd%'); >>>>>>>> INSERT into kmv (id, c1, c2) values (4, 'f22', 'asd%1'); >>>>>>>> INSERT into kmv (id, c1, c2) values (5, 'f22', 'qweasd'); >>>>>>>> >>>>>>>> SELECT c2 from kmv.kmv where c2 like ‘_pattern_'; >>>>>>>> >>>>>>>> _pattern_ '%%%' finds all columns that contain %. >>>>>>>> How to find columns that start form ‘%’ or ‘%a’? >>>>>>>> How to find columns that end with ‘%’? >>>>>>>> What about more complex patterns: '%qwe%a%sd%’? How to differentiate >>>>>>>> ‘%’ char form % as a command symbol? (Also there is a related issue >>>>>>>> CASSANDRA-12573). >>>>>>>> >>>>>>>> >>>>>>>> Thanks, >>>>>>>> Mikhail >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>> >>>> >>> >>> >> >> > >