[ https://issues.apache.org/jira/browse/CASSANDRA-17667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17571175#comment-17571175 ]
Brad Schoening edited comment on CASSANDRA-17667 at 7/26/22 3:52 AM:
---------------------------------------------------------------------

[~ahomoki], looking at this a little further, saferscanner.py, which inherits from the Python re.Scanner, is tokenizing the input. An example from [https://mail.python.org/pipermail/python-dev/2003-April/035075.html] shows how input is parsed:

{code:python}
import re

def s_ident(scanner, token): return token
def s_operator(scanner, token): return "op%s" % token
def s_float(scanner, token): return float(token)
def s_int(scanner, token): return int(token)

scanner = re.Scanner([
    (r"[a-zA-Z_]\w*", s_ident),
    (r"\d+\.\d*", s_float),
    (r"\d+", s_int),
    (r"=|\+|-|\*|/", s_operator),
    (r"\s+", None),
])

# sanity check
test('scanner.scan("sum = 3*foo + 312.50 + bar")',
     (['sum', 'op=', 3, 'op*', 'foo', 'op+', 312.5, 'op+', 'bar'], ''))
{code}

In pylexotron this is implemented as:

{code:python}
RuleSpecScanner = SaferScanner([
    (r'::=', lambda s, t: t),
    (r'\[[a-z0-9_]+\]=', lambda s, t: ('named_collector', t[1:-2])),
    (r'[a-z0-9_]+=', lambda s, t: ('named_symbol', t[:-1])),
    (r'/(\[\^?.[^]]*\]|[^/]|\\.)*/', lambda s, t: ('regex', t[1:-1].replace(r'\/', '/'))),
    (r'"([^"]|\\.)*"', lambda s, t: ('litstring', t)),
    (r'<[^>]*>', lambda s, t: ('reference', t[1:-1])),
    (r'\bJUNK\b', lambda s, t: ('junk', t)),
    (r'[@()|?*;]', lambda s, t: t),
    (r'\s+', None),
    (r'#[^\n]*', None),
], re.I | re.S | re.U)
{code}

r'\s+' is skipping whitespace; I'm uncertain what r'#[^\n]*' and r'\bJUNK\b' are doing. Adding comments could be helpful.

The Scanner flags used are re.I | re.S | re.U, for IGNORECASE, DOTALL and UNICODE, but it doesn't use re.M for MULTILINE. So either that could be added if it doesn't break anything (note that re.M only changes the meaning of ^ and $, and none of these patterns use anchors), or the tokenizer would have to emit start-comment and end-comment tokens.

There don't seem to be any unit tests for pylexotron or SaferScanner, however. That might be a good thing to add.
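For what it's worth, a tiny standalone scanner with just those two skip rules (a sketch using plain re.Scanner, not the actual SaferScanner, with a made-up "word" token rule) suggests that r'#[^\n]*' silently drops a '#' comment through end of line, the same way r'\s+' drops whitespace, since a rule whose action is None discards the matched text:

```python
import re

# Minimal re.Scanner with the same two "skip" rules as RuleSpecScanner.
# A rule whose action is None drops the matched text entirely.
scanner = re.Scanner([
    (r"[a-z_]\w*", lambda s, t: ("word", t)),  # hypothetical token rule
    (r"\s+", None),                            # skip whitespace
    (r"#[^\n]*", None),                        # skip '#' up to end of line
])

tokens, remainder = scanner.scan("foo # a comment\nbar")
print(tokens)     # [('word', 'foo'), ('word', 'bar')]
print(remainder)  # '' (everything consumed)
```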
There could be tests for each type of token: named_collector, named_symbol, regex, litstring, reference and junk.
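As a sketch of what such tests could look like, the same rule table can be rebuilt on plain re.Scanner (an assumption for illustration; the real tests would exercise cqlshlib's SaferScanner, which inherits re.Scanner's interface) and each token type asserted individually:

```python
import re

# Hypothetical standalone re-creation of RuleSpecScanner on re.Scanner,
# purely to sketch per-token-type unit tests.
rule_scanner = re.Scanner([
    (r'::=', lambda s, t: t),
    (r'\[[a-z0-9_]+\]=', lambda s, t: ('named_collector', t[1:-2])),
    (r'[a-z0-9_]+=', lambda s, t: ('named_symbol', t[:-1])),
    (r'/(\[\^?.[^]]*\]|[^/]|\\.)*/', lambda s, t: ('regex', t[1:-1].replace(r'\/', '/'))),
    (r'"([^"]|\\.)*"', lambda s, t: ('litstring', t)),
    (r'<[^>]*>', lambda s, t: ('reference', t[1:-1])),
    (r'\bJUNK\b', lambda s, t: ('junk', t)),
    (r'[@()|?*;]', lambda s, t: t),
    (r'\s+', None),
    (r'#[^\n]*', None),
], re.I | re.S | re.U)

def first_token(text):
    # Scan a snippet and return its first token, requiring full consumption.
    tokens, remainder = rule_scanner.scan(text)
    assert remainder == '', remainder
    return tokens[0]

# One assertion per token type the grammar produces:
assert first_token('::=') == '::='
assert first_token('[name]=') == ('named_collector', 'name')
assert first_token('name=') == ('named_symbol', 'name')
assert first_token('/a+b/') == ('regex', 'a+b')
assert first_token('"abc"') == ('litstring', '"abc"')
assert first_token('<ident>') == ('reference', 'ident')
assert first_token('JUNK') == ('junk', 'JUNK')
```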
> Text value containing "/*" interpreted as multiline comment in cqlsh
> --------------------------------------------------------------------
>
>                 Key: CASSANDRA-17667
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-17667
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CQL/Interpreter
>            Reporter: ANOOP THOMAS
>            Assignee: Attila Homoki
>            Priority: Normal
>             Fix For: 4.0.x, 4.1.x
>
> I use the CQLSH command line utility to load some DDLs. The version of the utility I use is this:
> {noformat}
> [cqlsh 6.0.0 | Cassandra 4.0.0.47 | CQL spec 3.4.5 | Native protocol v5]{noformat}
> Command that loads DDL.cql:
> {noformat}
> cqlsh -u username -p password cassandra.example.com 65503 --ssl -f DDL.cql
> {noformat}
> I have a line in the CQL script that breaks the syntax:
> {noformat}
> INSERT into tablename (key,columnname1,columnname2) VALUES ('keyName','value1','/value2/*/value3');{noformat}
> {{/*}} here is interpreted as the start of a multi-line comment. It used to work on older versions of cqlsh. The error I see looks like this:
> {noformat}
> SyntaxException: line 4:2 mismatched input 'Update' expecting ')' (...,'value1','/value2INSERT into tablename(INSERT into tablename (key,columnname1,columnname2)) VALUES ('[Update]-...) SyntaxException: line 1:0 no viable alternative at input '(' ([(]...)
> {noformat}
> Same behavior while running in interactive mode too. {{/*}} inside a CQL statement should not be interpreted as the start of a multi-line comment.