[ https://issues.apache.org/jira/browse/CASSANDRA-14247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379642#comment-16379642 ]
Michael Kjellman commented on CASSANDRA-14247:
----------------------------------------------

1) I think it would be better if we used a "," or " " for the default delimiter.

2) I think it would be better if we did the work inside the iterator itself vs. using the split() function on the entire contents of the string in reset(). If we can do it iteratively, we can potentially reuse buffers and just go character by character until we hit the delimiter, instead of needing to process the whole thing, no? Or did you benchmark this and find that, even with potentially large strings, there wasn't a win?

3) When you hit a MarshalException you're logging the whole value. If the value is a 30MB text blob, the logger would get slammed, so I'm not sure logging the entire thing by default is ideal. Thoughts?

> SASI tokenizer for simple delimiter based entries
> -------------------------------------------------
>
>                 Key: CASSANDRA-14247
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14247
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: sasi
>            Reporter: mck
>            Assignee: mck
>            Priority: Major
>             Fix For: 4.0, 3.11.x
>
>
> Currently SASI offers only two tokenizer options:
>  - NonTokenizerAnalyser
>  - StandardAnalyzer
> The latter is built upon Snowball, powerful for human languages but overkill
> for simple tokenization.
> A simple tokenizer is proposed here. The need for this arose as a workaround
> of CASSANDRA-11182, and to avoid the disk usage explosion when having to
> resort to {{CONTAINS}}.
> See https://github.com/openzipkin/zipkin/issues/1861
> Example use of this would be:
> {code}
> CREATE CUSTOM INDEX span_annotation_query_idx
>     ON zipkin2.span (annotation_query) USING 'org.apache.cassandra.index.sasi.SASIIndex'
>     WITH OPTIONS = {
>         'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.DelimiterAnalyzer',
>         'delimiter': '░',
>         'case_sensitive': 'true',
>         'mode': 'prefix',
>         'analyzed': 'true'};
> {code}
> Original credit for this work goes to https://github.com/zuochangan
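
For illustration only, points 2) and 3) of the comment could be sketched roughly as below. This is not the code from the patch: the class name DelimiterTokenIterator and the helper truncateForLog are hypothetical, and the sketch assumes a single-character delimiter, already-decoded character input (rather than the ByteBuffers the SASI analyzers work with), and that empty tokens are skipped. It walks the input one character at a time and hands out read-only views instead of calling split() on the whole string up front.

```java
import java.nio.CharBuffer;
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Hypothetical sketch of lazy, delimiter-based tokenization; not the patch's DelimiterAnalyzer. */
final class DelimiterTokenIterator implements Iterator<CharSequence> {
    private final CharSequence input;
    private final char delimiter;
    private int pos = 0;        // index of the first character not yet scanned
    private CharSequence next;  // token located by advance(), or null when exhausted

    DelimiterTokenIterator(CharSequence input, char delimiter) {
        this.input = input;
        this.delimiter = delimiter;
        advance();
    }

    /** Scans forward character by character until the next non-empty token is found. */
    private void advance() {
        next = null;
        final int len = input.length();
        while (pos < len && next == null) {
            final int start = pos;
            while (pos < len && input.charAt(pos) != delimiter) {
                pos++;
            }
            final int end = pos;
            if (pos < len) {
                pos++; // step over the delimiter itself
            }
            if (end > start) {
                // Read-only view over the input: no characters are copied here.
                next = CharBuffer.wrap(input, start, end);
            }
            // Assumption: empty tokens (adjacent delimiters) are skipped.
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public CharSequence next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        final CharSequence token = next;
        advance();
        return token;
    }

    /**
     * Hypothetical helper for point 3): caps how much of a value reaches the logger.
     * Truncation counts UTF-16 chars, so it may cut a surrogate pair in half.
     */
    static String truncateForLog(CharSequence value, int maxChars) {
        if (value.length() <= maxChars) {
            return value.toString();
        }
        return value.subSequence(0, maxChars) + "... (" + value.length() + " chars total)";
    }
}
```

With this sketch, iterating over "a,,bc," with ',' as the delimiter yields the tokens "a" and "bc", and truncateForLog("abcdefghij", 4) returns "abcd... (10 chars total)". Whether an iterative approach actually beats split() on large values would still need the benchmark asked about in point 2).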