[jira] [Commented] (CASSANDRA-14247) SASI tokenizer for simple delimiter based entries

Michael Kjellman (JIRA) Tue, 27 Feb 2018 17:50:09 -0800

    [ https://issues.apache.org/jira/browse/CASSANDRA-14247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379642#comment-16379642 ]


Michael Kjellman commented on CASSANDRA-14247:
----------------------------------------------

1) I think it would be better if we used a "," or " " for the default delimiter

2) I think it would be better if we do the work inside the iterator itself vs. 
using the split() function on the entire contents of the string in reset(). If 
we can do it iteratively we can then potentially reuse buffers and just go 
character by character until we hit the delimiter vs. needing to process the 
whole thing, no? Or did you benchmark this and find even with potentially large 
strings there wasn't a win?

3) When you hit a MarshalException you're logging the whole value. If the 
value is a 30MB text blob, the logger would get slammed, so I'm not sure 
logging the entire thing by default is ideal. Thoughts?
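
To make points 2) and 3) concrete, here is a rough sketch of what I have in 
mind. It is not the code from the patch: the class and method names 
(DelimiterTokenSketch, TokenIterator, truncateForLog) are made up for 
illustration, and skipping empty tokens between consecutive delimiters is an 
assumption about the desired behaviour. The iterator scans character by 
character and hands out views over the input instead of calling split() up 
front, and the helper caps how much of a value ends up in a log message.

```java
import java.nio.CharBuffer;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch only; not the DelimiterAnalyzer from the patch.
final class DelimiterTokenSketch {

    /** Lazily yields the non-empty tokens of the input, one per next() call. */
    static final class TokenIterator implements Iterator<CharBuffer> {
        private final CharBuffer input;
        private final char delimiter;
        private int pos;          // index of the next character to examine
        private CharBuffer next;  // token found by hasNext(), not yet returned

        TokenIterator(CharSequence input, char delimiter) {
            this.input = CharBuffer.wrap(input);
            this.delimiter = delimiter;
        }

        @Override
        public boolean hasNext() {
            if (next != null) return true;
            int limit = input.limit();
            // Assumption: consecutive delimiters produce no empty tokens.
            while (pos < limit && input.get(pos) == delimiter) pos++;
            if (pos >= limit) return false;
            int start = pos;
            while (pos < limit && input.get(pos) != delimiter) pos++;
            // subSequence returns a view over the same characters; nothing is copied.
            next = input.subSequence(start, pos);
            return true;
        }

        @Override
        public CharBuffer next() {
            if (!hasNext()) throw new NoSuchElementException();
            CharBuffer token = next;
            next = null;
            return token;
        }
    }

    /** Convenience wrapper that drains the iterator into strings. */
    static List<String> tokens(CharSequence input, char delimiter) {
        List<String> out = new ArrayList<>();
        Iterator<CharBuffer> it = new TokenIterator(input, delimiter);
        while (it.hasNext()) out.add(it.next().toString());
        return out;
    }

    /** Caps a value at maxChars characters before it goes into a log message. */
    static String truncateForLog(CharSequence value, int maxChars) {
        if (value.length() <= maxChars) return value.toString();
        return value.subSequence(0, maxChars) + "... (" + value.length() + " chars total)";
    }
}
```

With something along these lines, reset() would only need to wrap the new 
value, and the MarshalException path could log truncateForLog(value, N) for 
some small N rather than the full blob. Whether this actually beats split() 
on large strings would still need a benchmark.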

> SASI tokenizer for simple delimiter based entries
> -------------------------------------------------
>
>                 Key: CASSANDRA-14247
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14247
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: sasi
>            Reporter: mck
>            Assignee: mck
>            Priority: Major
>             Fix For: 4.0, 3.11.x
>
>
> Currently SASI offers only two tokenizer options:
>  - NonTokenizingAnalyzer
>  - StandardAnalyzer
> The latter is built upon Snowball, powerful for human languages but overkill 
> for simple tokenization.
> A simple tokenizer is proposed here. The need for this arose as a workaround 
> of CASSANDRA-11182, and to avoid the disk usage explosion when having to 
> resort to {{CONTAINS}}. See https://github.com/openzipkin/zipkin/issues/1861
> Example use of this would be:
> {code}
> CREATE CUSTOM INDEX span_annotation_query_idx
>     ON zipkin2.span (annotation_query)
>     USING 'org.apache.cassandra.index.sasi.SASIIndex'
>     WITH OPTIONS = {
>         'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.DelimiterAnalyzer',
>         'delimiter': '░',
>         'case_sensitive': 'true',
>         'mode': 'prefix',
>         'analyzed': 'true'};
> {code}
> Original credit for this work goes to https://github.com/zuochangan



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

