[ https://issues.apache.org/jira/browse/CASSANDRA-11122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134183#comment-15134183 ]
Patrick GUILLEBERT commented on CASSANDRA-11122:
------------------------------------------------

Very strange. I did a fresh build and couldn't reproduce it. After deleting all data and restarting, I could reproduce the bug too.

> SASI does not find term when indexing non-ascii character
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-11122
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-11122
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CQL
>         Environment: Cassandra 3.4 SNAPSHOT
>            Reporter: DOAN DuyHai
>         Attachments: CASSANDRA-11122.patch
>
>
> I built the snapshot version taken from here:
> https://github.com/xedin/cassandra/tree/CASSANDRA-11067
> I created a tiny musical dataset with non-ascii characters (*cyrillic*
> actually) and created a SASI index on the artist name.
> SASI can find rows for the cyrillic name but strangely fails to index a normal
> ascii name (_'Object'_).
> {code:sql}
> CREATE KEYSPACE music WITH replication = {'class': 'SimpleStrategy',
>     'replication_factor': '1'} AND durable_writes = true;
> CREATE TABLE music.albums (
>     title text PRIMARY KEY,
>     artist text
> );
> INSERT INTO music.albums(artist,title) VALUES('Object','The Reflecting Skin');
> INSERT INTO music.albums(artist,title) VALUES('Hayden','Mild and Hazy');
> INSERT INTO music.albums(artist,title) VALUES('Самое Большое Простое Число','СБПЧ Оркестр');
> CREATE custom INDEX on music.albums(artist) USING
>     'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>     'analyzer_class':
>     'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
>     'case_sensitive': 'false'};
> SELECT * FROM music.albums;
>  title               | artist
> ---------------------+-----------------------------
>  The Reflecting Skin | Object
>  Mild and Hazy       | Hayden
>  СБПЧ Оркестр        | Самое Большое Простое Число
> (3 rows)
> SELECT * FROM music.albums WHERE artist='Самое Большое Простое Число';
>  title               | artist
> ---------------------+-----------------------------
>  СБПЧ Оркестр        | Самое Большое Простое Число
> (1 rows)
> SELECT * FROM music.albums WHERE artist='Hayden';
>  title               | artist
> ---------------------+-----------------------------
>  Mild and Hazy       | Hayden
> (1 rows)
> SELECT * FROM music.albums WHERE artist='Object';
>  title               | artist
> ---------------------+-----------------------------
> (0 rows)
> SELECT * FROM music.albums WHERE artist like 'Ob%';
>  title               | artist
> ---------------------+-----------------------------
> (0 rows)
> {code}
> Strangely enough, after cleaning all the data and re-inserting without the
> russian artist with the cyrillic name, SASI does find _'Object'_ ...
> {code:sql}
> DROP INDEX albums_artist_idx;
> TRUNCATE TABLE albums;
> INSERT INTO albums(artist,title) VALUES('Object','The Reflecting Skin');
> INSERT INTO albums(artist,title) VALUES('Hayden','Mild and Hazy');
> CREATE custom INDEX on music.albums(artist) USING
>     'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {
>     'analyzer_class':
>     'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
>     'case_sensitive': 'false'};
> SELECT * FROM music.albums;
>  title               | artist
> ---------------------+-----------------------------
>  The Reflecting Skin | Object
>  Mild and Hazy       | Hayden
> (2 rows)
> SELECT * FROM music.albums WHERE artist='Object';
>  title               | artist
> ---------------------+-----------------------------
>  The Reflecting Skin | Object
> (1 rows)
> SELECT * FROM music.albums WHERE artist LIKE 'Ob%';
>  title               | artist
> ---------------------+-----------------------------
>  The Reflecting Skin | Object
> (1 rows)
> {code}
> The behaviour is quite inconsistent.
> I can understand that SASI refuses to index cyrillic characters or issues an
> exception when encountering non-ascii characters (because we did not specify
> the locale), but it's very surprising that the indexing fails for normal
> ascii characters like _Object_.
> Could it be that SASI starts indexing the artist name by following the
> albums table's token range order (hash of title) and stops indexing after
> encountering the cyrillic name?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)