[ https://issues.apache.org/jira/browse/CASSANDRA-13379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Igor Novgorodov updated CASSANDRA-13379:
----------------------------------------
    Description: 
{code}
CREATE TABLE bulks_recipients (
    bulk_id uuid,
    recipient text,
    bulk_id_idx uuid,
    PRIMARY KEY ((bulk_id, recipient))
)
{code}

*bulk_id_idx* is just a copy of *bulk_id*, because SASI does not work on a partition key component at all for some reason.

{code}
CREATE CUSTOM INDEX bulks_recipients_bulk_id ON bulks_recipients (bulk_id_idx) USING 'org.apache.cassandra.index.sasi.SASIIndex';
{code}

Then I insert 1 million rows with the same *bulk_id* and different *recipient*. Then

{code}
> select count(*) from bulks_recipients ;

 count
---------
 1000000

(1 rows)
{code}

Ok, it's fine here. Now let's query by SASI:

{code}
> select count(*) from bulks_recipients where bulk_id_idx = fedd95ec-2cc8-4040-8619-baf69647700b;

 count
---------
 1010101

(1 rows)
{code}

Hmm, a very strange count: 10101 extra rows.
Ok, I've dumped the query result into a text file:

{code}
# cat sasi.txt | wc -l
1000200
{code}

Here we have 200 extra rows for some reason.
Let's check if these are duplicates:

{code}
# cat sasi.txt | sort | uniq | wc -l
1000000
{code}

Yep, looks like they are.
Recreating the index does not help. If I drop the index and issue the very same query (against the partition key *bulk_id*, not *bulk_id_idx*), I get correct results.

  was:
{code}
CREATE TABLE bulks_recipients (
    bulk_id uuid,
    recipient text,
    bulk_id_idx uuid,
    status int,
    ts timestamp,
    PRIMARY KEY ((bulk_id, recipient))
)
{code}

*bulk_id_idx* is just a copy of *bulk_id*, because SASI does not work on a partition key component at all for some reason.

{code}
CREATE CUSTOM INDEX bulks_recipients_bulk_id ON bulks_recipients (bulk_id_idx) USING 'org.apache.cassandra.index.sasi.SASIIndex';
{code}

Then I insert 1 million rows with the same *bulk_id* and different *recipient*. Then

{code}
> select count(*) from bulks_recipients ;

 count
---------
 1000000

(1 rows)
{code}

Ok, it's fine here. Now let's query by SASI:

{code}
> select count(*) from bulks_recipients where bulk_id_idx = fedd95ec-2cc8-4040-8619-baf69647700b;

 count
---------
 1010101

(1 rows)
{code}

Hmm, a very strange count: 10101 extra rows.
Ok, I've dumped the query result into a text file:

{code}
# cat sasi.txt | wc -l
1000200
{code}

Here we have 200 extra rows for some reason.
Let's check if these are duplicates:

{code}
# cat sasi.txt | sort | uniq | wc -l
1000000
{code}

Yep, looks like they are.
Recreating the index does not help.


> SASI index returns duplicate rows
> ---------------------------------
>
>                 Key: CASSANDRA-13379
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13379
>             Project: Cassandra
>          Issue Type: Bug
>          Components: sasi
>            Reporter: Igor Novgorodov
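
The duplicate check in the description only shows that duplicates exist. A minimal shell sketch for seeing which rows are repeated and how many surplus lines there are might look like the following. The helper names `list_duplicates` and `count_extra` are hypothetical, and the sketch assumes the dump has one row per line, as `sasi.txt` appears to.

```shell
#!/bin/sh

# list_duplicates FILE (hypothetical helper):
# print every line that occurs more than once, prefixed by its occurrence count.
list_duplicates() {
    sort "$1" | uniq -c | awk '$1 > 1'
}

# count_extra FILE (hypothetical helper):
# print the number of surplus lines, i.e. total lines minus distinct lines.
count_extra() {
    total=$(wc -l < "$1")
    distinct=$(sort -u "$1" | wc -l)
    echo $((total - distinct))
}
```

Running `list_duplicates sasi.txt | head` would show the first few repeated rows, which may help tell whether the duplicates cluster around particular *recipient* values. Given the line counts reported above (1000200 total, 1000000 distinct), `count_extra sasi.txt` should print 200.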