[ https://issues.apache.org/jira/browse/CASSANDRA-13379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15942849#comment-15942849 ]
Alex Petrov commented on CASSANDRA-13379:
-----------------------------------------

Thank you for reporting. I suspect this is a paging issue. Could you please check if the duplicates are on the page boundaries? For that you could run {{paging 10}} in cqlsh and then try running your request again:

{code}
select * from bulks_recipients where bulk_id_idx = fedd95ec-2cc8-4040-8619-baf69647700b;
{code}

Then skipping through the pages. If the last entry of each page is the same as the first entry of the next page, that'd be it.

Thank you for assistance.

> SASI index returns duplicate rows
> ---------------------------------
>
>                 Key: CASSANDRA-13379
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13379
>             Project: Cassandra
>          Issue Type: Bug
>          Components: sasi
>            Reporter: Igor Novgorodov
>
> {code}
> CREATE TABLE bulks_recipients (
>     bulk_id uuid,
>     recipient text,
>     bulk_id_idx uuid,
>     PRIMARY KEY ((bulk_id, recipient))
> )
> {code}
> *bulk_id_idx* is just a copy of *bulk_id* because SASI does not work on
> partition key component at all for some reason.
> {code}
> CREATE CUSTOM INDEX bulks_recipients_bulk_id ON bulks_recipients
> (bulk_id_idx) USING 'org.apache.cassandra.index.sasi.SASIIndex';
> {code}
> Then i insert 1 million rows with the same *bulk_id* and different
> *recipient*. Then
> {code}
> > select count(*) from bulks_recipients ;
>  count
> ---------
>  1000000
> (1 rows)
> {code}
> Ok, it's fine here. Now let's query by SASI:
> {code}
> > select count(*) from bulks_recipients where bulk_id_idx = fedd95ec-2cc8-4040-8619-baf69647700b;
>  count
> ---------
>  1010101
> (1 rows)
> {code}
> Hmm, very strange count - 10101 extra rows.
> Ok, i've dumped the query result into a text file:
> {code}
> # cat sasi.txt | wc -l
> 1000200
> {code}
> Here we have 200 extra rows for some reason.
> Let's check if these are duplicates:
> {code}
> # cat sasi.txt | sort | uniq | wc -l
> 1000000
> {code}
> Yep, looks like.
> Recreating index does not help.
> If i issue the very same query (against partition key *bulk_id*, not
> *bulk_id_idx*) - i get correct results.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
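
The counts in the report can be checked against the page-boundary hypothesis with a small simulation. The Python sketch below is illustrative only: {{rows_returned}} is a hypothetical helper, the inclusive restart at the last row of the previous page is an assumed failure model rather than confirmed SASI behavior, and the page sizes (100 for the cqlsh count, 5000 for the dump to a text file) are assumptions, not values stated in the report.

```python
def rows_returned(total_rows: int, page_size: int) -> int:
    """Rows a client would see if every page after the first started
    again at the last row of the previous page (assumed failure model)."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2 in this model")

    returned = 0
    start = 0  # index of the first row served on the current page
    while True:
        end = min(start + page_size, total_rows)  # exclusive upper bound
        page_len = end - start
        returned += page_len
        if page_len < page_size:
            break  # a short page marks the end of the result set
        start = end - 1  # assumed bug: the next page repeats the last row
    return returned


# Assumed page sizes: 100 (cqlsh) and 5000 (client used for the dump).
print(rows_returned(1_000_000, 100))
print(rows_returned(1_000_000, 5000))
```

Under these assumptions the model returns 1010101 and 1000200 rows for one million stored rows, which are the two figures quoted in the report. That agreement is consistent with one repeated row per page boundary, but it does not by itself confirm the cause.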