[jira] [Updated] (CASSANDRA-13379) SASI index returns duplicate rows

Igor Novgorodov (JIRA) Mon, 27 Mar 2017 00:53:02 -0700

     [ 
https://issues.apache.org/jira/browse/CASSANDRA-13379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Igor Novgorodov updated CASSANDRA-13379:
----------------------------------------
    Description: 
{code}
CREATE TABLE bulks_recipients (
    bulk_id uuid,
    recipient text,
    bulk_id_idx uuid,
    PRIMARY KEY ((bulk_id, recipient))
);
{code}

*bulk_id_idx* is just a copy of *bulk_id*, because SASI does not work on a 
partition key component at all for some reason.

{code}
CREATE CUSTOM INDEX bulks_recipients_bulk_id ON bulks_recipients (bulk_id_idx) 
USING 'org.apache.cassandra.index.sasi.SASIIndex';
{code}

Then I insert 1 million rows with the same *bulk_id* and different *recipient* 
values. Then:

{code}
> select count(*) from bulks_recipients ;

 count
---------
 1000000

(1 rows)
{code}

OK, that's fine. Now let's query via the SASI index:
{code}
> select count(*) from bulks_recipients where bulk_id_idx = fedd95ec-2cc8-4040-8619-baf69647700b;

 count
---------
 1010101

(1 rows)
{code}
Hmm, a very strange count: 10101 extra rows.
OK, I've dumped the query result into a text file:
{code}
# cat sasi.txt | wc -l
1000200
{code}
Here we have 200 extra rows for some reason.

Let's check if these are duplicates:
{code}
# cat sasi.txt | sort | uniq | wc -l
1000000
{code}
Yep, looks like they are.
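
To see which rows are repeated rather than only how many, the same sort/uniq approach can be extended. This is a minimal sketch, assuming the dump holds one row per line as in *sasi.txt* above; the function names *list_duplicates* and *count_extra* are made up for illustration.

```shell
# Print each line that occurs more than once in the file given as $1,
# prefixed by its occurrence count, most frequent first.
list_duplicates() {
    sort "$1" | uniq -c | awk '$1 > 1' | sort -rn
}

# Print the number of redundant lines in the file given as $1:
# total line count minus distinct line count.
count_extra() {
    total=$(wc -l < "$1")
    distinct=$(sort "$1" | uniq | wc -l)
    echo $((total - distinct))
}
```

Running *count_extra sasi.txt* on the dump above should print 200 (1000200 total lines minus 1000000 distinct lines), and *list_duplicates sasi.txt* shows which rows were returned more than once.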

Recreating the index does not help. If I drop the index and issue the very same 
query (against the partition key *bulk_id*, not *bulk_id_idx*), I get correct 
results.
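
For anyone trying to reproduce this, a data set of the shape described above can be generated roughly as follows. This is only a sketch and not necessarily how the rows were originally loaded: the function name *gen_inserts* and the *recipient-N* naming scheme are assumptions made for illustration.

```shell
# Emit N INSERT statements that share one bulk_id and differ only in
# recipient. $1 is the bulk_id (a UUID), $2 is the number of rows.
# The 'recipient-N' values are hypothetical; bulk_id_idx mirrors bulk_id,
# matching the schema above.
gen_inserts() {
    awk -v id="$1" -v n="$2" 'BEGIN {
        for (i = 1; i <= n; i++)
            printf "INSERT INTO bulks_recipients (bulk_id, recipient, bulk_id_idx) VALUES (%s, '\''recipient-%d'\'', %s);\n", id, i, id
    }'
}
```

For example, *gen_inserts fedd95ec-2cc8-4040-8619-baf69647700b 1000000 > inserts.cql* writes the statements to a file, which can then be loaded with *cqlsh -k <keyspace> -f inserts.cql* (the keyspace is not named in this report). Loading one million single-row statements this way is slow, but it keeps the reproduction simple.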

  was:
{code}
CREATE TABLE bulks_recipients (
    bulk_id uuid,
    recipient text,
    bulk_id_idx uuid,
    status int,
    ts timestamp,
    PRIMARY KEY ((bulk_id, recipient))
);
{code}

*bulk_id_idx* is just a copy of *bulk_id*, because SASI does not work on a 
partition key component at all for some reason.

{code}
CREATE CUSTOM INDEX bulks_recipients_bulk_id ON bulks_recipients (bulk_id_idx) 
USING 'org.apache.cassandra.index.sasi.SASIIndex';
{code}

Then I insert 1 million rows with the same *bulk_id* and different *recipient* 
values. Then:

{code}
> select count(*) from bulks_recipients ;

 count
---------
 1000000

(1 rows)
{code}

OK, that's fine. Now let's query via the SASI index:
{code}
> select count(*) from bulks_recipients where bulk_id_idx = fedd95ec-2cc8-4040-8619-baf69647700b;

 count
---------
 1010101

(1 rows)
{code}
Hmm, a very strange count: 10101 extra rows.
OK, I've dumped the query result into a text file:
{code}
# cat sasi.txt | wc -l
1000200
{code}
Here we have 200 extra rows for some reason.

Let's check if these are duplicates:
{code}
# cat sasi.txt | sort | uniq | wc -l
1000000
{code}
Yep, looks like they are.

Recreating the index does not help.



> SASI index returns duplicate rows
> ---------------------------------
>
>                 Key: CASSANDRA-13379
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-13379
>             Project: Cassandra
>          Issue Type: Bug
>          Components: sasi
>            Reporter: Igor Novgorodov



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
