[jira] Commented: (CASSANDRA-1042) ColumnFamilyRecordReader returns duplicate rows

Christophe Biocca (JIRA) Mon, 03 May 2010 09:39:18 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12863395#action_12863395
 ]


Christophe Biocca commented on CASSANDRA-1042:
----------------------------------------------

The basic issue is that the Thrift server returns rows sorted by the 
absolute value of their tokens, while ColumnFamilyRecordReader assumes they 
arrive in range-traversal order (that is, the smallest token greater than 
start_token comes first, and the greatest token smaller than or equal to 
end_token comes last). 
I don't know which behavior is correct, since the API docs I've looked at 
don't say which order is supposed to be returned. If the server's 
implementation is correct, then the record reader needs to iterate over the 
returned tokens to figure out which one is actually the last token for 
iteration purposes. 
Otherwise, switching the server's implementation to return keys in 
iteration order will work.
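
As a rough illustration of the first option (fixing the reader side), here is a 
minimal Python sketch of picking the last token in traversal order from a batch 
that arrives sorted by absolute value. The helper names ring_key and 
last_in_ring_order are hypothetical and are not part of Cassandra; the tokens 
are the ones from the issue description.

```python
# Hypothetical helpers, not Cassandra code: find the token that comes last
# when walking the ring forward from start_token.

def ring_key(token, start_token):
    # Tokens greater than start_token are visited first (False sorts before
    # True); tokens at or below start_token are visited after the wrap-around.
    return (token <= start_token, token)

def last_in_ring_order(batch, start_token):
    # The batch may arrive in any order; choose the token furthest along the ring.
    return max(batch, key=lambda t: ring_key(t, start_token))

START = 53193025635115934196771903670925341736
BATCH = [  # as returned by the server, sorted by absolute value
    16955237001963240173058271559858726497,
    40670782773005619916245995581909898190,
    99079589977253916124855502156832923443,
    144992942750327304334463589818972416113,
    166860289390734216023086131251507064403,
]

naive_last = BATCH[-1]                         # what the reader currently assumes
ring_last = last_in_ring_order(BATCH, START)   # last token in traversal order
```

Here naive_last is 166860289390734216023086131251507064403, while ring_last is 
40670782773005619916245995581909898190, the token just before the wrap-around 
returns to the start token.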

> ColumnFamilyRecordReader returns duplicate rows
> -----------------------------------------------
>
>                 Key: CASSANDRA-1042
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-1042
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.6
>            Reporter: Joost Ouwerkerk
>             Fix For: 0.6.2
>
>
> There's a bug in ColumnFamilyRecordReader that appears when processing a 
> single split (which happens in most tests that have a small number of rows), 
> and potentially in other cases.  When the start and end tokens of the split 
> are equal, duplicate rows can be returned.
> Example with 5 rows:
> token (start and end) = 53193025635115934196771903670925341736
> Tokens returned by first get_range_slices iteration (all 5 rows):
>  16955237001963240173058271559858726497
>  40670782773005619916245995581909898190
>  99079589977253916124855502156832923443
>  144992942750327304334463589818972416113
>  166860289390734216023086131251507064403
> Tokens returned by next iteration (first token is last token from
> previous, end token is unchanged)
>  16955237001963240173058271559858726497
>  40670782773005619916245995581909898190
> Tokens returned by final iteration  (first token is last token from
> previous, end token is unchanged)
>  [] (empty)
> In this example, the mapper has processed 7 rows in total, 2 of which
> were duplicates.
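
The example above can be reproduced with a small simulation. This is only a 
sketch under assumptions: get_range_slices here is a hypothetical in-memory 
stand-in that returns the tokens in the ring range (start, end] sorted by 
absolute value, and read_split mimics a reader that always restarts from the 
last token of the previous batch. Neither is the actual Cassandra code.

```python
TOKENS = [
    16955237001963240173058271559858726497,
    40670782773005619916245995581909898190,
    99079589977253916124855502156832923443,
    144992942750327304334463589818972416113,
    166860289390734216023086131251507064403,
]
START = 53193025635115934196771903670925341736

def in_range(token, start, end):
    # Ring range (start, end]; when start >= end the range wraps around,
    # and start == end covers the whole ring.
    if start < end:
        return start < token <= end
    return token > start or token <= end

def get_range_slices(tokens, start, end, count):
    # Assumed server behavior: results sorted by absolute token value.
    return sorted(t for t in tokens if in_range(t, start, end))[:count]

def read_split(tokens, start, end, count=100):
    # Assumed reader behavior: the next start is the last token of the batch.
    batches = []
    while True:
        batch = get_range_slices(tokens, start, end, count)
        if not batch:
            return batches
        batches.append(batch)
        start = batch[-1]

batches = read_split(TOKENS, START, START)
rows = [t for batch in batches for t in batch]
```

With these inputs, batches holds two lists of 5 and 2 tokens, so rows contains 
7 entries of which only 5 are distinct, matching the counts reported above.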

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
