Tyler is correct, because Cassandra doesn't wait until repair writes
are acked before the answer is returned. This is something we can fix.

On Sun, Apr 17, 2011 at 12:05 AM, Sean Bridges <sean.brid...@gmail.com> wrote:
> Tyler, your answer seems to contradict this email by Jonathan Ellis
> [1].  In it Jonathan says,
>
> "The important guarantee this gives you is that once one quorum read
> sees the new value, all others will too.   You can't see the newest
> version, then see an older version on a subsequent write [sic, I
> assume he meant read], which is the characteristic of non-strong
> consistency"
>
> Jonathan also says,
>
> "{X, Y} and {X, Z} are equivalent: one node with the write, and one
> without. The read will recognize that X's version needs to be sent to
> Z, and the write will be complete.  This read and all subsequent ones
> will see the write.  (Z [sic, I assume he meant Y] will be replicated
> to asynchronously via read repair.)"
>
> To me, the statement "this read and all subsequent ones will see the
> write" implies that the new value must be committed to Y or Z before
> the read can return.  If not, the statement must be false.
>
> Sean
>
>
> [1] : 
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201102.mbox/%3caanlktimegp8h87mgs_bxzknck-a59whxf-xx58hca...@mail.gmail.com%3E
>
> Sean
>
> On Sat, Apr 16, 2011 at 7:44 PM, Tyler Hobbs <ty...@datastax.com> wrote:
>> Here's what's probably happening:
>>
>> I'm assuming RF=3 and QUORUM writes/reads here.  I'll call the replicas A,
>> B, and C.
>>
>> 1.  Writer process writes sequence number 1 and everything works fine.  A,
>> B, and C all have sequence number 1.
>> 2.  Writer process writes sequence number 2.  Replica A writes successfully,
>> B and C fail to respond in time, and a TimedOutException is returned.
>> pycassa waits to retry the operation.
>> 3.  Reader process reads, gets a response from A and B.  When the row from A
>> and B is merged, sequence number 2 is the newest and is returned.  A read
>> repair is pushed to B and C, but they don't yet update their data.
>> 4.  Reader process reads again, gets a response from B and C (before they've
>> repaired).  These both report sequence number 1, so that's returned to the
>> client.  This is were you get a decreasing sequence number.
>> 5.  pycassa eventually retries the write; B and C eventually repair their
>> data.  Either way, both B and C shortly have sequence number 2.
>>
>> I've left out some of the details of read repair, and this scenario could
>> happen in several slightly different ways, but it should give you an idea of
>> what's happening.
>>
>> On Sat, Apr 16, 2011 at 8:35 PM, James Cipar <jci...@cmu.edu> wrote:
>>>
>>> Here it is.  There is some setup code and global variable definitions that
>>> I left out of the previous code, but they are pretty similar to the setup
>>> code here.
>>>     import pycassa
>>>     import random
>>>     import time
>>>     consistency_level = pycassa.cassandra.ttypes.ConsistencyLevel.QUORUM
>>>     duration = 600
>>>     sleeptime = 0.0
>>>     hostlist = 'worker-hostlist'
>>>     def read_servers(fn):
>>>         f = open(fn)
>>>         servers = []
>>>         for line in f:
>>>             servers.append(line.strip())
>>>         f.close()
>>>         return servers
>>>     servers = read_servers(hostlist)
>>>     start_time = time.time()
>>>     seqnum = -1
>>>     timestamp = 0
>>>     while time.time() < start_time + duration:
>>>         target_server = random.sample(servers, 1)[0]
>>>         target_server = '%s:9160'%target_server
>>>         try:
>>>             pool = pycassa.connect('Keyspace1', [target_server])
>>>             cf = pycassa.ColumnFamily(pool, 'Standard1')
>>>             row = cf.get('foo', read_consistency_level=consistency_level)
>>>             pool.dispose()
>>>         except:
>>>             time.sleep(sleeptime)
>>>             continue
>>>         sq = int(row['seqnum'])
>>>         ts = float(row['timestamp'])
>>>         if sq < seqnum:
>>>             print 'Row changed: %i %f -> %i %f'%(seqnum, timestamp, sq,
>>> ts)
>>>         seqnum = sq
>>>         timestamp = ts
>>>         if sleeptime > 0.0:
>>>             time.sleep(sleeptime)
>>>
>>>
>>>
>>> On Apr 16, 2011, at 5:20 PM, Tyler Hobbs wrote:
>>>
>>> James,
>>>
>>> Would you mind sharing your reader process code as well?
>>>
>>> On Fri, Apr 15, 2011 at 1:14 PM, James Cipar <jci...@cmu.edu> wrote:
>>>>
>>>> I've been experimenting with the consistency model of Cassandra, and I
>>>> found something that seems a bit unexpected.  In my experiment, I have 2
>>>> processes, a reader and a writer, each accessing a Cassandra cluster with a
>>>> replication factor greater than 1.  In addition, sometimes I generate
>>>> background traffic to simulate a busy cluster by uploading a large data 
>>>> file
>>>> to another table.
>>>>
>>>> The writer executes a loop where it writes a single row that contains
>>>> just an sequentially increasing sequence number and a timestamp.  In python
>>>> this looks something like:
>>>>
>>>>    while time.time() < start_time + duration:
>>>>        target_server = random.sample(servers, 1)[0]
>>>>        target_server = '%s:9160'%target_server
>>>>
>>>>        row = {'seqnum':str(seqnum), 'timestamp':str(time.time())}
>>>>        seqnum += 1
>>>>        # print 'uploading to server %s, %s'%(target_server, row)
>>>>
>>>>        pool = pycassa.connect('Keyspace1', [target_server])
>>>>        cf = pycassa.ColumnFamily(pool, 'Standard1')
>>>>        cf.insert('foo', row, write_consistency_level=consistency_level)
>>>>        pool.dispose()
>>>>
>>>>        if sleeptime > 0.0:
>>>>            time.sleep(sleeptime)
>>>>
>>>>
>>>> The reader simply executes a loop reading this row and reporting whenever
>>>> a sequence number is *less* than the previous sequence number.  As 
>>>> expected,
>>>> with consistency_level=ConsistencyLevel.ONE there are many inconsistencies,
>>>> especially with a high replication factor.
>>>>
>>>> What is unexpected is that I still detect inconsistencies when it is set
>>>> at ConsistencyLevel.QUORUM.  This is unexpected because the documentation
>>>> seems to imply that QUORUM will give consistent results.  With background
>>>> traffic the average difference in timestamps was 0.6s, and the maximum was
>>>> >3.5s.  This means that a client sees a version of the row, and can
>>>> subsequently see another version of the row that is 3.5s older than the
>>>> previous.
>>>>
>>>> What I imagine is happening is this, but I'd like someone who knows that
>>>> they're talking about to tell me if it's actually the case:
>>>>
>>>> I think Cassandra is not using an atomic commit protocol to commit to the
>>>> quorum of servers chosen when the write is made.  This means that at some
>>>> point in the middle of the write, some subset of the quorum have seen the
>>>> write, while others have not.  At this time, there is a quorum of servers
>>>> that have not seen the update, so depending on which quorum the client 
>>>> reads
>>>> from, it may or may not see the update.
>>>>
>>>> Of course, I understand that the client is not *choosing* a bad quorum to
>>>> read from, it is just the first `q` servers to respond, but in this case it
>>>> is effectively random and sometimes an bad quorum is "chosen".
>>>>
>>>> Does anyone have any other insight into what is going on here?
>>>
>>>
>>> --
>>> Tyler Hobbs
>>> Software Engineer, DataStax
>>> Maintainer of the pycassa Cassandra Python client library
>>>
>>>
>>
>>
>>
>> --
>> Tyler Hobbs
>> Software Engineer, DataStax
>> Maintainer of the pycassa Cassandra Python client library
>>
>>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Reply via email to