Hi John,

On 12 December 2011 19:35, John Laban <j...@pagerduty.com> wrote:
>
> So I responded to your algorithm in another part of this thread (very
> interesting) but this part of the paper caught my attention:
>
> > When client application code releases a lock, that lock must not
> actually be
> > released for a period equal to one millisecond plus twice the maximum
> possible
> > drift of the clocks in the client computers accessing the Cassandra
> databases
>
> I've been worried about this, and added some arbitrary delay in the
> releasing of my locks.  But I don't like it as it's (A) an arbitrary value
> and (B) it will - perhaps greatly - reduce the throughput of the more
> high-contention areas of my system.
>
> To fix (B) I'll probably just have to try to get rid of locks all together
> in these high-contention areas.
>
> To fix (A), I'd need to know what the maximum possible drift of my clocks
> will be.  How did you determine this?  What value do you use, out of
> curiosity?  What does the network layout of your client machines look like?
>  (Are any of your hosts geographically separated or all running in the same
> DC?  What's the maximum latency between hosts?  etc?)  Do you monitor the
> clock skew on an ongoing basis?  Am I worrying too much?
>

If you setup NTP carefully no machine should drift more than 4ms say. I
forget where, but you'll find the best documentation on how to make a
bullet-proof NTP setup on vendor sites for virtualization software (because
virtualization software can cause drift so NTP setup has to be just so)

What this means is that, for example, to be really safe when a thread
releases a lock you should wait say 9ms. Some points:-
-- since the sleep is performed before release, an isolated operation
should not be delayed at all
-- only a waiting thread or a thread requesting a lock immediately it is
released will be delayed, and no extra CPU or memory load is involved
-- in practice for the vast majority of "application layer" data operations
this restriction will have no effect on overall performance as experienced
by a user, because such operations nearly always read and write to data
with limited scope, for example the data of two users involved in some
transaction
-- the clocks issue does mean that you can't really serialize access to
more broadly shared data where more than 5 or 10 such requests are made a
second, say, but in reality even if the extra 9ms sleep on release wasn't
necessary, variability in database operation execution time (say under
load, or when something goes wrong) means trouble might occur serializing
with that level of contention

So in summary, although this drift thing seems bad at first, partly because
it is a new consideration, in practice it's no big deal so long as you look
after your clocks (and the main issue to watch out for is when application
nodes running on virtualization software, hypervisors et al have setup
issues that make their clocks drift under load, and it is a good idea to be
wary of that)

Best, Dominic


> Sorry for all the questions but I'm very concerned about this particular
> problem :)
>
> Thanks,
> John
>
>
> On Mon, Dec 12, 2011 at 4:36 AM, Dominic Williams <
> dwilli...@fightmymonster.com> wrote:
>
>> Hi guys, just thought I'd chip in...
>>
>> Fight My Monster is still using Cages, which is working fine, but...
>>
>> I'm looking at using Cassandra to replace Cages/ZooKeeper(!) There are 2
>> main reasons:-
>>
>> 1. Although a fast ZooKeeper cluster can handle a lot of load (we aren't
>> getting anywhere near to capacity and we do a *lot* of serialisation) at
>> some point it will be necessary to start hashing lock paths onto separate
>> ZooKeeper clusters, and I tend to believe that these days you should choose
>> platforms that handle sharding themselves (e.g. choose Cassandra rather
>> than MySQL)
>>
>> 2. Why have more components in your system when you can have less!!! KISS
>>
>> Recently I therefore tried to devise an algorithm which can be used to
>> add a distributed locking layer to clients such as Pelops, Hector, Pycassa
>> etc.
>>
>> There is a doc describing the algorithm, to which may be added an
>> appendix describing a protocol so that locking can be interoperable between
>> the clients. That could be extended to describe a protocol for
>> transactions. Word of warning this is a *beta* algorithm that has only been
>> seen by a select group so far, and therefore not even 100% sure it works
>> but there is a useful general discussion regarding serialization of
>> reads/writes so I include it anyway (and since this algorithm is going to
>> be out there now, if there's anyone out there who fancies doing a Z proof
>> or disproof, that would be fantastic).
>> http://media.fightmymonster.com/Shared/docs/Wait%20Chain%20Algorithm.pdf
>>
>> Final word on this re transactions: if/when transactions are added to
>> locking system in Pelops/Hector/Pycassa, Cassandra will provide better
>> performance than ZooKeeper for storing snapshots, especially as transaction
>> size increases
>>
>> Best, Dominic
>>
>> On 11 December 2011 01:53, Guy Incognito <dnd1...@gmail.com> wrote:
>>
>>>  you could try writing with the clock of the initial replay entry?
>>>
>>> On 06/12/2011 20:26, John Laban wrote:
>>>
>>> Ah, neat.  It is similar to what was proposed in (4) above with adding
>>> transactions to Cages, but instead of snapshotting the data to be rolled
>>> back (the "before" data), you snapshot the data to be replayed (the "after"
>>> data).  And then later, if you find that the transaction didn't complete,
>>> you just keep replaying the transaction until it takes.
>>>
>>>  The part I don't understand with this approach though:  how do you
>>> ensure that someone else didn't change the data between your initial failed
>>> transaction and the later replaying of the transaction?  You could get lost
>>> writes in that situation.
>>>
>>>  Dominic (in the Cages blog post) explained a workaround with that for
>>> his rollback proposal:  all subsequent readers or writers of that data
>>> would have to check for abandoned transactions and roll them back
>>> themselves before they could read the data.  I don't think this is possible
>>> with the XACT_LOG "replay" approach in these slides though, based on how
>>> the data is indexed (cassandra node token + timeUUID).
>>>
>>>
>>>  PS:  How are you liking Cages?
>>>
>>>
>>>
>>>
>>> 2011/12/6 Jérémy SEVELLEC <jsevel...@gmail.com>
>>>
>>>> Hi John,
>>>>
>>>>  I had exactly the same reflexions.
>>>>
>>>>  I'm using zookeeper and cage to lock et isolate.
>>>>
>>>>  but how to rollback?
>>>> It's impossible so try replay!
>>>>
>>>>  the idea is explained in this presentation
>>>> http://www.slideshare.net/mattdennis/cassandra-data-modeling (starting
>>>> from slide 24)
>>>>
>>>>  - insert your whole data into one column
>>>> - make the job
>>>> - remove (or expire) your column.
>>>>
>>>>  if there is a problem during "making the job", you keep the
>>>> possibility to replay and replay and replay (synchronously or in a batch).
>>>>
>>>>  Regards
>>>>
>>>>  Jérémy
>>>>
>>>>
>>>> 2011/12/5 John Laban <j...@pagerduty.com>
>>>>
>>>>> Hello,
>>>>>
>>>>>  I'm building a system using Cassandra as a datastore and I have a
>>>>> few places where I am need of transactions.
>>>>>
>>>>>  I'm using ZooKeeper to provide locking when I'm in need of some
>>>>> concurrency control or isolation, so that solves that half of the puzzle.
>>>>>
>>>>>  What I need now is to sometimes be able to get atomicity across
>>>>> multiple writes by simulating the "begin/rollback/commit" abilities of a
>>>>> relational DB.  In other words, there are places where I need to perform
>>>>> multiple updates/inserts, and if I fail partway through, I would ideally 
>>>>> be
>>>>> able to rollback the partially-applied updates.
>>>>>
>>>>>  Now, I *know* this isn't possible with Cassandra.  What I'm looking
>>>>> for are all the best practices, or at least tips and tricks, so that I can
>>>>> get around this limitation in Cassandra and still maintain a consistent
>>>>> datastore.  (I am using quorum reads/writes so that eventual consistency
>>>>> doesn't kick my ass here as well.)
>>>>>
>>>>>  Below are some ideas I've been able to dig up.  Please let me know
>>>>> if any of them don't make sense, or if there are better approaches:
>>>>>
>>>>>
>>>>>  1) Updates to a row in a column family are atomic.  So try to model
>>>>> your data so that you would only ever need to update a single row in a
>>>>> single CF at once.  Essentially, you model your data around transactions.
>>>>>  This is tricky but can certainly be done in some situations.
>>>>>
>>>>>  2) If you are only dealing with multiple row *inserts* (and not
>>>>> updates), have one of the rows act as a 'commit' by essentially validating
>>>>> the presence of the other rows.  For example, say you were performing an
>>>>> operation where you wanted to create an Account row and 5 User rows all at
>>>>> once (this is an unlikely example, but bear with me).  You could insert 5
>>>>> rows into the Users CF, and then the 1 row into the Accounts CF, which 
>>>>> acts
>>>>> as the commit.  If something went wrong before the Account could be
>>>>> created, any Users that had been created so far would be orphaned and
>>>>> unusable, as your business logic can ensure that they can't exist without
>>>>> an Account.  You could also have an offline cleanup process that swept 
>>>>> away
>>>>> orphans.
>>>>>
>>>>>  3) Try to model your updates as idempotent column inserts instead.
>>>>>  How do you model updates as inserts?  Instead of munging the value
>>>>> directly, you could insert a column containing the operation you want to
>>>>> perform (like "+5").  It would work kind of like the Consistent Vote
>>>>> Counting implementation: ( https://gist.github.com/416666 ).  How do
>>>>> you make the inserts idempotent?  Make sure the column names correspond to
>>>>> a request ID or some other identifier that would be identical across
>>>>> re-drives of a given (perhaps originally failed) request.  This could 
>>>>> leave
>>>>> your datastore in a temporarily inconsistent state, but would eventually
>>>>> become consistent after a successful re-drive of the original request.
>>>>>
>>>>>  4) You could take an approach like Dominic Williams proposed with
>>>>> Cages:
>>>>> http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/
>>>>>    The gist is that you snapshot all the original values that you're about
>>>>> to munge somewhere else (in his case, ZooKeeper), make your updates, and
>>>>> then delete the snapshot (and that delete needs to be atomic).  If the
>>>>> snapshot data was never deleted, then subsequent accessors (even readers)
>>>>> of the data rows need to do the rollback of the previous transaction
>>>>> themselves before they can read/write this data.  They do the rollback by
>>>>> just overwriting the current values with what is in the snapshot.  It
>>>>> offloads the work of the rollback to the next worker that accesses the
>>>>> data.  This approach probably needs an generic/high-level programming 
>>>>> layer
>>>>> to handle all of the details and complexity, and it doesn't seem like it
>>>>> was ever added to Cages.
>>>>>
>>>>>
>>>>>  Are there other approaches or best practices that I missed?  I would
>>>>> be very interested in hearing any opinions from those who have tackled
>>>>> these problems before.
>>>>>
>>>>>  Thanks!
>>>>>  John
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>   --
>>>> Jérémy
>>>>
>>>
>>>
>>>
>>
>

Reply via email to