On 7/3/2011 6:32 PM, William Oberman wrote:
Was just going off of: "Send the value to the primary replica and send placeholder values to the other replicas."  That sounded to me like you wanted to write the value to one node and write the placeholder to the other N-1.

Yes, that is what I was suggesting. The point of the placeholders is to handle the crash case that I talked about... "like" a WAL does.

But C* will propagate the value to the other N-1 eventually anyway, 'cause that's just what it does :-)

will

On Sun, Jul 3, 2011 at 7:47 PM, AJ <a...@dude.podzone.net> wrote:

    On 7/3/2011 3:49 PM, Will Oberman wrote:
    Why not send the value itself instead of a placeholder?  Now it
    takes 2x writes on a random node to do a single update (write
    placeholder, then write the update) and N writes from the client
    (write the value to one, write the placeholder to N-1), where N is
    the replication factor.  Seems like extra network and IO instead
    of less...
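
    (For example, with N = 3 the client sends three messages, the value
    plus two placeholders, and each secondary replica eventually performs
    two writes, the placeholder and then the replicated value, instead of
    one.)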

    To send the value to each node is 1) unnecessary and 2) will only
    cause a large burst of network traffic.  Think about if it's a
    large data value, such as a document.  Just let C* do its thing.
    The extra placeholder messages are tiny and don't significantly
    increase latency since they are all sent asynchronously.


    Of course, I still think this sounds like reimplementing
    Cassandra internals in a Cassandra client (just guessing, I'm not
    a Cassandra dev)


    I don't see how.  Maybe you should take a peek at the source.



    On Jul 3, 2011, at 5:20 PM, AJ <a...@dude.podzone.net> wrote:

    Yang,

    How would you deal with the problem where the 1st node reports
    success but then crashes before forwarding the value to any of the
    replicas?  Then, after switching to the next primary, a read
    would return stale data.

    Here's a quick-n-dirty way:  Send the value to the primary
    replica and send placeholder values to the other replicas.  The
    placeholder value is something like, "PENDING_UPDATE".  The
    placeholder values are sent with timestamps 1 less than the
    timestamp for the actual value that went to the primary.  Later,
    when the changes propagate, the actual values will overwrite the
    placeholders.  In the event of a crash before the placeholder gets
    overwritten, the next read will return the placeholder, and the
    client will report to the user that the key/column is
    unavailable.  The downside is you've overwritten your data and
    maybe would like to know what the old data was!  But, maybe
    there's another way using other columns or with MVCC.  The
    client would want a success from the primary and the secondary
    replicas to be certain of future read consistency in case the
    primary goes down immediately as I said above.  The ability to
    set an "update_pending" flag on any column value would probably
    make this work.  But, I'll think more on this later.
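
    A minimal sketch of that write path, assuming a hypothetical
    CassandraClient.write(node, key, column, value, timestamp) call (the
    real call depends on the client library; CassandraClient,
    PlaceholderUpdate, and PENDING are illustrative names):

    import java.util.List;

    // Hypothetical per-node write interface; a stand-in for whatever
    // Thrift/Hector call the application actually uses.
    interface CassandraClient {
        void write(String node, String key, String column,
                   byte[] value, long timestamp);
    }

    class PlaceholderUpdate {
        static final byte[] PENDING = "PENDING_UPDATE".getBytes();

        // The real value goes to the primary replica; every other replica
        // gets the placeholder stamped 1 lower, so the real value wins
        // when normal replication later delivers it to them.
        static void update(CassandraClient client, List<String> replicas,
                           String key, String column, byte[] value) {
            long ts = System.currentTimeMillis() * 1000L;  // microsecond-style
            client.write(replicas.get(0), key, column, value, ts);
            for (String node : replicas.subList(1, replicas.size())) {
                client.write(node, key, column, PENDING, ts - 1);
            }
        }
    }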

    aj

    On 7/2/2011 10:55 AM, Yang wrote:
    There is a JIRA completed in 0.7.x that "prefers" a certain
    node in the snitch, so it does roughly what you want MOST of the
    time.


    but the problem is that it does not GUARANTEE that the same
    node will always be read.  I recently read the HBase vs
    Cassandra comparison thread that started after Facebook dropped
    Cassandra for their messaging system, and understood some of
    the differences.  What you want is essentially what HBase does.
    The fundamental difference there is really due to the gossip
    protocol: it's a probabilistic, or eventually consistent, failure
    detector, while HBase/Google Bigtable use Zookeeper/Chubby to
    provide a strong failure detector (a distributed lock).  So in
    HBase, if a tablet server goes down, it really goes down; it
    cannot re-grab the tablet from the new tablet server without
    going through a start-up protocol (notifying the master, which
    would notify the clients, etc.).  In other words, it is
    guaranteed that one tablet is served by only one tablet server
    at any given time.  In comparison, the above JIRA only TRIES to
    serve that key from one particular replica.  HBase can have
    that guarantee because the group membership is maintained by
    the strong failure detector.

    Just for hacking curiosity, a strong failure detector +
    Cassandra replicas is not impossible (actually it seems not
    difficult), although the performance is not clear.  What would
    such a strong failure detector bring to Cassandra besides this
    ONE-ONE strong consistency?  That is an interesting question, I
    think.

    Considering that HBase has been deployed on big clusters, it is
    probably OK with the performance of the strong Zookeeper
    failure detector.  Then a further question: why did Dynamo
    originally choose to use the probabilistic failure detector?  Yes,
    Dynamo's main theme is "eventually consistent", so the
    Phi-detector is **enough**, but if a strong detector buys us
    more at little cost, wouldn't that be great?
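
    (For reference, the Phi accrual idea reduces to something like the
    sketch below.  This assumes exponentially distributed heartbeat
    inter-arrival times; the class and its names are illustrative, not
    Cassandra's actual FailureDetector code.)

    class PhiAccrualDetector {
        private double meanIntervalMillis = 1000;  // running estimate
        private long lastHeartbeat = System.currentTimeMillis();

        void heartbeat() {
            long now = System.currentTimeMillis();
            // exponentially weighted moving average of the heartbeat interval
            meanIntervalMillis = 0.9 * meanIntervalMillis
                               + 0.1 * (now - lastHeartbeat);
            lastHeartbeat = now;
        }

        // phi = -log10(P(node is alive but silent for t))
        //     = -log10(exp(-t / mean)) = t / (mean * ln 10)
        double phi(long nowMillis) {
            double t = nowMillis - lastHeartbeat;
            return t / (meanIntervalMillis * Math.log(10.0));
        }

        // suspect the node once phi crosses a threshold (e.g. 8)
        boolean suspect(long nowMillis, double threshold) {
            return phi(nowMillis) > threshold;
        }
    }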



    On Fri, Jul 1, 2011 at 6:53 PM, AJ <a...@dude.podzone.net> wrote:

        Is this possible?

        All reads and writes for a given key will always go to the
        same node from a client.  It seems the only thing needed is
        to allow the clients to compute which node is the closest
        replica for the given key using the same algorithm C* uses.
        When the first replica receives the write request, it will
        write to itself, which should complete before any of the
        other replicas, and then return.  The loads should still
        stay balanced if using random partitioner.  If the first
        replica becomes unavailable (however that is defined), then
        the clients can send to the next replica in the ring and
        switch from ONE write/reads to QUORUM write/reads
        temporarily until the first replica becomes available
        again.  QUORUM is required since there could be some
        replicas that were not updated after the first replica went
        down.
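
        (For illustration, computing the "closest"/first replica on the
        client might look like the sketch below, under the random
        partitioner with SimpleStrategy placement: MD5 the key onto the
        token ring and take the first node at or after that token.
        ReplicaLocator is an assumed name, not an actual client API.)

        import java.math.BigInteger;
        import java.security.MessageDigest;
        import java.util.SortedMap;
        import java.util.TreeMap;

        class ReplicaLocator {
            // token -> node, populated from the cluster's ring view
            private final TreeMap<BigInteger, String> ring =
                new TreeMap<BigInteger, String>();

            void addNode(BigInteger token, String node) {
                ring.put(token, node);
            }

            // First replica: owner of the first token >= MD5(key),
            // wrapping around the ring.
            String primaryReplica(String key) throws Exception {
                byte[] d = MessageDigest.getInstance("MD5")
                                        .digest(key.getBytes("UTF-8"));
                BigInteger token = new BigInteger(1, d);
                SortedMap<BigInteger, String> tail = ring.tailMap(token);
                return tail.isEmpty() ? ring.firstEntry().getValue()
                                      : tail.get(tail.firstKey());
            }
        }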

        Will this work?  The goal is to have strong consistency
        with a read/write consistency level as low as possible
        while secondarily a network performance boost.
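
        (The failover rule itself might look like this sketch; Route,
        ConsistencyLevel, and the set of down nodes are stand-ins for
        whatever the client library actually exposes:)

        import java.util.List;
        import java.util.Set;

        enum ConsistencyLevel { ONE, QUORUM }

        class Route {
            final String node;
            final ConsistencyLevel cl;
            Route(String node, ConsistencyLevel cl) {
                this.node = node;
                this.cl = cl;
            }
        }

        class FailoverPolicy {
            // ONE at the primary while it is up; otherwise QUORUM via the
            // next live replica, since some replicas may have missed the
            // primary's last writes.
            static Route route(List<String> replicas, Set<String> down) {
                String primary = replicas.get(0);
                if (!down.contains(primary)) {
                    return new Route(primary, ConsistencyLevel.ONE);
                }
                for (String next : replicas.subList(1, replicas.size())) {
                    if (!down.contains(next)) {
                        return new Route(next, ConsistencyLevel.QUORUM);
                    }
                }
                throw new IllegalStateException("no live replica");
            }
        }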

--
Will Oberman
Civic Science, Inc.
3030 Penn Avenue, First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com
