Re: Strong Consistency with ONE read/writes

AJ Sun, 03 Jul 2011 16:52:49 -0700

On 7/3/2011 4:07 PM, Yang wrote:

I'm no expert. So addressing the question to me probably give you real answers :)
The single entry mode makes sure that all writes coming through the leader are received by replicas before ack to client. Probably won't be stale data

That doesn't sound any different than a TWO write. I'm trying to save a hop (+ 1 data xfer) by ack'ing immediately after the primary successfully writes, i.e., ONE write.

On Jul 3, 2011 11:20 AM, "AJ" <a...@dude.podzone.net> wrote:

> Yang,
>
> How would you deal with the problem when the 1st node responds success
> but then crashes before completely forwarding any replicas? Then, after
> switching to the next primary, a read would return stale data.
>
> Here's a quick-n-dirty way: Send the value to the primary replica and
> send placeholder values to the other replicas. The placeholder value is
> something like, "PENDING_UPDATE". The placeholder values are sent with
> timestamps 1 less than the timestamp for the actual value that went to
> the primary. Later, when the changes propagate, the actual values will
> overwrite the placeholders. In event of a crash before the placeholder
> gets overwritten, the next read value will tell the client so. The
> client will report to the user that the key/column is unavailable. The
> downside is you've overwritten your data and maybe would like to know
> what the old data was! But, maybe there's another way using other
> columns or with MVCC. The client would want a success from the primary
> and the secondary replicas to be certain of future read consistency in
> case the primary goes down immediately as I said above. The ability to
> set an "update_pending" flag on any column value would probably make
> this work. But, I'll think more on this later.
>
> aj
>
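For readers following the placeholder idea quoted above, here is a minimal in-memory sketch in Python. It is not Cassandra code: the `Replica` class, `write_one`, `propagate`, `read_one`, and `KeyUnavailable` are hypothetical names made up for illustration, and the only behavior assumed is last-write-wins by timestamp. It shows the primary taking the real value at timestamp `ts`, the other replicas taking "PENDING_UPDATE" at `ts - 1`, and a read after a primary crash surfacing the placeholder as "unavailable".

```python
PENDING = "PENDING_UPDATE"  # placeholder value named in the message above


class KeyUnavailable(Exception):
    """Hypothetical error: the key has an update that never propagated."""


class Replica:
    """Toy replica with last-write-wins cells (an assumption, not C* code)."""

    def __init__(self):
        self.cells = {}  # key -> (timestamp, value)
        self.up = True

    def apply(self, key, ts, value):
        # Keep the incoming cell only if its timestamp is strictly newer.
        current = self.cells.get(key)
        if current is None or ts > current[0]:
            self.cells[key] = (ts, value)

    def read(self, key):
        return self.cells.get(key)


def write_one(replicas, key, ts, value):
    """Real value to the primary, placeholders at ts - 1 to the rest."""
    primary, *others = replicas
    primary.apply(key, ts, value)
    for replica in others:
        replica.apply(key, ts - 1, PENDING)


def propagate(replicas, key):
    """Later propagation: the real value (ts) overwrites placeholders (ts - 1)."""
    primary, *others = replicas
    cell = primary.read(key)
    if cell is not None:
        for replica in others:
            replica.apply(key, cell[0], cell[1])


def read_one(replicas, key):
    """Read from the first live replica; a placeholder means 'unavailable'."""
    for replica in replicas:
        if replica.up:
            cell = replica.read(key)
            if cell is None:
                return None
            if cell[1] == PENDING:
                raise KeyUnavailable(key)
            return cell[1]
    raise KeyUnavailable(key)
```

Note that the placeholder at `ts - 1` is still newer than the previous value on the secondaries, so it replaces that value. That is the downside mentioned in the message: the old data is gone once the placeholder lands.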
> On 7/2/2011 10:55 AM, Yang wrote:
>> there is a JIRA completed in 0.7.x that "Prefers" a certain node in
>> snitch, so this does roughly what you want MOST of the time
>>
>>
>> but the problem is that it does not GUARANTEE that the same node will
>> always be read. I recently read into the HBase vs Cassandra
>> comparison thread that started after Facebook dropped Cassandra for
>> their messaging system, and understood some of the differences. what
>> you want is essentially what HBase does. the fundamental difference
>> there is really due to the gossip protocol: it's a probabilistic, or
>> eventually consistent failure detector while HBase/Google Bigtable
>> use Zookeeper/Chubby to provide a strong failure detector (a
>> distributed lock). so in HBase, if a tablet server goes down, it
>> really goes down, it can not re-grab the tablet from the new tablet
>> server without going through a start up protocol (notifying the
>> master, which would notify the clients etc), in other words it is
>> guaranteed that one tablet is served by only one tablet server at any
>> given time. in comparison the above JIRA only TRIES to serve that
>> key from one particular replica. HBase can have that guarantee because
>> the group membership is maintained by the strong failure detector.
>>
>> just for hacking curiosity, a strong failure detector + Cassandra
>> replicas is not impossible (actually seems not difficult), although
>> the performance is not clear. what would such a strong failure
>> detector bring to Cassandra besides this ONE-ONE strong consistency ?
>> that is an interesting question I think.
>>
>> considering that HBase has been deployed on big clusters, it is
>> probably OK with the performance of the strong Zookeeper failure
>> detector. then a further question was: why did Dynamo originally
>> choose to use the probabilistic failure detector? yes Dynamo's main
>> theme is "eventually consistent", so the Phi-detector is **enough**,
>> but if a strong detector buys us more with little cost, wouldn't that
>> be great?
>>
>>
>>

>> On Fri, Jul 1, 2011 at 6:53 PM, AJ <a...@dude.podzone.net> wrote:
>>
>> Is this possible?
>>
>> All reads and writes for a given key will always go to the same
>> node from a client. It seems the only thing needed is to allow
>> the clients to compute which node is the closest replica for the
>> given key using the same algorithm C* uses. When the first
>> replica receives the write request, it will write to itself which
>> should complete before any of the other replicas and then return.
>> The loads should still stay balanced if using random partitioner.
>> If the first replica becomes unavailable (however that is
>> defined), then the clients can send to the next replica in the
>> ring and switch from ONE write/reads to QUORUM write/reads
>> temporarily until the first replica becomes available again.
>> QUORUM is required since there could be some replicas that were
>> not updated after the first replica went down.
>>
>> Will this work? The goal is to have strong consistency with a
>> read/write consistency level as low as possible while secondarily
>> a network performance boost.
>>
>>
>
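The client-side routing proposed in the original question can be sketched roughly as follows, in Python. This is a simplified illustration, not the algorithm C* actually uses: it assumes an MD5-based token (in the spirit of the random partitioner), replicas placed on consecutive ring positions, and an `is_up` callback that stands in for whatever failure detection the client has. The names `token`, `replicas_for`, and `choose_target` are hypothetical.

```python
import bisect
import hashlib


def token(key):
    """Map a key to an integer token via MD5 (simplifying assumption)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


def replicas_for(key, ring, rf):
    """Return the rf nodes for a key, walking the ring clockwise.

    ring is a list of (token, node) pairs sorted by token. The first
    replica is the first node whose token is >= the key's token,
    wrapping around to the start of the ring if there is none.
    """
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, token(key)) % len(ring)
    count = min(rf, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


def choose_target(key, ring, rf, is_up):
    """Pick the node to contact and the consistency level to use.

    First replica reachable: talk to it at ONE.
    Otherwise: talk to the next live replica at QUORUM, since some
    replicas may have missed updates acked only by the first replica.
    """
    replicas = replicas_for(key, ring, rf)
    if is_up(replicas[0]):
        return replicas[0], "ONE"
    for node in replicas[1:]:
        if is_up(node):
            return node, "QUORUM"
    raise RuntimeError("no replica available for key %r" % key)
```

The sketch leaves out the hard part raised later in the thread: every client has to agree on whether the first replica is up, which the `is_up` callback simply assumes.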
