Tyler,
I can consider trying out lightweight transactions, but here are my concerns:
#1. We have 2 data centers located close by, with plans to expand to more
data centers that are even further apart geographically.
#2. How will lightweight transactions be impacted when there is a high level
of network contention for cross-data-center traffic?
#3. Do you know of any real examples where companies have used lightweight
transactions in a multi-data-center setup?
Regards
Sachin
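For concreteness, a sketch of the kind of conditional write being discussed, run from cqlsh against the DocumentStore schema described later in the thread. The LOCAL_SERIAL setting is one way to keep the Paxos round out of the cross-data-center path; it is an assumption here, not something anyone in the thread has tested:

-- Confine the Paxos round for conditional writes to the local data center.
SERIAL CONSISTENCY LOCAL_SERIAL;
-- Keep regular (non-conditional) reads and writes local as well.
CONSISTENCY LOCAL_QUORUM;

-- Only apply the write if the stored version is older.
UPDATE DocumentStore
SET Document = ?, Version = ?
WHERE DocumentID = ?
IF Version < ?;

The trade-off relevant to concern #2: LOCAL_SERIAL avoids cross-data-center round trips, but it only linearizes conditional writes within one data center, so concurrent writers in different data centers can still conflict.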
On Tue, Mar 24, 2015 at 10:56 AM, Tyler Hobbs <ty...@datastax.com> wrote:

>> do you just mean that it's easy to forget to always set your timestamp
>> correctly, and if you goof it up, it makes it difficult to recover from
>> (i.e. you issue a delete with the system timestamp instead of the
>> document version, and that's way larger than your document version would
>> ever be, so you can never write that document again)?
>
> Yes, that's basically what I meant. Plus, if you need to make a manual
> correction to a document, you'll need to increment the version, which
> would presumably cause problems for your application. It's possible to
> handle all of this correctly if you take care, but I wouldn't trust
> myself to always get this right.
>
>> @Tyler
>> With your recommendation, won't I end up saving all the version(s) of
>> the document? In my case the document is pretty huge (~5 MB) and each
>> document has up to 10 versions. And you already highlighted that
>> lightweight transactions are very expensive.
>
> You can always delete older versions to free up space.
>
> Using lightweight transactions may be a decent option if you don't have
> really high write throughput and aren't expecting high contention (which
> I don't think you are). I recommend testing this out with your
> application to see how it performs for you.
>
> On Sun, Mar 22, 2015 at 7:02 PM, Sachin Nikam <skni...@gmail.com> wrote:
>
>> @Eric Stevens
>> Thanks for representing my position while I came back to this thread.
>>
>> @Tyler
>> With your recommendation, won't I end up saving all the version(s) of
>> the document? In my case the document is pretty huge (~5 MB) and each
>> document has up to 10 versions. And you already highlighted that
>> lightweight transactions are very expensive.
>>
>> Also, as Eric mentions, can you elaborate on what kind of problems could
>> happen when we try to overwrite or delete data?
>> Regards
>> Sachin
>>
>> On Fri, Mar 13, 2015 at 4:23 AM, Brice Dutheil <brice.duth...@gmail.com>
>> wrote:
>>
>>> I agree with Tyler: in the normal run of a live application I would not
>>> recommend using the timestamp; use other ways to *version* *inserts*.
>>> Otherwise you may fall into the *upsert* pitfalls that Tyler mentions.
>>>
>>> However, I find there's a legitimate use for the USING TIMESTAMP trick
>>> when migrating data from another datastore.
>>>
>>> The trick is, at some point, to enable the application to start writing
>>> to Cassandra *without* any timestamp setting on the statements. ⇐ for
>>> fresh data
>>> Then start a migration batch that will use a write time with an older
>>> date (i.e. when there's *no* possible *collision* with other data). ⇐
>>> for older data
>>>
>>> *This trick has been used in prod with billions of records.*
>>>
>>> -- Brice
>>>
>>> On Thu, Mar 12, 2015 at 10:42 PM, Eric Stevens <migh...@gmail.com>
>>> wrote:
>>>
>>>> Ok, but if you're using a system of time that isn't server clock
>>>> oriented (Sachin's document revision ID, and my fixed and necessarily
>>>> consistent base timestamp [B's always know their parent A's exact
>>>> recorded timestamp]), isn't the principle of using timestamps to force
>>>> a particular update out of several to win still sound?
>>>>
>>>> > as using the clocks is only valid if clocks are perfectly sync'ed,
>>>> > which they are not
>>>>
>>>> Clock skew is a problem which doesn't seem to be a factor in either
>>>> use case, given that both have a consistent external source of truth
>>>> for timestamps.
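A sketch of the migration trick Brice describes above, written against the DocumentStore schema from this thread; the cutover date and its microsecond value are illustrative assumptions:

-- Fresh data: live application writes carry no explicit timestamp,
-- so the server assigns the current time in microseconds.
INSERT INTO DocumentStore (DocumentID, Document, Version)
VALUES (?, ?, ?);

-- Older data: the migration batch stamps every write with a timestamp
-- predating the cutover (1420070400000000 = 2015-01-01 00:00:00 UTC in
-- microseconds), so a live write always wins over a migrated row.
INSERT INTO DocumentStore (DocumentID, Document, Version)
VALUES (?, ?, ?)
USING TIMESTAMP 1420070400000000;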
>>>>
>>>> On Thu, Mar 12, 2015 at 12:58 PM, Jonathan Haddad <j...@jonhaddad.com>
>>>> wrote:
>>>>
>>>>> In most datacenters you're going to see significant variance in your
>>>>> server times, likely > 20ms between servers in the same rack. Even
>>>>> Google, using atomic clocks, has 1-7ms variance. [1]
>>>>>
>>>>> I would +1 Tyler's advice here, as using the clocks is only valid if
>>>>> clocks are perfectly sync'ed, which they are not, and likely never
>>>>> will be in our lifetime.
>>>>>
>>>>> [1] http://queue.acm.org/detail.cfm?id=2745385
>>>>>
>>>>> On Thu, Mar 12, 2015 at 7:04 AM Eric Stevens <migh...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> > It's possible, but you'll end up with problems when attempting to
>>>>>> > overwrite or delete entries
>>>>>>
>>>>>> I'm wondering if you can elucidate on that a little bit. Do you just
>>>>>> mean that it's easy to forget to always set your timestamp
>>>>>> correctly, and if you goof it up, it makes it difficult to recover
>>>>>> from (i.e. you issue a delete with the system timestamp instead of
>>>>>> the document version, and that's way larger than your document
>>>>>> version would ever be, so you can never write that document again)?
>>>>>> Or is there some bug in write timestamps that can cause the wrong
>>>>>> entry to win the write contention?
>>>>>>
>>>>>> We're looking at doing something similar to keep a live max value
>>>>>> column in a given table. Our setup is as follows:
>>>>>>
>>>>>> CREATE TABLE a (
>>>>>>   id <whatever>,
>>>>>>   time timestamp,
>>>>>>   max_b_foo int,
>>>>>>   PRIMARY KEY (id)
>>>>>> );
>>>>>> CREATE TABLE b (
>>>>>>   b_id <whatever>,
>>>>>>   a_id <whatever>,
>>>>>>   a_timestamp timestamp,
>>>>>>   foo int,
>>>>>>   PRIMARY KEY (a_id, b_id)
>>>>>> );
>>>>>>
>>>>>> The idea being that there's a one-to-many relationship between *a*
>>>>>> and *b*. We want *a* to know what the maximum value is in *b* for
>>>>>> field *foo* so we can avoid reading *all* of *b* when we want to
>>>>>> resolve *a*. You can see that we can't just use *b*'s clustering key
>>>>>> to resolve that with LIMIT 1; also, this is for DSE Solr, which
>>>>>> wouldn't be able to query a by max b.foo anyway. So when we write to
>>>>>> *b*, we also write to *a* with something like:
>>>>>>
>>>>>> UPDATE a USING TIMESTAMP ${b.a_timestamp.toMicros + b.foo}
>>>>>> SET max_b_foo = ${b.foo} WHERE id = ${b.a_id}
>>>>>>
>>>>>> Assuming that we don't run afoul of related antipatterns, such as
>>>>>> repeatedly overwriting the same value indefinitely, this strikes me
>>>>>> as sound if unorthodox practice, as long as conflict resolution in
>>>>>> Cassandra isn't broken in some subtle way. We also designed this to
>>>>>> be safe from getting write timestamps greatly out of sync with clock
>>>>>> time, so that non-timestamped operations (especially deletes), if
>>>>>> done accidentally, will still have a reasonable chance of having the
>>>>>> expected results.
>>>>>>
>>>>>> So while it may not be the intended use case for write timestamps,
>>>>>> and there are definitely gotchas if you are not careful or
>>>>>> misunderstand the consequences, as far as I can see the logic behind
>>>>>> it is sound, but it does rely on correct conflict resolution in
>>>>>> Cassandra. I'm curious if I'm missing or misunderstanding something
>>>>>> important.
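To make the failure mode Eric asks about concrete, a sketch with illustrative values (DocumentID 42 and the timestamps are assumptions, not from the thread). Version numbers used as write timestamps are tiny compared to the server's wall clock in microseconds, so one unstamped delete shadows every later versioned write:

-- Versioned writes use the small document version as the write timestamp.
UPDATE DocumentStore USING TIMESTAMP 7
SET Document = 'v7', Version = 7
WHERE DocumentID = 42;

-- An accidental delete without USING TIMESTAMP gets the server clock,
-- roughly 1427000000000000 microseconds since the epoch in March 2015.
DELETE FROM DocumentStore WHERE DocumentID = 42;

-- This write "loses" to the tombstone above (8 < 1427000000000000) and is
-- silently dropped; the document can never be written again this way.
UPDATE DocumentStore USING TIMESTAMP 8
SET Document = 'v8', Version = 8
WHERE DocumentID = 42;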
>>>>>>
>>>>>> On Wed, Mar 11, 2015 at 4:11 PM, Tyler Hobbs <ty...@datastax.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Don't use the version as your timestamp. It's possible, but you'll
>>>>>>> end up with problems when attempting to overwrite or delete
>>>>>>> entries.
>>>>>>>
>>>>>>> Instead, make the version part of the primary key:
>>>>>>>
>>>>>>> CREATE TABLE document_store (
>>>>>>>   document_id bigint,
>>>>>>>   version int,
>>>>>>>   document text,
>>>>>>>   PRIMARY KEY (document_id, version)
>>>>>>> ) WITH CLUSTERING ORDER BY (version DESC);
>>>>>>>
>>>>>>> That way you don't have to worry about overwriting higher versions
>>>>>>> with a lower one, and to read the latest version, you only have to
>>>>>>> do:
>>>>>>>
>>>>>>> SELECT * FROM document_store WHERE document_id = ? LIMIT 1;
>>>>>>>
>>>>>>> Another option is to use lightweight transactions (i.e. UPDATE ...
>>>>>>> SET document = ?, version = ? WHERE document_id = ? IF version
>>>>>>> < ?), but that's going to make writes much more expensive.
>>>>>>>
>>>>>>> On Wed, Mar 11, 2015 at 12:45 AM, Sachin Nikam <skni...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I am planning to use the UPDATE ... USING TIMESTAMP ... statement
>>>>>>>> to make sure that I do not overwrite fresh data with stale data,
>>>>>>>> while avoiding having to do at least LOCAL_QUORUM writes.
>>>>>>>>
>>>>>>>> Here is my table structure:
>>>>>>>>
>>>>>>>> Table=DocumentStore
>>>>>>>> DocumentID (primary key, bigint)
>>>>>>>> Document (text)
>>>>>>>> Version (int)
>>>>>>>>
>>>>>>>> If the service receives 2 write requests with Version=1 and
>>>>>>>> Version=2, regardless of the order of arrival, the business
>>>>>>>> requirement is that we end up with Version=2 in the database.
>>>>>>>>
>>>>>>>> Can I use the following CQL statement?
>>>>>>>>
>>>>>>>> UPDATE DocumentStore USING TIMESTAMP <versionValue>
>>>>>>>> SET Document=<documentValue>,
>>>>>>>>     Version=<versionValue>
>>>>>>>> WHERE DocumentID=<documentIDValue>;
>>>>>>>>
>>>>>>>> Has anybody used something like this? If so, was the behavior as
>>>>>>>> expected?
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Sachin
>>>>>>>
>>>>>>> --
>>>>>>> Tyler Hobbs
>>>>>>> DataStax <http://datastax.com/>
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
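Finally, a sketch of the cleanup Tyler mentions above ("you can always delete older versions to free up space"), against the document_store schema with version as a clustering column. The range form assumes the cluster runs a Cassandra version that supports range deletes on clustering columns (3.0+); on older versions, delete versions individually:

-- Range delete: drop all versions older than a cutoff (Cassandra 3.0+).
DELETE FROM document_store WHERE document_id = ? AND version < ?;

-- Portable alternative: delete one known older version at a time.
DELETE FROM document_store WHERE document_id = ? AND version = ?;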