Great idea for implementing the queue pattern. Thank you, Edward.

However, with your design there are still corner cases where 2 consumers can
read from the same queue. Reading and writing with QUORUM does not prevent
race conditions. I believe the new CAS feature of C* 2.0 might be useful
here, but at the expense of reduced throughput (because of the Paxos round).
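
For illustration, here is a minimal CQL sketch of how a lightweight
transaction could let exactly one consumer claim a bucket (the table and
names are purely illustrative, not part of Edward's design):

CREATE TABLE queue_lease (
    bucket text PRIMARY KEY,
    owner  text
);

-- only the first consumer to run this for a given bucket gets [applied] = true;
-- everyone else sees the existing owner and backs off
INSERT INTO queue_lease (bucket, owner)
VALUES ('TODO:1', 'consumer-10.0.0.5')
IF NOT EXISTS;

Each such conditional insert runs a Paxos round among the replicas, which is
where the throughput hit comes from.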




On Tue, Feb 4, 2014 at 4:50 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:

> I have actually been building something similar in my spare time. You can
> hang around and wait for it or build your own. Here are the basics. Not
> perfect, but it will work.
>
> Create a column family "queue" with gc_grace_seconds = [1 day]
>
> set queue [timeuuid()] ["z"+timeuuid()] = [ work to do ]
>
> The producer can decide how it wants to roll over the row key and the
> column key; it does not matter.
>
> Suppose there are N consumers. We need a way for the consumers not to do
> the same work. We can use something like the bakery algorithm. Remember that
> at QUORUM a reader sees preceding QUORUM writes.
>
> A consumer needs an identifier (it could be another uuid or an ip address).
> A consumer calls get_range_slice on the queue; the slice is from new byte[]
> to new byte[], limit 100.
>
> The consumer sees data like this.
>
> [1234] [z-$timeuuid] = data
>
> Now we register that this consumer wants to consume this queue:
>
> set [1234] [a-${ip}] at QUORUM
>
> Now we do a slice:
> get_slice [1234] from new byte[] to 'b'
>
> There are a few possible returns.
> 1) 1 bidder...
> [1234] [a-$myip]
> You won; start consuming.
>
> 2)  2 bidders
> [1234] [a-$myip]
> [1234] [a-$otherip]
> compare $myip vs $otherip; the higher one wins.
>
> Whoever wins can then start consuming the columns in the queue and delete
> them when done.
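>
> In CQL 3 terms, the schema and the bidding step could look roughly like the
> sketch below. Names, the bucket literal '1234' and the payloads are purely
> illustrative; 'a-<consumer id>' bid columns sort before 'z-<timeuuid>' work
> columns, and QUORUM is set on the client, not in the statements.
>
> CREATE TABLE queue (
>     bucket text,
>     col    text,
>     value  blob,
>     PRIMARY KEY (bucket, col)
> ) WITH gc_grace_seconds = 86400;
>
> -- producer: enqueue a work item under the current bucket
> INSERT INTO queue (bucket, col, value) VALUES ('1234', 'z-0001', 0x00);
>
> -- consumer 10.0.0.5: register a bid for bucket 1234
> INSERT INTO queue (bucket, col, value) VALUES ('1234', 'a-10.0.0.5', 0x00);
>
> -- read back everything sorting before 'b' (i.e. all the bids); compare the
> -- ids, and the winner starts consuming the 'z-' columns, deleting them when done
> SELECT col FROM queue WHERE bucket = '1234' AND col < 'b';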
>
>
> On Friday, January 31, 2014, DuyHai Doan <doanduy...@gmail.com> wrote:
> > Thanks Nate for your ideas.
> >> This could be as simple as adding year and month to the primary key (in
> the form 'yyyymm'). Alternatively, you could add this in the partition key
> definition. Either way, it then becomes pretty easy to re-generate
> these based on the query parameters.
> >
> >  The thing is that it's not that simple. My customer has a very BAD
> idea: using Cassandra as a queue (the perfect example of an anti-pattern).
> >  Before trying to tell them to redesign their entire architecture and
> put in some queueing system like ActiveMQ or something similar, I would
> like to see how I can use wide rows to meet the requirements.
> >  The functional need is quite simple:
> >  1) A process A loads users into Cassandra and sets the status on each
> user to 'TODO'. When using the bucketing technique, we can limit a row
> width to, let's say, 100 000 columns. So at the end of the current row,
> process A knows that it should move to the next bucket. The bucket is encoded
> in a composite partition key; in our example it would be 'TODO:1', 'TODO:2',
> etc. (a rough CQL sketch of this layout follows point 3 below)
> >
> >  2) A process B reads the wide row for the 'TODO' status. It starts at
> bucket 1, so it will read the row with partition key 'TODO:1'. The users are
> processed and inserted into a new row, 'PROCESSED:1' for example, to keep
> track of their status. After retrieving 100 000 columns, it will switch
> automatically to the next bucket. Simple. Fair enough.
> >
> >  3) Now, what sucks is that sometimes process B does not have enough
> data to perform the functional logic on the users it fetched from the wide
> row, so it has to RE-PUT some users back into the 'TODO' status rather than
> transition them to 'PROCESSED'. That's exactly queue behavior.
> >  A simplistic idea would be to re-insert those m users under 'TODO:n',
> with n higher than the current bucket number, so they can be processed later.
> But then it screws up the whole counting system. Process A, which inserts
> data, will not know that there are already m users in row n, so it will
> happily add 100 000 columns, making the row size grow to 100 000 + m. When
> process B reads this row back again, it will stop at the first 100 000
> columns and skip the trailing m elements.
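> >
> > A rough CQL sketch of that bucketed layout (table and column names are
> > illustrative; status_bucket holds values like 'TODO:1' or 'PROCESSED:1'):
> >
> > CREATE TABLE user_status (
> >     status_bucket text,
> >     user_id       timeuuid,
> >     payload       text,
> >     PRIMARY KEY (status_bucket, user_id)
> > );
> >
> > -- process A: fill the current TODO bucket
> > INSERT INTO user_status (status_bucket, user_id, payload)
> > VALUES ('TODO:1', now(), '...');
> >
> > -- process B: drain bucket 1, up to the agreed 100 000 columns
> > SELECT user_id, payload FROM user_status
> > WHERE status_bucket = 'TODO:1' LIMIT 100000;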
> >   That's the main reason I dropped the idea of bucketing (which is quite
> smart in the normal case) in favor of an ultra wide row.
> >  Anyway, I'll follow your advice and play around with the parameters of
> SizeTiered compaction.
> >  Regards
> >  Duy Hai DOAN
> >
> > On Fri, Jan 31, 2014 at 9:23 PM, Nate McCall <n...@thelastpickle.com>
> wrote:
> >>>
> >>>  The only drawback for ultra wide rows I can see is point 1). But if I
> use leveled compaction with a sufficiently large value for
> "sstable_size_in_mb" (let's say 200 MB), will my read performance be
> impacted as the row grows?
> >>
> >> For this use case, you would want to use SizeTieredCompaction and play
> around with the configuration a bit to keep a small number of large
> SSTables. Specifically: keep min_threshold and max_threshold really low, set
> bucket_low and bucket_high closer together (maybe even both to 1.0), and use
> a larger min_sstable_size.
> >> YMMV though - per Rob's suggestion, take the time to run some tests
> tweaking these options.
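> >>
> >> As a rough illustration only, those knobs map onto the CQL compaction
> >> sub-options as below; the table name and the values are just a starting
> >> point to test from, not a recommendation (min_sstable_size is given in
> >> bytes, here roughly 200 MB).
> >>
> >> ALTER TABLE user_status WITH compaction = {
> >>     'class'            : 'SizeTieredCompactionStrategy',
> >>     'min_threshold'    : '2',
> >>     'max_threshold'    : '4',
> >>     'bucket_low'       : '1.0',
> >>     'bucket_high'      : '1.0',
> >>     'min_sstable_size' : '209715200'
> >> };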
> >>
> >>>
> >>>  Of course, splitting a wide row into several rows using the bucketing
> technique is one solution, but it forces us to keep track of the bucket
> number and it's not convenient. We have one process (JVM) that inserts data
> and another process (JVM) that reads data. Using bucketing, we need to
> synchronize the bucket number between the 2 processes.
> >>
> >> This could be as simple as adding year and month to the primary key (in
> the form 'yyyymm'). Alternatively, you could add this in the partition key
> definition. Either way, it then becomes pretty easy to re-generate
> these based on the query parameters.
> >>
> >>
> >> --
> >> -----------------
> >> Nate McCall
> >> Austin, TX
> >> @zznate
> >>
> >> Co-Founder & Sr. Technical Consultant
> >> Apache Cassandra Consulting
> >> http://www.thelastpickle.com
> >
>
