Re: Once again, super columns or composites?
Oh... Sylvain, thanks a lot for such a complete answer. Yes, I understand the mistake in my assumptions about composites. It seems composites are pretty much an advanced version of manually joining keys into a string column name, such as key1:key2. Thanks a lot! Ed

On Thu, Sep 27, 2012 at 2:02 PM, Sylvain Lebresne sylv...@datastax.com wrote:

> But from my understanding, you just can't update composite column, only delete and insert... so this may make my update use case much more complicated.

Let me try to sum things up. In regular column families, a column (value) is defined by 2 keys: the row key and the column name. In super column families, a column (value) is defined by 3 keys: the row key, the super column name and the column name. So a super column is really just the set of columns that share the same (row key, super column name) pair.

The idea of composite columns is to use regular columns, but simply to distinguish multiple parts of the column name. Take the example of a CompositeType with 2 components. In that column family, a column (value) is defined by 3 keys: the row key, the first component of the column name and the second component of the column name.

In other words, composites are a *generalization* of super columns, and super columns are the case of composites with 2 components. Except that super columns are hard-wired in the Cassandra code base in a way that comes with a number of limitations, the main one being that we always deserialize a super column (again, which is just a set of columns) in its entirety when we read it from disk.

So no, it's not true that "you just can't update composite column, only delete and insert", nor that it is not possible to add a sub-column to your composite.

That being said, if you are using the thrift interface, super columns do have a few perks currently:
- the grouping of all the sub-columns composing a super column is hard-wired in Cassandra. The equivalent for composites, which consists of grouping all columns having the same value for a given component, must be done client side. Maybe some client libraries do that for you but I'm not sure (I don't know for Pycassa, for instance).
- there are a few queries that can be easily done with super columns that don't translate easily to composites, namely deleting whole super columns and, to a lesser extent, querying multiple super columns by name. That's due to a few limitations that upcoming versions of Cassandra will solve, but it's not the case with currently released versions.

The bottom line is: if you can do without those few perks, then you'd better use composites since they have fewer limitations. If you can't really do without these perks and can live with the super column limitations, then go on, use super columns. (And if you want the perks without the limitations, wait for Cassandra 1.2 and use CQL3 :D)

> ... and as I know, DynamicComposites is not recommended (and actually not supported by Pycassa).

DynamicComposites don't do what you think they do. They do nothing more than regular composites as far as comparing them to SuperColumns is concerned, except giving you ways to shoot yourself in the foot.

-- Sylvain
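To illustrate the client-side grouping Sylvain mentions, here is a minimal sketch in plain Python. It assumes a row slice has already been read and is represented as a list of ((first_component, second_component), value) pairs. That representation, the function name and the sample data are assumptions made for illustration, not the API of Pycassa or any other client.

```python
from collections import OrderedDict


def group_by_first_component(columns):
    """Group composite columns by the first component of their name.

    `columns` is an iterable of ((first, second), value) pairs, in the
    order the row slice returned them. The result mimics the shape of a
    super column family row: {first: {second: value}}.
    """
    grouped = OrderedDict()
    for (first, second), value in columns:
        grouped.setdefault(first, OrderedDict())[second] = value
    return grouped


# Hypothetical row slice with 2-component composite column names.
row = [
    (("user1", "email"), "a@example.com"),
    (("user1", "name"), "Alice"),
    (("user2", "name"), "Bob"),
]

grouped = group_by_first_component(row)

# Updating a single "sub-column" is an ordinary write to one composite
# column name; nothing else in the group is touched.
grouped["user1"]["name"] = "Alicia"
```

Each first-component value plays the role a super column name would play, which is why no server-side grouping is needed for this access pattern.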
Re: Cassandra Counters
I've recently noticed several threads about Cassandra Counters inconsistencies and started to seriously think about possible workarounds, such as storing realtime counters in Redis and dumping them daily to Cassandra. So general question, should I rely on Counters if I want 100% accuracy? Thanks, Ed

On Tue, Sep 25, 2012 at 8:15 AM, Robin Verlangen ro...@us2.nl wrote:

From my point of view another problem with using a standard column family for counting is transactions. Cassandra lacks them, so if you're updating counters from multiple threads, how will you keep track of that? Yes, I'm aware of software like Zookeeper to do that, however I'm not sure whether that's the best option. I think you should stick with Cassandra counter column families.

Best regards, Robin Verlangen *Software engineer* W http://www.robinverlangen.nl E ro...@us2.nl http://goo.gl/Lt7BC

Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies.

2012/9/25 Roshni Rajagopal roshni_rajago...@hotmail.com

Thanks for the reply and sorry for being bull-headed. Suppose you're past the stage where you've decided it's distributed, and NoSQL, and Cassandra out of all the NoSQL options. Now, to count something, you can do it in different ways in Cassandra. In all of these ways you want to use Cassandra's best features: availability, tunable consistency, partition tolerance etc. Given this, what are the performance tradeoffs of using counters vs a standard column family for counting?
Because as I see it, if the count in a counter column family becomes wrong, it will not be 'eventually consistent': you will need intervention to correct it. So the key aspect is how much faster a counter column family would be, and at what numbers we start seeing a difference.

-- Date: Tue, 25 Sep 2012 07:57:08 +0200 Subject: Re: Cassandra Counters From: oleksandr.pet...@gmail.com To: user@cassandra.apache.org

Maybe I'm missing the point, but counting in a standard column family would be a bit of overkill. I assume that distributed counting here was more of a map/reduce approach, where Hadoop (+ Cascading, Pig, Hive, Cascalog) would help you a lot. We're doing some more complex counting (e.g. based on sets of rules) like that. Of course, that would perform _way_ slower than counting beforehand. On the plus side, you will always have a consistent result for a consistent dataset.

On the other hand, if you use things like AMQP or Storm (sorry to put my sentence together like that, as these tools are mostly either orthogonal or complementary, but I hope you get my point), you could build a topology that makes fault-tolerant writes independently of your original write. Of course, it would still have a consistency tradeoff, mostly because of race conditions, different network latencies etc.

So I would say that building a data model in a distributed system often depends more on your problem than on the common patterns, because everything has a tradeoff. Want to have an immediate result? Modify your counter while writing the row. Can sacrifice speed, but want more counting opportunities? Go with offline distributed counting. Want to have a bit of both? Dispatch a message and react to it, keeping the processing logic and writes decoupled from the main application, which lets you care less about speed.

However, I may have missed the point somewhere (early morning, you know), so I may be wrong in any given statement.
Cheers

On Tue, Sep 25, 2012 at 6:53 AM, Roshni Rajagopal roshni_rajago...@hotmail.com wrote:

Thanks Milind, Has anyone implemented counting in a standard column family in Cassandra, where the count can be both incremented and decremented? Any comparisons in performance to using counter column families? Regards, Roshni

-- Date: Mon, 24 Sep 2012 11:02:51 -0700 Subject: RE: Cassandra Counters From: milindpar...@gmail.com To: user@cassandra.apache.org

IMO you would use Cassandra Counters (or another variation of distributed counting) once you have determined that a centralized version of counting is not going to work. You'd determine the infeasibility of centralized counting by figuring out the speed at which you need to sustain writes and reads and reconciling that with your hard disk seek times (essentially). Once you have proved that you can't do centralized counting, the second layer of the arsenal comes into play, which is distributed counting. In distributed counting, the CAP
Re: Cassandra Counters
@Sylvain and @Rohit: Thanks for your answers.

On Tue, Sep 25, 2012 at 2:27 PM, Sylvain Lebresne sylv...@datastax.com wrote:

> So general question, should I rely on Counters if I want 100% accuracy?

No. Even leaving potential bugs aside, counter increments are not idempotent: if you get a TimeoutException during a write (which can happen even in relatively normal conditions), you won't know whether the increment went in or not (and you have no way to know unless you have an external way to check the value). This is probably fine if you use counters for, say, real-time analytics, but not if you need 100% accuracy.

-- Sylvain
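The non-idempotency problem can be shown with a toy simulation in plain Python. This is only a sketch: FlakyCounter and increment_with_retry are hypothetical names, the counter lives in memory, and the "timeout" is modelled as an increment that is applied but never acknowledged. No Cassandra client is involved.

```python
class FlakyCounter:
    """Toy in-memory counter whose writes may time out after being applied."""

    def __init__(self, timeout_on_calls):
        self.value = 0
        self.calls = 0
        self.timeout_on_calls = set(timeout_on_calls)

    def increment(self, delta):
        self.calls += 1
        self.value += delta  # the write is applied...
        if self.calls in self.timeout_on_calls:
            # ...but the acknowledgement never reaches the client.
            raise TimeoutError("no acknowledgement")


def increment_with_retry(counter, delta, retries=1):
    """Naive client: retry the increment whenever it times out."""
    for attempt in range(retries + 1):
        try:
            counter.increment(delta)
            return
        except TimeoutError:
            if attempt == retries:
                raise


# The first call is applied but times out, so the client retries.
counter = FlakyCounter(timeout_on_calls=[1])
increment_with_retry(counter, 1)
```

The client intended a single +1, but the counter now holds 2. Not retrying has the opposite risk: if the timed-out write was never applied, the increment is lost. The client cannot tell the two cases apart, which is the point made above.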
Re: Code example for CompositeType.Builder and SSTableSimpleUnsortedWriter
Hey... From my understanding, there are several ways to use composites with SSTableSimpleUnsortedWriter, but which is the best? And as usual, code examples are welcome ;) Thanks in advance!

On Thu, Sep 20, 2012 at 11:23 PM, Edward Kibardin infa...@gmail.com wrote:

Hi Everyone, I'm writing a conversion tool from CSV files to SSTables using SSTableSimpleUnsortedWriter, and I'm unable to find a good example of using CompositeType.Builder with SSTableSimpleUnsortedWriter. It would also be great if someone had sample code for inserting/updating only a single value in a composite (if that's possible at all). A quick Google search didn't help me, so I'd really appreciate a correct sample ;) Thanks in advance, Ed
Code example for CompositeType.Builder and SSTableSimpleUnsortedWriter
Hi Everyone, I'm writing a conversion tool from CSV files to SSTables using SSTableSimpleUnsortedWriter, and I'm unable to find a good example of using CompositeType.Builder with SSTableSimpleUnsortedWriter. It would also be great if someone had sample code for inserting/updating only a single value in a composite (if that's possible at all). A quick Google search didn't help me, so I'd really appreciate a correct sample ;) Thanks in advance, Ed
Re: Why Cassandra secondary indexes are so slow on just 350k rows?
Thanks, guys, for the answers... The main issue here seems to be not the secondary index, but the speed of looking up random keys in a column family. I've done an experiment and queried the same 5000 rows without using the index, by providing a list of keys to Pycassa... the speed was the same. With SuperColumns, though, I can get the same 5000 rows (as SuperColumns) in about 1-2 seconds... That's understandable, as columns are stored sequentially. So here is the question: is it normal for Cassandra in general to take 20 seconds to look up 5000 rows, or is something wrong with my instance? Ed

On Thu, Aug 30, 2012 at 7:45 PM, Tyler Hobbs ty...@datastax.com wrote:

pycassa already breaks up the query into smaller chunks, but you should try playing with the buffer_size kwarg for get_indexed_slices, perhaps lowering it to ~300, as Aaron suggests: http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.get_indexed_slices

On Wed, Aug 29, 2012 at 11:40 PM, aaron morton aa...@thelastpickle.com wrote:

> *from 12 to 20 seconds (!!!) to find 5000 rows*.

More is not always better. Cassandra must materialise the full 5000 rows and send them all over the wire to be materialised on the other side. Try asking for a few hundred at a time and see how it goes. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com

On 29/08/2012, at 6:46 PM, Robin Verlangen ro...@us2.nl wrote:

@Edward: I think you should consider a queue for exporting the new rows. Just store the row key in a queue (you might want to look at http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Distributed-work-queues-td5226248.html ) and process that row every couple of minutes. Then manually delete columns from that queue row.
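Aaron's advice to ask for a few hundred rows at a time can be sketched in plain Python as below. The helper names and the fetch_rows callable are hypothetical: in real code fetch_rows would wrap whatever multi-key read your client offers, and here it is replaced by a fake so the sketch runs on its own. The chunk size of 300 follows the figure suggested in the thread and should be tuned by measurement.

```python
def chunked(keys, size):
    """Yield consecutive slices of `keys`, each at most `size` long."""
    for start in range(0, len(keys), size):
        yield keys[start:start + size]


def fetch_in_chunks(fetch_rows, keys, chunk_size=300):
    """Fetch rows for `keys` in small batches instead of one big request.

    `fetch_rows` is any callable taking a list of keys and returning a
    dict of {key: columns}; it stands in for the real client call.
    """
    rows = {}
    for chunk in chunked(keys, chunk_size):
        rows.update(fetch_rows(chunk))
    return rows


# Stand-in for a real read, recording how large each request was.
request_sizes = []


def fake_fetch(chunk):
    request_sizes.append(len(chunk))
    return {key: {"is_exported": "false"} for key in chunk}


keys = ["row%d" % i for i in range(1000)]
rows = fetch_in_chunks(fake_fetch, keys, chunk_size=300)
```

Smaller requests mean each response holds fewer rows that have to be materialised and sent at once, at the cost of more round trips.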
With kind regards, Robin Verlangen *Software engineer* W http://www.robinverlangen.nl E ro...@us2.nl

2012/8/29 Robin Verlangen ro...@us2.nl

> What this means is that eventually you will have 1 row in the secondary index table with 350K columns

Is this really true? I would have expected Cassandra to use internal index sharding/bucketing.

With kind regards, Robin Verlangen *Software engineer* W http://www.robinverlangen.nl E ro...@us2.nl

2012/8/29 Dave Brosius dbros...@mebigfatguy.com

If I understand you correctly, you are only ever querying for the rows where is_exported = false, and turning them into trues. What this means is that eventually you will have 1 row in the secondary index table with 350K columns that you will never look at. It seems to me that perhaps you should just maintain your own manual index CF that points to non-exported rows, and just delete those columns when they are exported.

On 08/28/2012 05:23 PM, Edward Kibardin wrote:

I have a column family with a secondary index.
The secondary index is basically a binary field, but I'm using a string for it. The field is called *is_exported* and can be *'true'* or *'false'*. After request all loaded rows are updated with *is_exported = 'false'*. I'm polling this column family every ten minutes and exporting new rows as they appear.

But here is the problem: I'm seeing that the time for this query grows roughly linearly with the amount of data in the column family, and currently it takes *from 12 to 20 seconds (!!!) to find 5000 rows*. From my understanding, an indexed request should not depend on the number of rows in the CF but on the number of rows per index value (cardinality), as it's just another hidden CF like:

true : rowKey1 rowKey2 rowKey3 ...
false: rowKey1 rowKey2 rowKey3 ...

I'm using Pycassa to query the data; here is the code I'm using:

column_family = pycassa.ColumnFamily(cassandra_pool, column_family_name, read_consistency_level=2)
is_exported_expr = create_index_expression('is_exported', 'false')
clause = create_index_clause([is_exported_expr], count = 5000