Re: Once again, super columns or composites?

2012-09-27 Thread Edward Kibardin
Oh... Sylvain, thanks a lot for such a complete answer.

Yeah, I understand my mistake in my assumptions about composites.
It seems composites are pretty much an advanced version of manually
joining keys into a string column name: key1:key2

Thanks a lot!
Ed

On Thu, Sep 27, 2012 at 2:02 PM, Sylvain Lebresne sylv...@datastax.com wrote:

  But from my understanding, you just can't update composite column, only
  delete and insert... so this may make my update use case much more
  complicated.

 Let me try to sum things up.
 In regular column families, a column (value) is defined by 2 keys: the
 row key and the column name.
 In super column families, a column (value) is defined by 3 keys: the
 row key, the super column name and the column name.

 So a super column is really just the set of columns that share the
 same (row key, super column name) pair.

 The idea of composite columns is to use regular columns, but to
 distinguish multiple parts of the column name. Take the example of a
 CompositeType with 2 components: in that column family, a column
 (value) is defined by 3 keys: the row key, the first component of the
 column name and the second component of the column name.

 In other words, composites are a *generalization* of super columns,
 and super columns are the special case of composites with 2
 components. Except that super columns are hard-wired into the
 Cassandra code base in a way that comes with a number of limitations,
 the main one being that we always deserialize a super column (again,
 which is just a set of columns) in its entirety when we read it from
 disk.

 So no, it's not true that "you just can't update composite column,
 only delete and insert", nor that it is impossible to add a
 sub-column to your composite.
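
 To make that concrete, here is a minimal sketch of writing (or
 overwriting) a single sub-column through the raw Thrift API with
 Cassandra 1.1's CompositeType. Class names are as of 1.1
 (org.apache.cassandra.db.marshal, org.apache.cassandra.utils and
 org.apache.cassandra.thrift), "client" is an already-connected
 Cassandra.Client, and the row/CF/value names are made up; treat this
 as an illustration, not the canonical way:

     // (imports from org.apache.cassandra.thrift.*, java.util.Arrays,
     //  java.nio.ByteBuffer, CompositeType/UTF8Type/AbstractType and
     //  ByteBufferUtil assumed)

     // Build the composite column name (superColumnName, columnName):
     CompositeType type = CompositeType.getInstance(
         Arrays.<AbstractType<?>>asList(UTF8Type.instance, UTF8Type.instance));
     ByteBuffer name = new CompositeType.Builder(type)
         .add(ByteBufferUtil.bytes("superColumnName"))
         .add(ByteBufferUtil.bytes("columnName"))
         .build();

     // A plain insert on that name writes (or overwrites) exactly this
     // one sub-column; the other sub-columns are untouched.
     Column col = new Column(name)
         .setValue(ByteBufferUtil.bytes("new value"))
         .setTimestamp(System.currentTimeMillis() * 1000);
     client.insert(ByteBufferUtil.bytes("rowKey"),
                   new ColumnParent("my_cf"), col, ConsistencyLevel.QUORUM);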

 That being said, if you are using the thrift interface, super columns
 do have a few perks currently:
   - the grouping of all the sub-columns composing a super column is
 hard-wired in Cassandra. The equivalent for composites, which consists
 in grouping all columns having the same value for a given component,
 must be done client side (see the sketch after this list). Some client
 libraries may do that for you, but I'm not sure (I don't know about
 Pycassa, for instance).
   - there are a few queries that can be easily done with super columns
 that don't translate easily to composites, namely deleting whole super
 columns and, to a lesser extent, querying multiple super columns by
 name. That's due to a few limitations that upcoming versions of
 Cassandra will solve, but that currently released versions do not.
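
 For instance, the client-side equivalent of reading one whole super
 column is a slice over all the columns sharing the first component.
 A sketch, reusing "type" and "client" from the sketch above
 (Builder.buildAsEndOfRange() is the 1.1 API; treat its availability
 in your version as an assumption):

     // All columns whose composite name starts with "sc1" (i.e. the
     // equivalent of fetching one super column by name):
     ByteBuffer start = new CompositeType.Builder(type)
         .add(ByteBufferUtil.bytes("sc1")).build();
     ByteBuffer finish = new CompositeType.Builder(type)
         .add(ByteBufferUtil.bytes("sc1")).buildAsEndOfRange();
     SlicePredicate predicate = new SlicePredicate().setSlice_range(
         new SliceRange(start, finish, false, Integer.MAX_VALUE));
     List<ColumnOrSuperColumn> group = client.get_slice(
         ByteBufferUtil.bytes("rowKey"), new ColumnParent("my_cf"),
         predicate, ConsistencyLevel.QUORUM);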

 The bottom line is: if you can do without those few perks, then you'd
 better use composites since they have fewer limitations. If you can't
 really do without those perks and can live with the super column
 limitations, then go on, use super columns. (And if you want the perks
 without the limitations, wait for Cassandra 1.2 and use CQL3 :D)


  ... and as far as I know, DynamicComposites are not recommended (and actually
  not supported by Pycassa).

 DynamicComposites don't do what you think they do. They do nothing
 more than regular composites as far as the comparison with
 SuperColumns is concerned, except give you ways to shoot yourself in
 the foot.

 --
 Sylvain



Re: Cassandra Counters

2012-09-25 Thread Edward Kibardin
I've recently noticed several threads about Cassandra
Counters inconsistencies and started seriously thinking about possible
workarounds, like storing realtime counters in Redis and dumping them daily to
Cassandra.
So, general question: should I rely on Counters if I want 100% accuracy?

Thanks, Ed

On Tue, Sep 25, 2012 at 8:15 AM, Robin Verlangen ro...@us2.nl wrote:

 From my point of view, another problem with using a standard column
 family for counting is transactions. Cassandra lacks them, so if you're
 updating counters from multiple threads, how will you keep track of that? Yes,
 I'm aware of software like ZooKeeper to handle that, however I'm not sure
 whether that's the best option.
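
 To make the problem concrete, here is a sketch of the lost-update race
 with a plain read-modify-write on a regular column (raw Thrift API;
 "client" is an assumed connected Cassandra.Client, and the CF/column
 names are made up):

     // (imports from org.apache.cassandra.thrift.* and
     //  org.apache.cassandra.utils.ByteBufferUtil assumed)

     // Two clients run this concurrently: both read 10, both write 11,
     // and one of the two increments is silently lost.
     ColumnPath path = new ColumnPath("counts")
         .setColumn(ByteBufferUtil.bytes("hits"));
     long current = ByteBufferUtil.toLong(
         client.get(ByteBufferUtil.bytes("page1"), path,
                    ConsistencyLevel.QUORUM).getColumn().value);
     Column updated = new Column(ByteBufferUtil.bytes("hits"))
         .setValue(ByteBufferUtil.bytes(current + 1))
         .setTimestamp(System.currentTimeMillis() * 1000);
     client.insert(ByteBufferUtil.bytes("page1"),
                   new ColumnParent("counts"), updated,
                   ConsistencyLevel.QUORUM);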

 I think you should stick with Cassandra counter column families.

 Best regards,

 Robin Verlangen
 Software engineer
 W http://www.robinverlangen.nl
 E ro...@us2.nl

 http://goo.gl/Lt7BC




 2012/9/25 Roshni Rajagopal roshni_rajago...@hotmail.com

  Thanks for the reply and sorry for being bull-headed.

 Once you're past the stage where you've decided it's distributed, and
 NoSQL, and Cassandra out of all the NoSQL options,
 now to count something, you can do it in different ways in Cassandra.
 In all of these ways you want to use Cassandra's best features:
 availability, tunable consistency, partition tolerance etc.

 Given this, what are the performance tradeoffs of using counters vs a
 standard column family for counting? Because as I see it, if the counter value
 in a counter column family becomes wrong, it will not be 'eventually
 consistent'; you will need intervention to correct it. So the key aspect
 is how much faster a counter column family would be, and at what numbers do
 we start seeing a difference.





 --
 Date: Tue, 25 Sep 2012 07:57:08 +0200
 Subject: Re: Cassandra Counters
 From: oleksandr.pet...@gmail.com
 To: user@cassandra.apache.org


 Maybe I'm missing the point, but counting in a standard column family
 would be a little overkill.

 I assume that distributed counting here was more of a map/reduce
 approach, where Hadoop (+ Cascading, Pig, Hive, Cascalog) would help you a
 lot. We're doing some more complex counting (e.g. based on sets of rules)
 like that. Of course, that would perform _way_ slower than counting
 beforehand. On the plus side, you will always have a consistent result for
 a consistent dataset.

 On the other hand, if you use things like AMQP or Storm (sorry to put
 my sentence together like that, as these tools are mostly either orthogonal or
 complementary, but I hope you get my point), you could build a topology
 that makes fault-tolerant writes independently of your original write. Of
 course, it would still have a consistency tradeoff, mostly because of race
 conditions, different network latencies etc.

 So I would say that building a data model in a distributed system often
 depends more on your problem than on the common patterns, because
 everything has a tradeoff.

 Want an immediate result? Modify your counter while writing the
 row.
 Can sacrifice speed, but want more counting opportunities? Go with
 offline distributed counting.
 Want kind of both? Dispatch a message and react upon it, keeping
 the processing logic and writes decoupled from the main application, allowing
 you to care less about speed.

 However, I may have missed the point somewhere (early morning, you know),
 so I may be wrong in any given statement.
 Cheers


 On Tue, Sep 25, 2012 at 6:53 AM, Roshni Rajagopal 
 roshni_rajago...@hotmail.com wrote:

  Thanks Milind,

 Has anyone implemented counting in a standard column family in Cassandra,
 where you can have increments and decrements to the count?
 Any comparisons in performance to using counter column families?

 Regards,
 Roshni


 --
 Date: Mon, 24 Sep 2012 11:02:51 -0700
 Subject: RE: Cassandra Counters
 From: milindpar...@gmail.com
 To: user@cassandra.apache.org


 IMO
 You would use Cassandra Counters (or another variation of distributed
 counting) once you have determined that a centralized version of
 counting is not going to work.
 You'd determine the non-feasibility of centralized counting by figuring
 out the speed at which you need to sustain writes and reads, and reconciling
 that with your hard disk seek times (essentially).
 Once you have proved that you can't do centralized counting, the second
 layer of the arsenal comes into play: distributed counting.
 In distributed counting, the CAP 

Re: Cassandra Counters

2012-09-25 Thread Edward Kibardin
@Sylvain and @Rohit: Thanks for your answers.


On Tue, Sep 25, 2012 at 2:27 PM, Sylvain Lebresne sylv...@datastax.com wrote:

 So, general question: should I rely on Counters if I want 100% accuracy?


 No.

  Even leaving potential bugs aside, counters are not idempotent, so if
 you get a TimeoutException during a write (which can happen even in
 relatively normal conditions), you won't know whether the increment went in or
 not (and you have no way to know, unless you have an external way to check
 the value). This is probably fine if you use counters for, say, real-time
 analytics, but not if you need 100% accuracy.
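
 In code, the ambiguity looks like this (raw Thrift API; "client" is an
 assumed connected Cassandra.Client, and the CF and column names are
 made up):

     // (imports from org.apache.cassandra.thrift.* and
     //  org.apache.cassandra.utils.ByteBufferUtil assumed)
     CounterColumn incr = new CounterColumn(ByteBufferUtil.bytes("page_views"), 1L);
     try {
         client.add(ByteBufferUtil.bytes("page1"),
                    new ColumnParent("counters"), incr,
                    ConsistencyLevel.QUORUM);
     } catch (TimedOutException e) {
         // The increment may or may not have been applied. Retrying can
         // double-count, and not retrying can under-count; unlike a normal
         // column write, there is no idempotent way to replay it.
     }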

 --
 Sylvain



Re: Code example for CompositeType.Builder and SSTableSimpleUnsortedWriter

2012-09-24 Thread Edward Kibardin
Hey...

From my understanding, there are several ways to use composites
with SSTableSimpleUnsortedWriter, but which is the best?
And as usual, code examples are welcome ;)

Thanks in advance!

On Thu, Sep 20, 2012 at 11:23 PM, Edward Kibardin infa...@gmail.com wrote:

 Hi Everyone,

 I'm writing a conversion tool from CSV files to SSTables
 using SSTableSimpleUnsortedWriter, and I'm unable to find a good example of
 using CompositeType.Builder with SSTableSimpleUnsortedWriter.
 It would also be great if someone had sample code for inserting/updating only
 a single value in composites (if that is possible at all).

 A quick Google search didn't help me, so I'd be very appreciative of a
 correct sample ;)

 Thanks in advance,
 Ed
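
For reference, a minimal sketch of what this could look like with the
Cassandra 1.1 bulk-loading API (the class names and the seven-argument
SSTableSimpleUnsortedWriter constructor are as of 1.1; the keyspace,
column family, path and values are made up, so adapt them to your
setup). Writing a single composite value is just one addColumn() call
with the fully built name:

    import java.io.File;
    import java.nio.ByteBuffer;
    import java.util.Arrays;
    import org.apache.cassandra.db.marshal.AbstractType;
    import org.apache.cassandra.db.marshal.CompositeType;
    import org.apache.cassandra.db.marshal.UTF8Type;
    import org.apache.cassandra.dht.RandomPartitioner;
    import org.apache.cassandra.io.sstable.SSTableSimpleUnsortedWriter;
    import org.apache.cassandra.utils.ByteBufferUtil;

    public class CsvToSSTable
    {
        public static void main(String[] args) throws Exception
        {
            // Column names are 2-component composites, e.g. (category, field).
            CompositeType comparator = CompositeType.getInstance(
                Arrays.<AbstractType<?>>asList(UTF8Type.instance, UTF8Type.instance));

            SSTableSimpleUnsortedWriter writer = new SSTableSimpleUnsortedWriter(
                new File("/tmp/sstables/myks/mycf"), new RandomPartitioner(),
                "myks", "mycf", comparator, null, 64); // 64 MB in-memory buffer

            long timestamp = System.currentTimeMillis() * 1000;
            writer.newRow(ByteBufferUtil.bytes("rowkey1"));
            // One column per CSV field; the composite name is built
            // component by component:
            ByteBuffer name = new CompositeType.Builder(comparator)
                .add(ByteBufferUtil.bytes("category1"))
                .add(ByteBufferUtil.bytes("field1"))
                .build();
            writer.addColumn(name, ByteBufferUtil.bytes("value1"), timestamp);
            writer.close(); // flushes the buffered rows to sstable files
        }
    }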



Code example for CompositeType.Builder and SSTableSimpleUnsortedWriter

2012-09-20 Thread Edward Kibardin
Hi Everyone,

I'm writing a conversion tool from CSV files to SSTables
using SSTableSimpleUnsortedWriter, and I'm unable to find a good example of
using CompositeType.Builder with SSTableSimpleUnsortedWriter.
It would also be great if someone had sample code for inserting/updating only
a single value in composites (if that is possible at all).

A quick Google search didn't help me, so I'd be very appreciative of a
correct sample ;)

Thanks in advance,
Ed


Re: Why Cassandra secondary indexes are so slow on just 350k rows?

2012-08-30 Thread Edward Kibardin
Thanks Guys for the answers...

The main issue here seems to be not the secondary index, but the speed of
searching for random keys in a column family.
I've done the experiment and queried the same 5000 rows not using the index but
providing a list of keys to Pycassa... the speed was the same.

Although, using SuperColumns I can get the same 5000 rows (SuperColumns)
in 1-2 seconds... That's understandable, as their columns are stored sequentially.

So here's the question: is it normal for Cassandra in general to take 20 seconds
to search for 5000 rows, or is something just wrong with my instance?

Ed


On Thu, Aug 30, 2012 at 7:45 PM, Tyler Hobbs ty...@datastax.com wrote:

 pycassa already breaks up the query into smaller chunks, but you should
 try playing with the buffer_size kwarg for get_indexed_slices, perhaps
 lowering it to ~300, as Aaron suggests:
 http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.get_indexed_slices


 On Wed, Aug 29, 2012 at 11:40 PM, aaron morton aa...@thelastpickle.com wrote:

  from 12 to 20 seconds (!!!) to find 5000 rows.

 More is not always better.

 Cassandra must materialise the full 5000 rows and send them all over the
 wire to be materialised on the other side. Try asking for a few hundred at
 a time and see how it goes.

 Cheers

   -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 29/08/2012, at 6:46 PM, Robin Verlangen ro...@us2.nl wrote:

 @Edward: I think you should consider a queue for exporting the new rows.
 Just store the rowkey in a queue (you might want to consider looking at
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Distributed-work-queues-td5226248.html
  )
 and process that row every couple of minutes. Then manually delete columns
 from that queue-row.

 With kind regards,

 Robin Verlangen
 Software engineer
 W http://www.robinverlangen.nl
 E ro...@us2.nl




 2012/8/29 Robin Verlangen ro...@us2.nl

 What this means is that eventually you will have 1 row in the
 secondary index table with 350K columns

 Is this really true? I would have expected Cassandra to use internal
 index sharding/bucketing.

 With kind regards,

 Robin Verlangen
 Software engineer
 W http://www.robinverlangen.nl
 E ro...@us2.nl




 2012/8/29 Dave Brosius dbros...@mebigfatguy.com

 If I understand you correctly, you are only ever querying for the rows
 where is_exported = false and flipping them to true. What this means is
 that eventually you will have 1 row in the secondary index table with 350K
 columns that you will never look at.

 It seems to me that perhaps you should just keep your own manual
 index CF that points to non-exported rows, and delete those columns
 when they are exported (see the sketch below).
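
 A sketch of that manual index (raw Thrift API; "client" is an assumed
 connected Cassandra.Client, the CF and row names are made up, and
 rowKey is the ByteBuffer key of the data row): one wide queue row
 holds the keys of the not-yet-exported rows, and each column is
 deleted once its row has been exported:

     // (imports from org.apache.cassandra.thrift.* and
     //  org.apache.cassandra.utils.ByteBufferUtil assumed)
     ByteBuffer queueRow = ByteBufferUtil.bytes("pending");

     // When a new data row arrives, record its key in the queue row:
     Column marker = new Column(rowKey)
         .setValue(ByteBufferUtil.EMPTY_BYTE_BUFFER)
         .setTimestamp(System.currentTimeMillis() * 1000);
     client.insert(queueRow, new ColumnParent("export_queue"),
                   marker, ConsistencyLevel.QUORUM);

     // After exporting the row, remove its key from the queue:
     client.remove(queueRow,
                   new ColumnPath("export_queue").setColumn(rowKey),
                   System.currentTimeMillis() * 1000,
                   ConsistencyLevel.QUORUM);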



 On 08/28/2012 05:23 PM, Edward Kibardin wrote:

 I have a column family with a secondary index. The indexed field is
 basically a binary flag, but I'm using a string for it. The field is
 called is_exported and can be 'true' or 'false'. After each export, all
 loaded rows are updated with is_exported = 'true'.

 I'm polling this column family every ten minutes and exporting new rows
 as they appear.

 But here's the problem: I'm seeing that the time for this query grows pretty
 linearly with the amount of data in the column family, and currently it takes
 from 12 to 20 seconds (!!!) to find 5000 rows. From my understanding, an
 indexed request should not depend on the number of rows in the CF, but on the
 number of rows per index value (cardinality), as it's just another hidden CF
 like:

 true : rowKey1 rowKey2 rowKey3 ...
 false: rowKey1 rowKey2 rowKey3 ...

 I'm using Pycassa to query the data; here's the code I'm using:

     from pycassa.index import create_index_expression, create_index_clause

     column_family = pycassa.ColumnFamily(cassandra_pool, column_family_name,
                                          read_consistency_level=2)
     is_exported_expr = create_index_expression('is_exported', 'false')
     clause = create_index_clause([is_exported_expr], count=5000)