pycassa already breaks up the query into smaller chunks, but you should try playing with the buffer_size kwarg for get_indexed_slices, perhaps lowering it to ~300, as Aaron suggests: http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.get_indexed_slices
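The buffer_size suggestion works because get_indexed_slices returns a generator that fetches results in pages rather than in one huge Thrift response. The effect can be illustrated with a rough pure-Python sketch of that paging pattern (a toy illustration only, not pycassa's actual implementation; fetch_page / fake_fetch are made-up names):

```python
def fetch_in_pages(fetch_page, total_count, buffer_size=300):
    """Yield up to total_count items, requesting at most buffer_size
    per underlying call. fetch_page(start, n) returns a list of items
    (an empty list when the source is exhausted)."""
    fetched = 0
    while fetched < total_count:
        page = fetch_page(fetched, min(buffer_size, total_count - fetched))
        if not page:
            break  # no more rows on the server side
        for item in page:
            yield item
        fetched += len(page)

# Toy backend standing in for Cassandra: 1000 "rows".
rows = list(range(1000))

def fake_fetch(start, n):
    return rows[start:start + n]

# Ask for 5000 rows, 300 at a time; stops early when the backend runs out.
result = list(fetch_in_pages(fake_fetch, 5000, buffer_size=300))
```

With pycassa the analogous call would simply be `cf.get_indexed_slices(clause, buffer_size=300)`; a smaller page means each Thrift round trip materialises fewer rows at once, which is exactly Aaron's "ask for a few hundred at a time" advice.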
On Wed, Aug 29, 2012 at 11:40 PM, aaron morton <aa...@thelastpickle.com> wrote:

> *from 12 to 20 seconds (!!!) to find 5000 rows*
>
> More is not always better.
>
> Cassandra must materialise the full 5000 rows and send them all over the
> wire to be materialised on the other side. Try asking for a few hundred at
> a time and see how it goes.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 29/08/2012, at 6:46 PM, Robin Verlangen <ro...@us2.nl> wrote:
>
> @Edward: I think you should consider a queue for exporting the new rows.
> Just store the row key in a queue (you might want to look at
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Distributed-work-queues-td5226248.html)
> and process that row every couple of minutes. Then manually delete the
> columns from that queue row.
>
> With kind regards,
>
> Robin Verlangen
> *Software engineer*
>
> W http://www.robinverlangen.nl
> E ro...@us2.nl
>
> Disclaimer: The information contained in this message and attachments is
> intended solely for the attention and use of the named addressee and may be
> confidential. If you are not the intended recipient, you are reminded that
> the information remains the property of the sender. You must not use,
> disclose, distribute, copy, print or rely on this e-mail. If you have
> received this message in error, please contact the sender immediately and
> irrevocably delete this message and any copies.
>
>
> 2012/8/29 Robin Verlangen <ro...@us2.nl>
>
>> "What this means is that eventually you will have 1 row in the secondary
>> index table with 350K columns"
>>
>> Is this really true? I would have expected Cassandra to use internal
>> index sharding/bucketing?
>>
>> With kind regards,
>>
>> Robin Verlangen
>> *Software engineer*
>>
>> W http://www.robinverlangen.nl
>> E ro...@us2.nl
>>
>>
>> 2012/8/29 Dave Brosius <dbros...@mebigfatguy.com>
>>
>>> If I understand you correctly, you are only ever querying for the rows
>>> where is_exported = false and turning them into trues. What this means
>>> is that eventually you will have one row in the secondary index table
>>> with 350K columns that you will never look at.
>>>
>>> It seems to me that perhaps you should just keep your own "manual
>>> index" CF that points to non-exported rows, and delete those columns
>>> when the rows are exported.
>>>
>>>
>>> On 08/28/2012 05:23 PM, Edward Kibardin wrote:
>>>
>>>> I have a column family with a secondary index. The secondary index is
>>>> basically a binary field, but I'm using a string for it. The field is
>>>> called *is_exported* and can be *'true'* or *'false'*. After each
>>>> request, all loaded rows are updated with *is_exported = 'true'*.
>>>>
>>>> I'm polling this column family every ten minutes and exporting the new
>>>> rows as they appear.
>>>>
>>>> But here's the problem: the time for this query grows roughly linearly
>>>> with the amount of data in the column family, and it currently takes
>>>> *from 12 to 20 seconds (!!!) to find 5000 rows*.
>>>> From my understanding, an indexed request should not depend on the
>>>> number of rows in the CF but on the number of rows per index value
>>>> (its cardinality), as the index is just another hidden CF like:
>>>>
>>>>     "true" : rowKey1 rowKey2 rowKey3 ...
>>>>     "false": rowKey1 rowKey2 rowKey3 ...
>>>>
>>>> I'm using Pycassa to query the data; here's the code I'm using:
>>>>
>>>>     column_family = pycassa.ColumnFamily(cassandra_pool,
>>>>         column_family_name, read_consistency_level=2)
>>>>     is_exported_expr = create_index_expression('is_exported', 'false')
>>>>     clause = create_index_clause([is_exported_expr], count=5000)
>>>>     column_family.get_indexed_slices(clause)
>>>>
>>>> Am I doing something wrong? I expected this operation to work MUCH
>>>> faster.
>>>>
>>>> Any ideas or suggestions?
>>>>
>>>> Some config info:
>>>> - Cassandra 1.1.0
>>>> - RandomPartitioner
>>>> - 2 nodes with replication_factor = 2 (each server has a full data copy)
>>>> - AWS EC2 large instances
>>>> - Software RAID 0 on ephemeral drives
>>>>
>>>> Thanks in advance!


-- 
Tyler Hobbs
DataStax <http://datastax.com/>
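Dave's "manual index" idea amounts to keeping a dedicated row (or CF) whose columns are the keys of not-yet-exported rows, and deleting columns outright as rows are exported, instead of flipping a low-cardinality secondary-index value. In pycassa this would map roughly to insert/get/remove calls on a queue CF; the sketch below is a minimal in-memory model of that bookkeeping (all names here are illustrative, not from the thread):

```python
class ManualExportIndex:
    """Toy model of a 'manual index' row: columns are the row keys still
    waiting to be exported; exported keys are deleted outright rather
    than rewritten as is_exported = 'true'."""

    def __init__(self):
        self.pending = {}  # column name -> column value (e.g. a timestamp)

    def mark_pending(self, row_key):
        # roughly: queue_cf.insert('to_export', {row_key: ''})
        self.pending[row_key] = ''

    def take_batch(self, n):
        # roughly: queue_cf.get('to_export', column_count=n)
        return sorted(self.pending)[:n]

    def mark_exported(self, row_keys):
        # roughly: queue_cf.remove('to_export', columns=row_keys)
        for key in row_keys:
            self.pending.pop(key, None)

idx = ManualExportIndex()
for i in range(10):
    idx.mark_pending('row%03d' % i)

batch = idx.take_batch(5)   # grab the first five pending keys
idx.mark_exported(batch)    # delete them; five keys remain pending
```

One caveat worth noting: deleting columns from a single hot queue row still leaves tombstones behind until gc_grace expires, which is the pitfall discussed in the distributed-work-queues thread Robin links above.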