pycassa already breaks up the query into smaller chunks, but you should try
playing with the buffer_size kwarg for get_indexed_slices, perhaps lowering
it to ~300, as Aaron suggests:
http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.get_indexed_slices
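
For example, a minimal sketch (the keyspace and column family names below are
placeholders, not from the original thread):

    # Lower buffer_size so each underlying Thrift call fetches fewer rows.
    import pycassa
    from pycassa.index import create_index_clause, create_index_expression

    pool = pycassa.ConnectionPool('my_keyspace')            # placeholder
    cf = pycassa.ColumnFamily(pool, 'my_column_family')     # placeholder

    expr = create_index_expression('is_exported', 'false')
    clause = create_index_clause([expr], count=5000)

    # get_indexed_slices returns a generator; buffer_size caps the page size
    # of each round trip while still yielding up to `count` rows overall.
    for key, columns in cf.get_indexed_slices(clause, buffer_size=300):
        pass  # process the row here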

On Wed, Aug 29, 2012 at 11:40 PM, aaron morton <aa...@thelastpickle.com> wrote:

>  *from 12 to 20 seconds (!!!) to find 5000 rows*.
>
> More is not always better.
>
> Cassandra must materialise the full 5000 rows and send them all over the
> wire to be materialised on the other side. Try asking for a few hundred at
> a time and see how it goes.
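>
> A rough sketch of one way to do that with pycassa, paging on start_key
> (names here are placeholders; the first row of each page after the first
> is skipped because the start key itself is returned again):
>
>     import pycassa
>     from pycassa.index import create_index_clause, create_index_expression
>
>     pool = pycassa.ConnectionPool('my_keyspace')           # placeholder
>     cf = pycassa.ColumnFamily(pool, 'my_column_family')    # placeholder
>
>     expr = create_index_expression('is_exported', 'false')
>     start_key = ''
>     while True:
>         clause = create_index_clause([expr], start_key=start_key, count=500)
>         rows = list(cf.get_indexed_slices(clause))
>         if start_key:
>             rows = rows[1:]          # the start key comes back again; drop it
>         if not rows:
>             break
>         for key, columns in rows:
>             pass                     # export the row here
>         start_key = rows[-1][0]      # resume from the last key seen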
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 29/08/2012, at 6:46 PM, Robin Verlangen <ro...@us2.nl> wrote:
>
> @Edward: I think you should consider a queue for exporting the new rows.
> Just store the row key in a queue row (you might want to look at
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Distributed-work-queues-td5226248.html )
> and process that queue every couple of minutes, then manually delete the
> processed columns from the queue row.
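>
> Roughly, the consuming side could look like this (just a sketch; the
> 'export_queue' CF and the 'pending' row key are made-up names):
>
>     import pycassa
>
>     pool = pycassa.ConnectionPool('my_keyspace')             # placeholder
>     queue_cf = pycassa.ColumnFamily(pool, 'export_queue')    # made-up queue CF
>
>     # Each column name in the 'pending' row is a row key waiting to be
>     # exported. get() raises pycassa.NotFoundException if the row is empty.
>     pending = queue_cf.get('pending', column_count=5000)
>     for row_key in pending:
>         pass  # export the row identified by row_key here
>
>     # manually delete the processed columns from the queue row
>     queue_cf.remove('pending', columns=list(pending))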
>
> With kind regards,
>
> Robin Verlangen
> *Software engineer*
>
> W http://www.robinverlangen.nl
> E ro...@us2.nl
>
>
>
>
> 2012/8/29 Robin Verlangen <ro...@us2.nl>
>
>> "What this means is that eventually you will have 1 row in the secondary
>> index table with 350K columns"
>>
>> Is this really true? I would have expected Cassandra to use internal
>> index sharding/bucketing.
>>
>> With kind regards,
>>
>> Robin Verlangen
>> *Software engineer*
>>
>> W http://www.robinverlangen.nl
>> E ro...@us2.nl
>>
>>
>>
>>
>> 2012/8/29 Dave Brosius <dbros...@mebigfatguy.com>
>>
>>> If I understand you correctly, you are only ever querying for the rows
>>> where is_exported = false, and turning them into trues. What this means is
>>> that eventually you will have 1 row in the secondary index table with 350K
>>> columns that you will never look at.
>>>
>>> It seems to me that perhaps you should just keep your own "manual index"
>>> CF that points to non-exported rows, and delete those columns when they
>>> are exported.
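>>>
>>> As a rough sketch of that idea (the 'to_export' CF and the row/key names
>>> below are just illustrative):
>>>
>>>     import pycassa
>>>
>>>     pool = pycassa.ConnectionPool('my_keyspace')            # placeholder
>>>     data_cf = pycassa.ColumnFamily(pool, 'my_data')         # placeholder
>>>     pending_cf = pycassa.ColumnFamily(pool, 'to_export')    # manual index CF
>>>
>>>     # when writing a data row, also record its key in the manual index
>>>     row_key = 'some-row-key'                                 # example key
>>>     data_cf.insert(row_key, {'payload': '...'})              # example columns
>>>     pending_cf.insert('pending', {row_key: ''})
>>>
>>>     # once the row has been exported, delete its column from the index row
>>>     pending_cf.remove('pending', columns=[row_key])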
>>>
>>>
>>>
>>> On 08/28/2012 05:23 PM, Edward Kibardin wrote:
>>>
>>>> I have a column family with a secondary index. The indexed field is
>>>> basically a binary flag, but I'm using a string for it. The field is called
>>>> *is_exported* and can be *'true'* or *'false'*. After each request, all
>>>> loaded rows are updated with *is_exported = 'true'*.
>>>>
>>>> I'm polling this column family every ten minutes and exporting new rows
>>>> as they appear.
>>>>
>>>> But here's the problem: I'm seeing that the time for this query grows pretty
>>>> much linearly with the amount of data in the column family, and currently it
>>>> takes *from 12 to 20 seconds (!!!) to find 5000 rows*. From my understanding,
>>>> an indexed request should not depend on the number of rows in the CF but on
>>>> the number of rows per index value (cardinality), as the index is just
>>>> another hidden CF like:
>>>>
>>>>         "true" : rowKey1 rowKey2 rowKey3 ...
>>>>         "false": rowKey1 rowKey2 rowKey3 ...
>>>>
>>>> I'm using pycassa to query the data; here is the code I'm using:
>>>>
>>>>         column_family = pycassa.ColumnFamily(cassandra_pool,
>>>>             column_family_name, read_consistency_level=2)
>>>>         is_exported_expr = create_index_expression('is_exported', 'false')
>>>>         clause = create_index_clause([is_exported_expr], count=5000)
>>>>         column_family.get_indexed_slices(clause)
>>>>
>>>> Am I doing something wrong? I expect this operation to work MUCH faster.
>>>>
>>>> Any ideas or suggestions?
>>>>
>>>> Some config info:
>>>>  - Cassandra 1.1.0
>>>>  - RandomPartitioner
>>>>  - I have 2 nodes and replication_factor = 2 (each server has a full
>>>> data copy)
>>>>  - Using AWS EC2, large instances
>>>>  - Software raid0 on ephemeral drives
>>>>
>>>> Thanks in advance!
>>>>
>>>>
>>>
>>
>
>


-- 
Tyler Hobbs
DataStax <http://datastax.com/>
