Re: Why Cassandra secondary indexes are so slow on just 350k rows?

Edward Kibardin Thu, 30 Aug 2012 13:14:59 -0700

Thanks, guys, for the answers.

The main issue here seems to be not the secondary index but the speed of
looking up random keys in the column family.
I ran an experiment and queried the same 5000 rows without using the index,
passing a list of keys to Pycassa instead; the speed was the same.
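
For reference, the batched version of that key lookup can be sketched
roughly as below. This is only a sketch under assumptions: `fetch` is a
hypothetical stand-in for whatever call actually reads the rows (for
example a multiget on the column family), `chunked` and `fetch_in_batches`
are illustrative helper names, and the batch size of 300 is just the figure
suggested earlier in the thread, not a measured optimum.

```python
from itertools import islice


def chunked(keys, size):
    """Yield successive lists of at most `size` keys."""
    it = iter(keys)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def fetch_in_batches(fetch, keys, batch_size=300):
    """Call `fetch` once per batch and merge the per-batch results.

    `fetch` is a hypothetical callable that takes a list of row keys
    and returns a dict mapping each key to its columns.
    """
    rows = {}
    for batch in chunked(keys, batch_size):
        rows.update(fetch(batch))
    return rows


if __name__ == "__main__":
    # Fake in-memory "column family" so the sketch runs without a cluster.
    store = {"key%d" % i: {"is_exported": "false"} for i in range(5000)}
    calls = []

    def fake_fetch(batch):
        calls.append(len(batch))  # record the size of every request
        return {k: store[k] for k in batch}

    rows = fetch_in_batches(fake_fetch, list(store), batch_size=300)
    print(len(rows), len(calls))
```

Timing each `fetch` call separately would show whether the 20 seconds is
spread evenly across batches or dominated by a few slow requests.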


However, using SuperColumns I can get the same 5000 rows (as SuperColumns)
in about 1-2 seconds. That's understandable, as columns are stored
sequentially.

So here is the question: is it normal for Cassandra in general to take 20
seconds to look up 5000 rows, or is something wrong with my instance?

Ed


On Thu, Aug 30, 2012 at 7:45 PM, Tyler Hobbs <ty...@datastax.com> wrote:

> pycassa already breaks up the query into smaller chunks, but you should
> try playing with the buffer_size kwarg for get_indexed_slices, perhaps
> lowering it to ~300, as Aaron suggests:
> http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.get_indexed_slices
>
>
> On Wed, Aug 29, 2012 at 11:40 PM, aaron morton <aa...@thelastpickle.com> wrote:
>
>>  *from 12 to 20 seconds (!!!) to find 5000 rows*.
>>
>> More is not always better.
>>
>> Cassandra must materialise the full 5000 rows and send them all over the
>> wire to be materialised on the other side. Try asking for a few hundred at
>> a time and see how it goes.
>>
>> Cheers
>>
>>   -----------------
>> Aaron Morton
>> Freelance Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 29/08/2012, at 6:46 PM, Robin Verlangen <ro...@us2.nl> wrote:
>>
>> @Edward: I think you should consider a queue for exporting the new rows.
>> Just store the rowkey in a queue (you might want to consider looking at
>> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Distributed-work-queues-td5226248.html
>>  )
>> and process that row every couple of minutes. Then manually delete columns
>> from that queue-row.
>>
>> With kind regards,
>>
>> Robin Verlangen
>> *Software engineer*
>> W http://www.robinverlangen.nl
>> E ro...@us2.nl
>>
>> Disclaimer: The information contained in this message and attachments is
>> intended solely for the attention and use of the named addressee and may be
>> confidential. If you are not the intended recipient, you are reminded that
>> the information remains the property of the sender. You must not use,
>> disclose, distribute, copy, print or rely on this e-mail. If you have
>> received this message in error, please contact the sender immediately and
>> irrevocably delete this message and any copies.
>>
>>
>>
>> 2012/8/29 Robin Verlangen <ro...@us2.nl>
>>
>>> "What this means is that eventually you will have 1 row in the
>>> secondary index table with 350K columns"
>>>
>>> Is this really true? I would have expected that Cassandra used internal
>>> index sharding/bucketing?
>>>
>>> With kind regards,
>>>
>>> Robin Verlangen
>>> *Software engineer*
>>> W http://www.robinverlangen.nl
>>> E ro...@us2.nl
>>>
>>>
>>> 2012/8/29 Dave Brosius <dbros...@mebigfatguy.com>
>>>
>>>> If I understand you correctly, you are only ever querying for the rows
>>>> where is_exported = false, and turning them into trues. What this means is
>>>> that eventually you will have 1 row in the secondary index table with 350K
>>>> columns that you will never look at.
>>>>
>>>> It seems to me that perhaps you should just hold your own "manual
>>>> index" CF that points to non-exported rows, and just delete those columns
>>>> when they are exported.
>>>>
>>>>
>>>>
>>>> On 08/28/2012 05:23 PM, Edward Kibardin wrote:
>>>>
>>>>> I have a column family with a secondary index. The indexed field is
>>>>> basically a boolean, but I'm using a string for it. The field is
>>>>> called *is_exported* and can be *'true'* or *'false'*. After each
>>>>> request, all loaded rows are updated with *is_exported = 'true'*.
>>>>>
>>>>> I'm polling this column family every ten minutes and exporting new rows
>>>>> as they appear.
>>>>>
>>>>> But here is the problem: the time for this query grows pretty much
>>>>> linearly with the amount of data in the column family, and currently
>>>>> it takes *from 12 to 20 seconds (!!!) to find 5000 rows*. From my
>>>>> understanding, an indexed request should depend not on the number of
>>>>> rows in the CF but on the number of rows per index value
>>>>> (cardinality), as the index is just another hidden CF like:
>>>>>
>>>>>         "true" : rowKey1 rowKey2 rowKey3 ...
>>>>>         "false": rowKey1 rowKey2 rowKey3 ...
>>>>>
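>>>>> I'm using Pycassa to query the data; here is the code I'm using:
>>>>>
>>>>>         import pycassa
>>>>>         from pycassa.index import (create_index_clause,
>>>>>                                    create_index_expression)
>>>>>
>>>>>         # read_consistency_level=2 corresponds to QUORUM
>>>>>         column_family = pycassa.ColumnFamily(cassandra_pool,
>>>>>             column_family_name, read_consistency_level=2)
>>>>>         # match rows whose indexed is_exported column equals 'false'
>>>>>         is_exported_expr = create_index_expression('is_exported', 'false')
>>>>>         clause = create_index_clause([is_exported_expr], count=5000)
>>>>>         column_family.get_indexed_slices(clause)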
>>>>>
>>>>> Am I doing something wrong? I expected this operation to work MUCH
>>>>> faster.
>>>>>
>>>>> Any ideas or suggestions?
>>>>>
>>>>> Some config info:
>>>>>  - Cassandra 1.1.0
>>>>>  - RandomPartitioner
>>>>>  - I have 2 nodes and replication_factor = 2 (each server has a full
>>>>> data copy)
>>>>>  - Using AWS EC2, large instances
>>>>>  - Software raid0 on ephemeral drives
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>>
>>>>
>>>
>>
>>
>
>
> --
> Tyler Hobbs
> DataStax <http://datastax.com/>
>
>
