Re: Cassandra search performance

2012-05-12 Thread Jason Tang
I am trying to search on one column that stores a time as type Long.
1,000,000 records are evenly distributed over 24 hours, and I only want to
search a certain time range, e.g. from 01:30 to 01:50 or from 08:00 to 12:00,
but something strange happened.

Search 00:00 to 23:59, limit 100:
It took less than 1 second and scanned 100 records.

Search 00:00 to 00:20, limit 100:
It took more than one minute and scanned around 2,400 records.

So the results suggest that Cassandra scans records one by one to match the
condition, and that the data is not stored in time order.
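A toy model may help show why a narrow range with the same limit can scan far more records. This is only a sketch, not Cassandra's implementation: it assumes the index narrows the candidates by the equality condition alone and that the time range is then checked record by record, in an order unrelated to time. The record layout, the helper names, and the scaled-down record count are all made up for illustration.

```python
# Toy model only: this is NOT how Cassandra is implemented internally.
MINUTES_PER_DAY = 24 * 60
N_RECORDS = 100_000  # scaled down from the 1,000,000 records in the post


def minute_of_day(i):
    # 7919 is coprime with 1440, so the values are spread evenly over the day
    # while the iteration order stays unrelated to time order.
    return (i * 7919) % MINUTES_PER_DAY


records = [{"equal": "equal", "time": minute_of_day(i)} for i in range(N_RECORDS)]


def search(records, lo, hi, limit):
    """Walk the equality candidates one by one, testing the time range on each."""
    matches = []
    scanned = 0
    for rec in records:
        if rec["equal"] != "equal":  # the only condition the "index" can serve
            continue
        scanned += 1
        if lo <= rec["time"] <= hi:  # range test applied per record
            matches.append(rec)
            if len(matches) == limit:
                break
    return matches, scanned


wide_matches, wide_scanned = search(records, 0, 23 * 60 + 59, 100)
narrow_matches, narrow_scanned = search(records, 0, 20, 100)
print("00:00-23:59 scanned:", wide_scanned)
print("00:00-00:20 scanned:", narrow_scanned)
```

In this model the wide range stops after 100 records because every candidate matches, while the narrow range has to walk through many non-matching candidates before it collects 100 hits.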

One more thing: to have an equality condition in the query, I added a
redundant column whose value is the same for all records.
The search condition is like: get records where equal = 'equal' and
time > 00:00 and time < 00:20

Is this the expected behavior of a secondary index, or am I using it incorrectly?

In an earlier test I had a string column in which most values were the
string 'true', and I added 100 'false' values among 1,000,000 'true' values;
that search scanned only 100 records.

So how can I examine what happens inside Cassandra, and where can I find
details of how secondary indexes work?

On Tuesday, May 8, 2012, Maxim Potekhin wrote:

  Thanks for the comments, much appreciated.

 Maxim







Re: Cassandra search performance

2012-05-07 Thread David Jeske
On Sun, Apr 29, 2012 at 4:32 PM, Maxim Potekhin potek...@bnl.gov wrote:

 Looking at your example, as I think you understand, you forgo indexes by
 combining two conditions in one query, thinking along the lines of what is
 often done in RDBMS. A scan is expected in this case, and there is no
 magic to avoid it.


This sounds like a misunderstanding of how RDBMSs work. If you combine two
conditions in a single SQL query, the SQL execution optimizer looks at the
cardinality of any indexes. If it can successfully predict that one of the
conditions significantly reduces the set of rows to be considered
(such as a status match having 200 hits vs. 1M rows in the table), then it
selects that index for the first pass, and each index hit causes a
record lookup, which is then tested against the other conditions. (This is
one of several query-execution strategies that RDBMS systems use.)
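The strategy described above can be sketched roughly as follows. This is a simplified illustration, not any particular RDBMS's optimizer: the in-memory dictionaries stand in for indexes, and the names (build_index, run_query) and the scaled-down row count are assumptions made for the example.

```python
from collections import defaultdict

N_ROWS = 100_000  # scaled down; the thread talks about 1M rows

# 200 rows have status "pending"; every row has the same partition value.
rows = [
    {"status": "pending" if i % 500 == 0 else "done", "partition": "p0"}
    for i in range(N_ROWS)
]


def build_index(rows, column):
    """Map each column value to the list of row ids holding it."""
    index = defaultdict(list)
    for row_id, row in enumerate(rows):
        index[row[column]].append(row_id)
    return index


indexes = {col: build_index(rows, col) for col in ("status", "partition")}


def run_query(conditions):
    """conditions maps column -> required value (equality only)."""

    def hits(col):
        return indexes[col].get(conditions[col], [])

    # Pick the condition whose index returns the fewest candidates.
    driving = min(conditions, key=lambda col: len(hits(col)))
    candidates = hits(driving)
    # Look up each candidate row and test it against all conditions.
    matches = [
        rows[i]
        for i in candidates
        if all(rows[i][c] == v for c, v in conditions.items())
    ]
    return driving, len(candidates), matches


driving, examined, matches = run_query({"status": "pending", "partition": "p0"})
print(driving, examined, len(matches))
```

Driving the query from the partition index instead would examine every row, which is the difference the paragraph above is pointing at.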

I'm no Cassandra expert, so I don't know what it does with regard to index
selection, but from the page written on secondary indexes, it seems that if
you just query on status and do the other filtering yourself, it will
probably do what you want...

http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes


 However, if this query is important, you can easily index on two
 conditions, using a composite type (look it up), or string concatenation
 for a quick and easy solution.


This is not necessarily a good idea. Creating a composite index explodes
the index size unnecessarily. If a condition can reduce a query to 200
records, there is no need to have a composite index including another
condition.


Re: Cassandra search performance

2012-04-29 Thread Maxim Potekhin
Jason,

I'm using plenty of secondary indexes with no problem at all.

Looking at your example, as I think you understand, you forgo indexes by
combining two conditions in one query, thinking along the lines of what is
often done in RDBMS. A scan is expected in this case, and there is no
magic to avoid it.

However, if this query is important, you can easily index on two conditions,
using a composite type (look it up), or string concatenation for a quick and
easy solution. That is, you _create an additional column_ which contains a
combination of the two you want to use in a query. Then index on it.
Problem solved.
The composite solution is more elegant, but what I describe works in
simple cases.
It works for me.
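A minimal sketch of the string-concatenation idea, using a plain dictionary as a stand-in for the index. The column name status_partition, the separator, and the sample rows are hypothetical; the approach also assumes the separator never appears in either value and that both conditions are equality tests.

```python
from collections import defaultdict

SEPARATOR = "|"  # assumed never to occur inside status or partition values


def combined(status, partition):
    """Build the value stored in the extra, indexed column."""
    return f"{status}{SEPARATOR}{partition}"


raw_rows = [
    {"key": "k1", "status": "pending", "partition": "p1"},
    {"key": "k2", "status": "done", "partition": "p1"},
    {"key": "k3", "status": "pending", "partition": "p2"},
    {"key": "k4", "status": "pending", "partition": "p1"},
]

# Write path: add the combined column next to the two original ones.
rows = [dict(r, status_partition=combined(r["status"], r["partition"])) for r in raw_rows]

# Stand-in for an index on the combined column: value -> row keys.
index = defaultdict(list)
for r in rows:
    index[r["status_partition"]].append(r["key"])

# Read path: one equality lookup covers both conditions.
hits = index[combined("pending", "p1")]
print(hits)  # ['k1', 'k4']
```

The extra column has to be kept in sync whenever status or partition changes, which is part of the cost of this shortcut.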

Maxim









Re: Cassandra search performance

2012-04-25 Thread Jason Tang
And I found that if the only search condition is status, it scans only 200
records.

But if I combine it with another condition, partition, then it scans all
records, because the partition condition matches all records.

But combined with another condition such as userName, even though userName is
the same in all 1,000,000 records, it scans only 200 records.

So it is affected by the scan execution plan. If we have several search
conditions, how does it work? Does Cassandra have a similar execution plan?


On April 25, 2012 at 9:18 PM, Jason Tang ares.t...@gmail.com wrote:

 Hi,

 We have the CF below and use a secondary index to search on a simple
 status column; among 1,000,000 rows, 200 records have the status
 we want.

 But when we start the search, performance is very poor. Checking with
 the command ./bin/nodetool -h localhost -p 8199 cfstats, Cassandra
 read 1,000,000 records with a Read Latency of 0.2 ms, so in total it took
 200 seconds.
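The 200-second figure follows directly from the two numbers quoted from cfstats. The short calculation below spells it out; the second part is only a hypothetical comparison that assumes the same per-read latency would apply if just the 200 matching records were read.

```python
reads = 1_000_000   # records read, as reported in the post
latency_us = 200    # 0.2 ms per read, expressed in microseconds

total_seconds = reads * latency_us / 1_000_000
print(f"reading every record: {total_seconds:.0f} s")

# Hypothetical: if only the 200 matching records were read at the same latency.
indexed_reads = 200
indexed_seconds = indexed_reads * latency_us / 1_000_000
print(f"reading 200 records: {indexed_seconds * 1000:.0f} ms")
```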

 It uses a lot of CPU, and checking the stack shows that all threads in
 Cassandra are reading from a socket.

 So I wonder how to really use the index to find the 200 records instead of
 scanning all rows. (Super Column?)

 ColumnFamily: queue
   Key Validation Class: org.apache.cassandra.db.marshal.BytesType
   Default column value validator: org.apache.cassandra.db.marshal.BytesType
   Columns sorted by: org.apache.cassandra.db.marshal.BytesType
   Row cache size / save period in seconds / keys to save : 0.0/0/all
   Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
   Key cache size / save period in seconds: 0.0/0
   GC grace seconds: 0
   Compaction min/max thresholds: 4/32
   Read repair chance: 0.0
   Replicate on write: false
   Bloom Filter FP chance: default
   Built indexes: [queue.idxStatus]
   Column Metadata:
     Column Name: status (737461747573)
       Validation Class: org.apache.cassandra.db.marshal.AsciiType
       Index Name: idxStatus
       Index Type: KEYS
 BRs
 //Jason
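As a side note on reading the column family description above: the value in parentheses after the column name is the name's bytes written in hexadecimal, which is how names appear when the comparator is BytesType. A quick way to check it, using only the Python standard library:

```python
hex_name = "737461747573"  # taken from the "Column Name" line above

# Each pair of hex digits is one ASCII byte of the column name.
decoded = bytes.fromhex(hex_name).decode("ascii")
print(decoded)  # status
```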



Re: Cassandra search performance

2012-04-25 Thread Philip Shon
What version of Cassandra are you using? I found a big performance hit
when querying on the secondary index.

I came across this bug in versions prior to 1.1:

https://issues.apache.org/jira/browse/CASSANDRA-3545

Hope that helps.






Re: Cassandra search performance

2012-04-25 Thread Jason Tang
1.0.8

On April 25, 2012 at 10:38 PM, Philip Shon philip.s...@gmail.com wrote:

 What version of Cassandra are you using? I found a big performance hit
 when querying on the secondary index.

 I came across this bug in versions prior to 1.1

 https://issues.apache.org/jira/browse/CASSANDRA-3545

 Hope that helps.
