[jira] [Updated] (CASSANDRA-10050) Secondary Index Performance Dependent on TokenRange Searched in Analytics

2015-10-23 Thread Aleksey Yeschenko (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aleksey Yeschenko updated CASSANDRA-10050:
--
Issue Type: Improvement  (was: Bug)

> Secondary Index Performance Dependent on TokenRange Searched in Analytics
> -
>
> Key: CASSANDRA-10050
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10050
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
> Environment: Single node, macbook, 2.1.8
>Reporter: Russell Alexander Spitzer
> Fix For: 3.x
>
>
> In doing some test work on the Spark Cassandra Connector I saw some odd 
> performance when pushing down range queries with Secondary Index filters. 
> When running the queries we see a huge amount of time when the C* server is 
> not doing any work and the query seems to be hanging. This investigation led 
> to the work in this document
> https://docs.google.com/spreadsheets/d/1aJg3KX7nPnY77RJ9ZT-IfaYADgJh0A--nAxItvC6hb4/edit#gid=0
> The Spark Cassandra Connector builds up token range specific queries and 
> allows the user to pushdown relevant fields to C*. Here we have two indexed 
> fields (size) and (color) being pushed down to C*. 
> {code}
> SELECT count(*) FROM ks.tab WHERE token("store") > $min AND token("store") <= 
> $max AND color = 'red' AND size = 'P' ALLOW FILTERING;{code}
> These queries will have different token ranges inserted and executed as 
> separate Spark tasks. Spark tasks with token ranges near the Min(token) end 
> up executing much faster than those near Max(token), which also happen to 
> throw errors.
> {code}
> Coordinator node timed out waiting for replica nodes' responses] 
> message="Operation timed out - received only 0 responses." 
> info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
> {code}
> I took the queries and ran them through CQLSH to see the difference in time. 
> The time grows linearly with the position of the queried tokenRange, 
> starting at only 2 seconds for queries near the beginning of the full token 
> spectrum and reaching over 12 seconds at the end of the spectrum. 
> The question is, can this behavior be improved? Or should we not recommend 
> using secondary indexes with Analytics workloads?
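
For anyone wanting to reproduce the measurement outside Spark, the token-range-specific queries described above can be approximated with a short script. This is only a sketch: it assumes the default Murmur3Partitioner (token range -2^63 to 2^63 - 1), and the helper names `split_token_range` and `build_queries` are made up for illustration; the Spark Cassandra Connector derives its splits differently.

```python
# Sketch only: split the assumed Murmur3Partitioner token range into
# contiguous (lo, hi] sub-ranges and render one CQL string per sub-range.
MIN_TOKEN = -2**63       # assumed: Murmur3Partitioner lower bound
MAX_TOKEN = 2**63 - 1    # assumed: Murmur3Partitioner upper bound

# Same shape as the query in the issue, with $min/$max as placeholders.
QUERY_TEMPLATE = (
    'SELECT count(*) FROM ks.tab '
    'WHERE token("store") > {lo} AND token("store") <= {hi} '
    "AND color = 'red' AND size = 'P' ALLOW FILTERING;"
)


def split_token_range(num_splits):
    """Return num_splits contiguous (lo, hi) pairs covering the token range."""
    span = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + (span * i) // num_splits
              for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def build_queries(num_splits):
    """Render one query string per sub-range, ordered from Min to Max token."""
    return [QUERY_TEMPLATE.format(lo=lo, hi=hi)
            for lo, hi in split_token_range(num_splits)]


if __name__ == "__main__":
    for query in build_queries(4):
        print(query)
```

Running the generated statements one at a time in CQLSH with tracing or a stopwatch, in order, should show whether the elapsed time rises with the position of the range.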



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-10050) Secondary Index Performance Dependent on TokenRange Searched in Analytics

2015-08-21 Thread Jonathan Ellis (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Ellis updated CASSANDRA-10050:
---
Fix Version/s: 3.x

 Secondary Index Performance Dependent on TokenRange Searched in Analytics
 -

 Key: CASSANDRA-10050
 URL: https://issues.apache.org/jira/browse/CASSANDRA-10050
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Single node, macbook, 2.1.8
Reporter: Russell Alexander Spitzer
 Fix For: 3.x


 In doing some test work on the Spark Cassandra Connector I saw some odd 
 performance when pushing down range queries with Secondary Index filters. 
 When running the queries we see a huge amount of time when the C* server is 
 not doing any work and the query seems to be hanging. This investigation led 
 to the work in this document
 https://docs.google.com/spreadsheets/d/1aJg3KX7nPnY77RJ9ZT-IfaYADgJh0A--nAxItvC6hb4/edit#gid=0
 The Spark Cassandra Connector builds up token range specific queries and 
 allows the user to pushdown relevant fields to C*. Here we have two indexed 
 fields (size) and (color) being pushed down to C*. 
 {code}
 SELECT count(*) FROM ks.tab WHERE token("store") > $min AND token("store") <= 
 $max AND color = 'red' AND size = 'P' ALLOW FILTERING;{code}
 These queries will have different token ranges inserted and executed as 
 separate Spark tasks. Spark tasks with token ranges near the Min(token) end 
 up executing much faster than those near Max(token), which also happen to 
 throw errors.
 {code}
 Coordinator node timed out waiting for replica nodes' responses] 
 message="Operation timed out - received only 0 responses." 
 info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
 {code}
 I took the queries and ran them through CQLSH to see the difference in time. 
 The time grows linearly with the position of the queried tokenRange, 
 starting at only 2 seconds for queries near the beginning of the full token 
 spectrum and reaching over 12 seconds at the end of the spectrum. 
 The question is, can this behavior be improved? Or should we not recommend 
 using secondary indexes with Analytics workloads?





[jira] [Updated] (CASSANDRA-10050) Secondary Index Performance Dependent on TokenRange Searched in Analytics

2015-08-11 Thread Russell Alexander Spitzer (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Russell Alexander Spitzer updated CASSANDRA-10050:
--
Description: 
In doing some test work on the Spark Cassandra Connector I saw some odd 
performance when pushing down range queries with Secondary Index filters. When 
running the queries we see a huge amount of time when the C* server is not doing 
any work and the query seems to be hanging. This investigation led to the work 
in this document

https://docs.google.com/spreadsheets/d/1aJg3KX7nPnY77RJ9ZT-IfaYADgJh0A--nAxItvC6hb4/edit#gid=0

The Spark Cassandra Connector builds up token range specific queries and allows 
the user to pushdown relevant fields to C*. Here we have two indexed fields 
(size) and (color) being pushed down to C*. 

{code}
SELECT count(*) FROM ks.tab WHERE token("store") > $min AND token("store") <= 
$max AND color = 'red' AND size = 'P' ALLOW FILTERING;{code}

These queries will have different token ranges inserted and executed as 
separate Spark tasks. Spark tasks with token ranges near the Min(token) end up 
executing much faster than those near Max(token), which also happen to throw 
errors.

{code}
Coordinator node timed out waiting for replica nodes' responses] 
message="Operation timed out - received only 0 responses." 
info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
{code}

I took the queries and ran them through CQLSH to see the difference in time. 
The time grows linearly with the position of the queried tokenRange, 
starting at only 2 seconds for queries near the beginning of the full token 
spectrum and reaching over 12 seconds at the end of the spectrum. 

The question is, can this behavior be improved? Or should we not recommend 
using secondary indexes with Analytics workloads?



  was:
In doing some test work on the Spark Cassandra Connector I saw some odd 
performance when pushing down range queries with Secondary Index filters. When 
running the queries we see a huge amount of time when the C* server is not doing 
any work and the query seems to be hanging. This investigation led to the work 
in this document

https://docs.google.com/spreadsheets/d/1aJg3KX7nPnY77RJ9ZT-IfaYADgJh0A--nAxItvC6hb4/edit#gid=0

The Spark Cassandra Connector builds up token range specific queries and allows 
the user to pushdown relevant fields to C*. Here we should two indexed fields 
(size) and (color) being pushed down to C*. 

{code}
SELECT count(*) FROM ks.tab WHERE token("store") > $min AND token("store") <= 
$max AND color = 'red' AND size = 'P' ALLOW FILTERING;{code}

These queries will have different token ranges inserted and executed as 
separate Spark tasks. Spark tasks with token ranges near the Min(token) end up 
executing much faster than those near Max(token), which also happen to throw 
errors.

{code}
Coordinator node timed out waiting for replica nodes' responses] 
message="Operation timed out - received only 0 responses." 
info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'}
{code}

I took the queries and ran them through CQLSH to see the difference in time. 
The time grows linearly with the position of the queried tokenRange, 
starting at only 2 seconds for queries near the beginning of the full token 
spectrum and reaching over 12 seconds at the end of the spectrum. 

The question is, can this behavior be improved? Or should we not recommend 
using secondary indexes with Analytics workloads?




 Secondary Index Performance Dependent on TokenRange Searched in Analytics
 -

 Key: CASSANDRA-10050
 URL: https://issues.apache.org/jira/browse/CASSANDRA-10050
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: Single node, macbook, 2.1.8
Reporter: Russell Alexander Spitzer

 In doing some test work on the Spark Cassandra Connector I saw some odd 
 performance when pushing down range queries with Secondary Index filters. 
 When running the queries we see a huge amount of time when the C* server is 
 not doing any work and the query seems to be hanging. This investigation led 
 to the work in this document
 https://docs.google.com/spreadsheets/d/1aJg3KX7nPnY77RJ9ZT-IfaYADgJh0A--nAxItvC6hb4/edit#gid=0
 The Spark Cassandra Connector builds up token range specific queries and 
 allows the user to pushdown relevant fields to C*. Here we have two indexed 
 fields (size) and (color) being pushed down to C*. 
 {code}
 SELECT count(*) FROM ks.tab WHERE token("store") > $min AND token("store") <= 
 $max AND color = 'red' AND size = 'P' ALLOW FILTERING;{code}
 These queries will have different token ranges inserted and executed as 
 separate Spark tasks. Spark tasks with token ranges near the Min(token) end 
 up executing much faster than those near