[ https://issues.apache.org/jira/browse/CASSANDRA-10050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15115370#comment-15115370 ]
Jonathan Ellis commented on CASSANDRA-10050: -------------------------------------------- Is there a compelling reason to use key order, or is just one of those things that seemed like a good idea at the time? Predicate pushdown for scans is basically the only reason to use 2i, so we should probably fix this. > Secondary Index Performance Dependent on TokenRange Searched in Analytics > ------------------------------------------------------------------------- > > Key: CASSANDRA-10050 > URL: https://issues.apache.org/jira/browse/CASSANDRA-10050 > Project: Cassandra > Issue Type: Improvement > Environment: Single node, macbook, 2.1.8 > Reporter: Russell Alexander Spitzer > Fix For: 3.x > > > In doing some test work on the Spark Cassandra Connector I saw some odd > performance when pushing down range queries with Secondary Index filters. > When running the queries we see huge amount of time when the C* server is not > doing any work and the query seem to be hanging. This investigation led to > the work in this document > https://docs.google.com/spreadsheets/d/1aJg3KX7nPnY77RJ9ZT-IfaYADgJh0A--nAxItvC6hb4/edit#gid=0 > The Spark Cassandra Connector builds up token range specific queries and > allows the user to pushdown relevant fields to C*. Here we have two indexed > fields (size) and (color) being pushed down to C*. > {code} > SELECT count(*) FROM ks.tab WHERE token("store") > $min AND token("store") <= > $max AND color = 'red' AND size = 'P' ALLOW FILTERING;{code} > These queries will have different token ranges inserted and executed as > separate spark tasks. Spark tasks with token ranges near the Min(token) end > up executing much faster than those near Max(token) which also happen to > through errors. > {code} > Coordinator node timed out waiting for replica nodes' responses] > message="Operation timed out - received only 0 responses." > info={'received_responses': 0, 'required_responses': 1, 'consistency': 'ONE'} > {code} > I took the queries and ran them through CQLSH to see the difference in time. > A linear relationship is seen based on where the tokenRange being queried is > starting with only 2 second for queries near the beginning of the full token > spectrum and over 12 seconds at the end of the spectrum. > The question is, can this behavior be improved? or should we not recommend > using secondary indexes with Analytics workloads? -- This message was sent by Atlassian JIRA (v6.3.4#6332)