Re: How to speed up SELECT * query in Cassandra

Marcelo Valle (BLOOMBERG/ LONDON) Thu, 12 Feb 2015 05:56:49 -0800

Thanks Jirka!

From: user@cassandra.apache.org 
Subject: Re: How to speed up SELECT * query in Cassandra


Hi,

here are some snippets of code in Scala which should get you started.
    
    Jirka H.
    
    loop { lastRow =>
      val query = lastRow match {
        case Some(row) => nextPageQuery(row, upperLimit)
        case None      => initialQuery(lowerLimit)
      }
      session.execute(query).all
    }
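
The `loop` helper itself is not part of the snippet. As a rough, self-contained sketch of what such a paging loop could look like, assuming a hypothetical `fetchPage` function that stands in for "build the query from the last row seen and execute it", and assuming that an empty page means the range is exhausted:

```scala
import scala.annotation.tailrec

object Paging {
  /** Pages through a result set by calling `fetchPage` repeatedly.
    *
    * `fetchPage` is a hypothetical stand-in for building a query from the
    * last row seen (None on the first call) and executing it.
    * Paging stops at the first empty page.
    */
  def fetchAll[A](fetchPage: Option[A] => Seq[A]): Vector[A] = {
    @tailrec
    def go(lastRow: Option[A], acc: Vector[A]): Vector[A] = {
      val page = fetchPage(lastRow)
      if (page.isEmpty) acc
      else go(page.lastOption, acc ++ page)
    }
    go(None, Vector.empty)
  }
}
```

This version collects every row in memory only to keep the example short; with hundreds of millions of rows it is probably better to process each page as it arrives.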
      
    
    private def nextPageQuery(row: Row, upperLimit: String): String = {
      val tokenPart = "token(%s) > token(0x%s) and token(%s) < %s".format(
        rowKeyName, hex(row.getBytes(rowKeyName)), rowKeyName, upperLimit)
      basicQuery.format(tokenPart)
    }
        
      
    private def initialQuery(lowerLimit: String): String = {
      val tokenPart = "token(%s) >= %s".format(rowKeyName, lowerLimit)
      basicQuery.format(tokenPart)
    }

    private def calculateRanges: (BigDecimal, BigDecimal, IndexedSeq[(BigDecimal, BigDecimal)]) = {
      tokenRange match {
        case Some((start, end)) =>
          Logger.info("Token range given: {}",
            "<" + start.underlying.toPlainString + ", " + end.underlying.toPlainString + ">")
          val tokenSpaceSize = end - start
          val rangeSize = tokenSpaceSize / concurrency
          val ranges = for (i <- 0 until concurrency)
            yield (start + (i * rangeSize), start + ((i + 1) * rangeSize))
          (tokenSpaceSize, rangeSize, ranges)
        case None =>
          val tokenSpaceSize = partitioner.max - partitioner.min
          val rangeSize = tokenSpaceSize / concurrency
          val ranges = for (i <- 0 until concurrency)
            yield (partitioner.min + (i * rangeSize), partitioner.min + ((i + 1) * rangeSize))
          (tokenSpaceSize, rangeSize, ranges)
      }
    }
    
    private val basicQuery = {
      "select %s, %s, %s, writetime(%s) from %s where %s%s limit %d%s".format(
        rowKeyName,
        columnKeyName,
        columnValueName,
        columnValueName,
        columnFamily,
        "%s", // template
        whereCondition,
        pageSize,
        if (cqlAllowFiltering) " allow filtering" else ""
      )
    }
          
        
    case object Murmur3 extends Partitioner {
      override val min = BigDecimal(-2).pow(63)
      override val max = BigDecimal(2).pow(63) - 1
    }

    case object Random extends Partitioner {
      override val min = BigDecimal(0)
      override val max = BigDecimal(2).pow(127) - 1
    }
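
The range calculation can also be tried in isolation. The following is a self-contained sketch rather than the original code: the `Partitioner` trait is not shown in the snippets, so its shape here is an assumption, and `splitRanges` is a hypothetical name. It uses `BigInt` instead of `BigDecimal` so that every bound is a whole token, and it pins the last range to `max` so the division remainder cannot leave a gap at the top of the ring.

```scala
object TokenRanges {
  // Assumed shape of the Partitioner trait; the original definition is not shown.
  sealed trait Partitioner {
    def min: BigInt
    def max: BigInt
  }

  case object Murmur3 extends Partitioner {
    val min: BigInt = BigInt(-2).pow(63)
    val max: BigInt = BigInt(2).pow(63) - 1
  }

  case object Random extends Partitioner {
    val min: BigInt = BigInt(0)
    val max: BigInt = BigInt(2).pow(127) - 1
  }

  /** Splits [min, max] into `concurrency` contiguous (start, end) pairs.
    * The last pair always ends at `max`, absorbing the division remainder.
    */
  def splitRanges(p: Partitioner, concurrency: Int): IndexedSeq[(BigInt, BigInt)] = {
    require(concurrency > 0, "concurrency must be positive")
    val rangeSize = (p.max - p.min) / concurrency
    for (i <- 0 until concurrency) yield {
      val start = p.min + rangeSize * i
      val end =
        if (i == concurrency - 1) p.max
        else p.min + rangeSize * (i + 1)
      (start, end)
    }
  }
}
```

Adjacent ranges share a boundary token, so the queries built from them should treat one end as exclusive and the other as inclusive to avoid reading the same token twice.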
          
          
             
On 02/11/2015 02:21 PM, Ja Sam wrote:
    
Your answer looks very promising.

How do you calculate start and stop?

On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky <ho...@avast.com> wrote:
          
The fastest way I am aware of is to run the queries in parallel against
multiple Cassandra nodes and make sure that you only ask each node for
the keys it is responsible for. Otherwise, the node needs to forward
your query to another node, which is much slower and creates
unnecessary objects (and thus GC pressure).

You can manually take advantage of the token range information if the
driver does not take this into account for you. Then you can play with
the concurrency and the batch size of a single query against one node.
Basically, what you (or the driver) should do is transform the query
into a series of "SELECT * FROM TABLE WHERE TOKEN IN (start, stop)".

I will need to look up the actual code, but the idea should be clear :)

Jirka H.
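
The `TOKEN IN (start, stop)` form reads as shorthand; in CQL a token range is normally written by comparing `token(<partition key>)` against the two bounds. A minimal sketch of building one query per range and running them concurrently follows. The table name `events`, the key column `id`, and the helpers `rangeQuery` and `runAll` are hypothetical, and `run` stands in for whatever actually executes a query (for example `session.execute(query).all`). The sketch does not route each query to the node that owns the range; that part is left to the driver or to the caller.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration

object RangeQueries {
  import scala.concurrent.ExecutionContext.Implicits.global

  /** Builds a CQL query for rows whose partition-key token lies in (start, end]. */
  def rangeQuery(table: String, key: String, start: BigInt, end: BigInt): String =
    s"SELECT * FROM $table WHERE token($key) > $start AND token($key) <= $end"

  /** Runs every query concurrently via the hypothetical `run` function
    * and returns all rows, in the order of the input queries.
    */
  def runAll[A](queries: Seq[String])(run: String => Seq[A]): Seq[A] = {
    val futures = queries.map(q => Future(run(q)))
    Await.result(Future.sequence(futures), Duration.Inf).flatten
  }
}
```

A call such as `RangeQueries.rangeQuery("events", "id", BigInt(-10), BigInt(10))` yields `SELECT * FROM events WHERE token(id) > -10 AND token(id) <= 10`. Whether the lowest bound of the first range should be inclusive is a separate decision; the `initialQuery` snippet in this thread uses `>=` for it.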
            

                
On 02/11/2015 11:26 AM, Ja Sam wrote:
> Is there a simple way (or even a complicated one) to speed up a
> SELECT * FROM [table] query?
> I need to get all rows from one table every day. I split the tables
> and create one for each day, but the query is still quite slow
> (200 million records).
>
> I was thinking about running this query in parallel, but I don't
> know if it is possible.
