Thanks Jirka!

From: user@cassandra.apache.org
Subject: Re: How to speed up SELECT * query in Cassandra

Hi, here are some snippets of code in scala which should get you started.

Jirka H.

    loop { lastRow =>
      val query = lastRow match {
        case Some(row) => nextPageQuery(row, upperLimit)
        case None => initialQuery(lowerLimit)
      }
      session.execute(query).all
    }

    private def nextPageQuery(row: Row, upperLimit: String): String = {
      val tokenPart = "token(%s) > token(0x%s) and token(%s) < %s".format(rowKeyName, hex(row.getBytes(rowKeyName)), rowKeyName, upperLimit)
      basicQuery.format(tokenPart)
    }

    private def initialQuery(lowerLimit: String): String = {
      val tokenPart = "token(%s) >= %s".format(rowKeyName, lowerLimit)
      basicQuery.format(tokenPart)
    }

    private def calculateRanges: (BigDecimal, BigDecimal, IndexedSeq[(BigDecimal, BigDecimal)]) = {
      tokenRange match {
        case Some((start, end)) =>
          Logger.info("Token range given: {}", "<" + start.underlying.toPlainString + ", " + end.underlying.toPlainString + ">")
          val tokenSpaceSize = end - start
          val rangeSize = tokenSpaceSize / concurrency
          val ranges = for (i <- 0 until concurrency) yield (start + (i * rangeSize), start + ((i + 1) * rangeSize))
          (tokenSpaceSize, rangeSize, ranges)
        case None =>
          val tokenSpaceSize = partitioner.max - partitioner.min
          val rangeSize = tokenSpaceSize / concurrency
          val ranges = for (i <- 0 until concurrency) yield (partitioner.min + (i * rangeSize), partitioner.min + ((i + 1) * rangeSize))
          (tokenSpaceSize, rangeSize, ranges)
      }
    }

    private val basicQuery = {
      "select %s, %s, %s, writetime(%s) from %s where %s%s limit %d%s".format(
        rowKeyName,
        columnKeyName,
        columnValueName,
        columnValueName,
        columnFamily,
        "%s", // template
        whereCondition,
        pageSize,
        if (cqlAllowFiltering) " allow filtering" else ""
      )
    }

    case object Murmur3 extends Partitioner {
      override val min = BigDecimal(-2).pow(63)
      override val max = BigDecimal(2).pow(63) - 1
    }

    case object Random extends Partitioner {
      override val min = BigDecimal(0)
      override val max = BigDecimal(2).pow(127) - 1
    }

On 02/11/2015 02:21 PM, Ja Sam wrote:
Your answer looks very promising. How do
you calculate start and stop?

On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky <ho...@avast.com> wrote:
The fastest way I am aware of is to run the queries in parallel against multiple Cassandra nodes and make sure that you only ask each node for keys it is responsible for. Otherwise, the node has to forward your query to another node, which is much slower and creates unnecessary objects (and thus GC pressure). You can use the token range information manually if the driver does not take it into account for you. Then you can tune the concurrency and the batch size of a single query against one node.

Basically, what you (or the driver) should do is transform the query into a series of "SELECT * FROM TABLE WHERE TOKEN IN (start, stop)" queries. I will need to look up the actual code, but the idea should be clear :)

Jirka H.

On 02/11/2015 11:26 AM, Ja Sam wrote:
> Is there a simple way (or even a complicated one) to speed up a
> SELECT * FROM [table] query?
> I need to get all rows from one table every day. I split the tables and
> create one for each day, but the query is still quite slow (200 million
> records).
>
> I was thinking about running this query in parallel, but I don't know if
> it is possible.
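
For anyone reading this thread in the archive: the "start" and "stop" values discussed above can be computed by cutting the partitioner's token space into equal slices, one per worker. The following is only a minimal, self-contained sketch of that idea and not the code from the original tool. The names TokenRanges, split and rangeQuery are hypothetical, the table and key names in main are placeholders, and the Murmur3 bounds are the ones quoted in the snippets. It uses BigInt instead of BigDecimal so that every boundary is a whole number, and it pins the last slice to the maximum token so that rounding in the division cannot leave a gap at the top.

```scala
object TokenRanges {
  // Murmur3 token bounds, same values as in the snippets quoted in this thread.
  val Min: BigInt = -(BigInt(2).pow(63))
  val Max: BigInt = BigInt(2).pow(63) - 1

  /** Cut [min, max] into `parts` contiguous (start, end) slices.
    * Each slice starts where the previous one ended; the last slice
    * ends exactly at `max`, absorbing the integer-division remainder. */
  def split(min: BigInt, max: BigInt, parts: Int): IndexedSeq[(BigInt, BigInt)] = {
    require(parts > 0, "parts must be positive")
    require(max > min, "max must be greater than min")
    val size = (max - min) / parts
    (0 until parts).map { i =>
      val start = min + size * i
      val end = if (i == parts - 1) max else min + size * (i + 1)
      (start, end)
    }
  }

  /** Build the CQL text for one slice. The slice is (start, end], except
    * when `includeStart` is set (used for the very first slice so that the
    * minimum token is not skipped). `table` and `key` are placeholders. */
  def rangeQuery(table: String, key: String, start: BigInt, end: BigInt,
                 includeStart: Boolean): String = {
    val lowerOp = if (includeStart) ">=" else ">"
    s"SELECT * FROM $table WHERE token($key) $lowerOp $start AND token($key) <= $end"
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical table and partition key names, four parallel workers.
    val slices = split(Min, Max, 4)
    slices.zipWithIndex.foreach { case ((start, end), i) =>
      println(rangeQuery("my_keyspace.my_table", "id", start, end, includeStart = i == 0))
    }
  }
}
```

Each generated statement can then be run by its own worker, ideally against a node that owns that part of the ring, and paged as in the nextPageQuery snippet. The slices here are plain arithmetic over the whole ring; they do not follow the cluster's actual token ownership, which would have to come from the driver's metadata.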