Re: How to speed up SELECT * query in Cassandra

Jiri Horky Wed, 11 Feb 2015 23:33:01 -0800

Hi,

here are some snippets of Scala code which should get you started.


Jirka H.

loop { lastRow =>
  val query = lastRow match {
    case Some(row) => nextPageQuery(row, upperLimit)
    case None => initialQuery(lowerLimit)
  }
  session.execute(query).all
}
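
The loop helper itself is not included in these snippets. A minimal sketch of what such a helper could look like follows, assuming it keeps calling the block with the last row of the previous page until an empty page comes back. The names PagingSketch and fetchPage are hypothetical, and the sketch works on plain Scala collections rather than on the driver's result type.

```scala
object PagingSketch {
  /** Calls `fetchPage` with the last row of the previous page (None for the
    * first page) until an empty page is returned, and collects all rows in order.
    * Hypothetical helper, not the original implementation. */
  def loop[A](fetchPage: Option[A] => Seq[A]): Vector[A] = {
    @scala.annotation.tailrec
    def go(last: Option[A], acc: Vector[A]): Vector[A] = {
      val page = fetchPage(last)
      if (page.isEmpty) acc                  // no more rows in this range
      else go(page.lastOption, acc ++ page)  // continue after the last row seen
    }
    go(None, Vector.empty)
  }
}
```

With the driver, the block passed to loop would build the query from the last row and return the fetched rows, as in the snippet above.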


private def nextPageQuery(row: Row, upperLimit: String): String = {
  val tokenPart = "token(%s) > token(0x%s) and token(%s) < %s".format(
    rowKeyName, hex(row.getBytes(rowKeyName)), rowKeyName, upperLimit)
  basicQuery.format(tokenPart)
}


private def initialQuery(lowerLimit: String): String = {
  val tokenPart = "token(%s) >= %s".format(rowKeyName, lowerLimit)
  basicQuery.format(tokenPart)
}

private def calculateRanges: (BigDecimal, BigDecimal, IndexedSeq[(BigDecimal, BigDecimal)]) = {
  tokenRange match {
    case Some((start, end)) =>
      Logger.info("Token range given: {}",
        "<" + start.underlying.toPlainString + ", " + end.underlying.toPlainString + ">")
      val tokenSpaceSize = end - start
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency)
        yield (start + (i * rangeSize), start + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
    case None =>
      val tokenSpaceSize = partitioner.max - partitioner.min
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency)
        yield (partitioner.min + (i * rangeSize), partitioner.min + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
  }
}

private val basicQuery = {
  "select %s, %s, %s, writetime(%s) from %s where %s%s limit %d%s".format(
    rowKeyName,
    columnKeyName,
    columnValueName,
    columnValueName,
    columnFamily,
    "%s", // template
    whereCondition,
    pageSize,
    if (cqlAllowFiltering) " allow filtering" else "")
}


case object Murmur3 extends Partitioner {
  override val min = BigDecimal(-2).pow(63)
  override val max = BigDecimal(2).pow(63) - 1
}

case object Random extends Partitioner {
  override val min = BigDecimal(0)
  override val max = BigDecimal(2).pow(127) - 1
}
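
To make the range splitting easier to try out on its own, here is a self-contained sketch of the same idea for the Murmur3 token space. It is a variation, not the code above: it uses BigInt instead of BigDecimal so that range boundaries stay whole numbers, and it pins the last range to the maximum token so that nothing is lost to integer division. The names TokenRangeSketch, splitRanges and firstPageQuery, as well as the table and key names in the usage note, are hypothetical.

```scala
object TokenRangeSketch {
  // Murmur3 token bounds, same values as in the Murmur3 object above.
  val MinToken: BigInt = -(BigInt(2).pow(63))
  val MaxToken: BigInt = BigInt(2).pow(63) - 1

  /** Splits [min, max] into `parts` contiguous (start, end) pairs.
    * Every pair is meant as start <= token < end; the last pair ends at `max`
    * and should be queried with an inclusive upper bound. */
  def splitRanges(min: BigInt, max: BigInt, parts: Int): IndexedSeq[(BigInt, BigInt)] = {
    require(parts > 0, "parts must be positive")
    val size = (max - min) / parts
    for (i <- 0 until parts) yield {
      val start = min + size * i
      val end = if (i == parts - 1) max else min + size * (i + 1)
      (start, end)
    }
  }

  /** Builds the first-page query for one range (assumed query shape, select * for brevity). */
  def firstPageQuery(table: String, key: String, range: (BigInt, BigInt),
                     inclusiveEnd: Boolean, pageSize: Int): String = {
    val (start, end) = range
    val upperOp = if (inclusiveEnd) "<=" else "<"
    s"select * from $table where token($key) >= $start and token($key) $upperOp $end limit $pageSize"
  }
}
```

For example, TokenRangeSketch.splitRanges(TokenRangeSketch.MinToken, TokenRangeSketch.MaxToken, 4) gives four adjacent ranges, and each of them can then be paged through independently, for instance one range per worker thread.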


On 02/11/2015 02:21 PM, Ja Sam wrote:
> Your answer looks very promising
>
>  How do you calculate start and stop?
>
> On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky <ho...@avast.com
> <mailto:ho...@avast.com>> wrote:
>
>     The fastest way I am aware of is to do the queries in parallel to
>     multiple Cassandra nodes and make sure that you only ask them for keys
>     they are responsible for. Otherwise, the node needs to forward your
>     query, which is much slower and creates unnecessary objects (and thus
>     GC pressure).
>
>     You can manually take advantage of the token range information, if the
>     driver does not take this into account for you. Then, you can play with
>     concurrency and batch size of a single query against one node.
>     Basically, what you/driver should do is to transform the query into a
>     series of "SELECT * FROM table WHERE token(key) >= start AND
>     token(key) < stop" queries.
>
>     I will need to look up the actual code, but the idea should be clear :)
>
>     Jirka H.
>
>
>     On 02/11/2015 11:26 AM, Ja Sam wrote:
>     > Is there a simple way (or even a complicated one) to speed up a
>     > SELECT * FROM [table] query?
>     > I need to get all rows from one table every day. I split the tables
>     > and create one for each day, but the query is still quite slow (200
>     > million records).
>     >
>     > I was thinking about running this query in parallel, but I don't know
>     > if it is possible.
>
>
