Thanks Jirka!
From: [email protected]
Subject: Re: How to speed up SELECT * query in Cassandra
Hi,
here are some snippets of code in Scala which should get you started.
Jirka H.
    loop { lastRow =>
      val query = lastRow match {
        case Some(row) => nextPageQuery(row, upperLimit)
        case None      => initialQuery(lowerLimit)
      }
      session.execute(query).all
    }
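The `loop` helper itself is not included in the snippets. A minimal sketch of one way it could work, assuming it keeps requesting pages and hands the last row of each page to the next call until a page comes back empty. The `Paging` object, the generic row type `R` and the `fetchPage` parameter are illustrative names, not part of the original code:

```scala
object Paging {
  // Hypothetical stand-in for the `loop` helper used above.
  // fetchPage receives None for the first page and Some(lastRow) afterwards.
  // The first page is fetched immediately; later pages are fetched only
  // as the returned iterator is consumed. An empty page ends the iteration.
  def loop[R](fetchPage: Option[R] => Seq[R]): Iterator[R] =
    Iterator
      .iterate(fetchPage(None))(page => fetchPage(page.lastOption))
      .takeWhile(_.nonEmpty)
      .flatMap(page => page)
}
```

With a real session, `fetchPage` would presumably build the query as in the snippet above and return the result of `session.execute(query).all` converted to a Scala collection.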
    private def nextPageQuery(row: Row, upperLimit: String): String = {
      val tokenPart = "token(%s) > token(0x%s) and token(%s) < %s".format(
        rowKeyName, hex(row.getBytes(rowKeyName)), rowKeyName, upperLimit)
      basicQuery.format(tokenPart)
    }
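The `hex` helper is also not shown. Assuming `row.getBytes` returns a `java.nio.ByteBuffer` holding the raw row key, a minimal sketch of such a helper could look like this (the `Hex` object name is made up for the example):

```scala
import java.nio.ByteBuffer

object Hex {
  // Hypothetical helper: renders the remaining bytes of the buffer as
  // lowercase hex, two characters per byte, without a "0x" prefix.
  def hex(bytes: ByteBuffer): String = {
    val copy = bytes.duplicate() // leave the caller's position untouched
    val sb = new StringBuilder(copy.remaining() * 2)
    while (copy.hasRemaining) sb.append(f"${copy.get() & 0xff}%02x")
    sb.toString
  }
}
```

The query template above already supplies the `0x` prefix, so the helper only has to return the hex digits.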
    private def initialQuery(lowerLimit: String): String = {
      val tokenPart = "token(%s) >= %s".format(rowKeyName, lowerLimit)
      basicQuery.format(tokenPart)
    }

    private def calculateRanges: (BigDecimal, BigDecimal, IndexedSeq[(BigDecimal, BigDecimal)]) = {
      tokenRange match {
        case Some((start, end)) =>
          Logger.info("Token range given: {}",
            "<" + start.underlying.toPlainString + ", " + end.underlying.toPlainString + ">")
          val tokenSpaceSize = end - start
          val rangeSize = tokenSpaceSize / concurrency
          val ranges = for (i <- 0 until concurrency) yield
            (start + (i * rangeSize), start + ((i + 1) * rangeSize))
          (tokenSpaceSize, rangeSize, ranges)
        case None =>
          val tokenSpaceSize = partitioner.max - partitioner.min
          val rangeSize = tokenSpaceSize / concurrency
          val ranges = for (i <- 0 until concurrency) yield
            (partitioner.min + (i * rangeSize), partitioner.min + ((i + 1) * rangeSize))
          (tokenSpaceSize, rangeSize, ranges)
      }
    }
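To make the start/stop calculation concrete, here is a small standalone sketch of the same splitting idea. It is not the code above: it uses `BigInt` instead of `BigDecimal` so that the bounds stay exact integers (`BigDecimal` arithmetic is rounded to its `MathContext`), and it pins the end of the last range to the maximum so integer division does not leave a gap. The `TokenRanges` object and the `split` function are names invented for this example:

```scala
object TokenRanges {
  // Murmur3 token bounds, matching the Murmur3 partitioner in this message.
  val Murmur3Min: BigInt = -(BigInt(2).pow(63))
  val Murmur3Max: BigInt = BigInt(2).pow(63) - 1

  // Splits [min, max] into `parts` contiguous (start, end) pairs.
  // Each end equals the next start; the last end is exactly `max`.
  def split(min: BigInt, max: BigInt, parts: Int): IndexedSeq[(BigInt, BigInt)] = {
    require(parts > 0, "parts must be positive")
    val size = (max - min) / parts
    (0 until parts).map { i =>
      val start = min + size * i
      val end = if (i == parts - 1) max else min + size * (i + 1)
      (start, end)
    }
  }
}
```

For example, `TokenRanges.split(BigInt(0), BigInt(100), 4)` yields the pairs (0, 25), (25, 50), (50, 75) and (75, 100). Since adjacent ranges share a boundary, the queries have to treat one end as inclusive and the other as exclusive, as the `>=` and `<` in the query snippets do.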
    private val basicQuery = {
      "select %s, %s, %s, writetime(%s) from %s where %s%s limit %d%s".format(
        rowKeyName,
        columnKeyName,
        columnValueName,
        columnValueName,
        columnFamily,
        "%s", // template
        whereCondition,
        pageSize,
        if (cqlAllowFiltering) " allow filtering" else ""
      )
    }
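The query is formatted in two stages: the first `format` call fills in the names and deliberately leaves one `%s` behind, which `initialQuery` and `nextPageQuery` later replace with the token condition. A standalone sketch of that behaviour, with the fields turned into parameters and with made-up column and table names (`key`, `column1`, `value`, `events`):

```scala
object QueryTemplate {
  // Same format string as basicQuery, but with the surrounding fields
  // passed in as parameters so the example is self-contained.
  def basicQuery(rowKey: String, columnKey: String, columnValue: String,
                 table: String, whereCondition: String, pageSize: Int,
                 allowFiltering: Boolean): String =
    "select %s, %s, %s, writetime(%s) from %s where %s%s limit %d%s".format(
      rowKey, columnKey, columnValue, columnValue, table,
      "%s", // left in place for the second format call
      whereCondition, pageSize,
      if (allowFiltering) " allow filtering" else "")
}

object QueryTemplateDemo {
  // Stage 1: names are filled in, one "%s" remains.
  val template: String =
    QueryTemplate.basicQuery("key", "column1", "value", "events", "", 1000, false)
  // template == "select key, column1, value, writetime(value) from events where %s limit 1000"

  // Stage 2: the token condition replaces the remaining "%s".
  val firstPage: String = template.format("token(key) >= 0")
  // firstPage == "select key, column1, value, writetime(value) from events where token(key) >= 0 limit 1000"
}
```

Because `whereCondition` is appended directly after the token condition, a non-empty value would presumably have to start with something like ` and `.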
    case object Murmur3 extends Partitioner {
      override val min = BigDecimal(-2).pow(63)
      override val max = BigDecimal(2).pow(63) - 1
    }

    case object Random extends Partitioner {
      override val min = BigDecimal(0)
      override val max = BigDecimal(2).pow(127) - 1
    }
On 02/11/2015 02:21 PM, Ja Sam wrote:
Your answer looks very promising.
How do you calculate start and stop?
On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky <[email protected]> wrote:
The fastest way I am aware of is to run the queries in parallel against
multiple Cassandra nodes and make sure that you only ask each node for keys
it is responsible for. Otherwise, the node needs to forward your query,
which is much slower and creates unnecessary objects (and thus GC pressure).
You can manually take advantage of the token range information if the
driver does not take this into account for you. Then you can play with the
concurrency and the batch size of a single query against one node.
Basically, what you/driver should do is to transform the query into a series
of "SELECT * FROM TABLE WHERE TOKEN IN (start, stop)".
I will need to look up the actual code, but the idea should be clear :)
Jirka H.
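A rough sketch of the parallel part of this idea, with the actual Cassandra call abstracted away. In valid CQL each range query would look something like `SELECT ... WHERE token(key) > start AND token(key) <= stop`. The names `ParallelScan`, `scanAll` and `scanRange` are invented for this example, and `scanRange` stands in for whatever runs one range query against the cluster:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelScan {
  // Runs scanRange once per token range on a fixed pool of `concurrency`
  // threads and returns the results in the order of `ranges`.
  def scanAll[R](ranges: Seq[(BigInt, BigInt)], concurrency: Int)(
      scanRange: (BigInt, BigInt) => Seq[R]): Seq[R] = {
    val pool = Executors.newFixedThreadPool(concurrency)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val futures = ranges.map { case (start, stop) =>
        Future(scanRange(start, stop))
      }
      // Blocks until every range has been scanned.
      Await.result(Future.sequence(futures), Duration.Inf).flatten
    } finally pool.shutdown()
  }
}
```

This version collects everything in memory, which is only reasonable for small results. For a table with hundreds of millions of rows, the rows would more likely be processed or written out inside `scanRange` itself.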
On 02/11/2015 11:26 AM, Ja Sam wrote:
> Is there a simple way (or even a complicated one) to speed up a
> SELECT * FROM [table] query?
> I need to get all rows from one table every day. I split the tables and
> create one for each day, but the query is still quite slow (200 million
> records).
>
> I was thinking about running this query in parallel, but I don't know if
> it is possible.