Re: Write timeout on other nodes when joining a new node (in new DC)

2015-10-20 Thread Jiri Horky
Hi all,

so after a deep investigation, we found out that we are hitting this problem:

https://issues.apache.org/jira/browse/CASSANDRA-8058

Jiri Horky

On 10/20/2015 12:00 PM, Jiri Horky wrote:
> Hi all,
>
> we are experiencing a strange behavior when we are trying to bootstrap a
> new node. The problem is that the Recent Write Latency goes to 2s on all
> the other Cassandra nodes (which are receiving user traffic), which
> corresponds to our setting of "write_request_timeout_in_ms: 2000".
>
> We use Cassandra 2.0.10 and are trying to convert to vnodes and increase
> the replication factor. So we are adding a new node in a new DC (marked as
> DCXA) as the only node in the new DC, with replication factor 3. The reason
> for the higher RF is that we will be converting another 2 existing servers
> to the new DC (vnodes) and we want them to get all the data.
>
> The replication settings look like this:
> ALTER KEYSPACE slw WITH replication = {
>   'class': 'NetworkTopologyStrategy',
>   'DC4': '1',
>   'DC5': '1',
>   'DC2': '1',
>   'DC3': '1',
>   'DC0': '1',
>   'DC1': '1',
>   'DC0A': '3',
>   'DC1A': '3',
>   'DC2A': '3',
>   'DC3A': '3',
>   'DC4A': '3',
>   'DC5A': '3'
> };
>
> We added the nodes for DC0A-DC4A without any effect on the existing
> nodes (DCX without A). When we try to add DC5A, the above-mentioned
> problem happens, 100% reproducibly.
>
> I tried to increase the number of concurrent_writes from 32 to 128 on the
> old nodes and also tried to increase the number of flush writers, both with no
> effect. The strange thing is that the load, CPU usage, GC, network
> throughput - everything is fine on the old nodes which are reporting 2s
> of write latency. Nodetool tpstats does not show any blocked/pending
> operations.
>
> I think I must be hitting some limit (because of the overall number of
> replicas?) somewhere.
>
> Any input would be greatly appreciated.
>
> Thanks
> Jirka H.
>



Write timeout on other nodes when joining a new node (in new DC)

2015-10-20 Thread Jiri Horky
Hi all,

we are experiencing a strange behavior when we are trying to bootstrap a
new node. The problem is that the Recent Write Latency goes to 2s on all
the other Cassandra nodes (which are receiving user traffic), which
corresponds to our setting of "write_request_timeout_in_ms: 2000".

We use Cassandra 2.0.10 and are trying to convert to vnodes and increase
the replication factor. So we are adding a new node in a new DC (marked as
DCXA) as the only node in the new DC, with replication factor 3. The reason
for the higher RF is that we will be converting another 2 existing servers
to the new DC (vnodes) and we want them to get all the data.

The replication settings look like this:
ALTER KEYSPACE slw WITH replication = {
  'class': 'NetworkTopologyStrategy',
  'DC4': '1',
  'DC5': '1',
  'DC2': '1',
  'DC3': '1',
  'DC0': '1',
  'DC1': '1',
  'DC0A': '3',
  'DC1A': '3',
  'DC2A': '3',
  'DC3A': '3',
  'DC4A': '3',
  'DC5A': '3'
};

We added the nodes for DC0A-DC4A without any effect on the existing
nodes (DCX without A). When we try to add DC5A, the above-mentioned
problem happens, 100% reproducibly.

I tried to increase the number of concurrent_writes from 32 to 128 on the
old nodes and also tried to increase the number of flush writers, both with no
effect. The strange thing is that the load, CPU usage, GC, network
throughput - everything is fine on the old nodes which are reporting 2s
of write latency. Nodetool tpstats does not show any blocked/pending
operations.

I think I must be hitting some limit (because of the overall number of
replicas?) somewhere.

Any input would be greatly appreciated.

Thanks
Jirka H.



Re: Availability testing of Cassandra nodes

2015-04-09 Thread Jiri Horky
Hi Jack,

it seems there is some misunderstanding. There are two things. One is
whether Cassandra works for the application, which may (and should) be true
even if some of the nodes are actually down. The other thing is that
even in this case you want to be notified that there are faulty
Cassandra nodes.

Now I am trying to tackle the latter case; I am not having issues with
how client-side load balancing works.

Jirka H.

On 04/09/2015 07:15 AM, Ajay wrote:
> Adding Java driver forum.
>
> Even we like to know more on this.
>
> -
> Ajay
>
> On Wed, Apr 8, 2015 at 8:15 PM, Jack Krupansky
> mailto:jack.krupan...@gmail.com>> wrote:
>
> Just a couple of quick comments:
>
> 1. The driver is supposed to be doing availability and load
> balancing already.
> 2. If your cluster is lightly loaded, it isn't necessary to be so
> precise with load balancing.
> 3. If your cluster is heavily loaded, it won't help. Solution is
> to expand your cluster so that precise balancing of requests
> (beyond what the driver does) is not required.
>
> Is there anything special about your use case that you feel is
> worth the extra treatment?
>
> If you are having problems with the driver balancing requests and
> properly detecting available nodes or see some room for
> improvement, make sure to file the issues so that they can be fixed.
>
>
> -- Jack Krupansky
>
> On Wed, Apr 8, 2015 at 10:31 AM, Jiri Horky  <mailto:ho...@avast.com>> wrote:
>
> Hi all,
>
> we are thinking of how to best proceed with availability testing of
> Cassandra nodes. It is becoming more and more apparent that it is a rather
> complex task. We thought that we should try to read and write to each
> Cassandra node in a "monitoring" keyspace, using a unique value with a low
> TTL. This helps to find an issue, but it also triggers flapping of
> unaffected hosts, as the key of the value which is being inserted
> sometimes belongs to an affected host and sometimes not. Now, we could
> calculate the right value to insert so we can be sure it will hit the
> host we are connecting to, but then you have the replication factor and
> consistency level, so you cannot really be sure that it actually tests
> the ability of the given host to write values.
>
> So we ended up thinking that the best approach is to connect to each
> individual host, read some system keyspace (which might be on a
> different disk drive...), which should be local, and then check several
> JMX values that could indicate an error, plus JVM statistics (full heap, GC
> overhead). Moreover, we will monitor our applications that are
> using Cassandra (mostly with the DataStax driver) more closely and try to get
> failed-node information from them.
>
> How do others do the testing?
>
> Jirka H.
>
>
>



Availability testing of Cassandra nodes

2015-04-08 Thread Jiri Horky
Hi all,

we are thinking of how to best proceed with availability testing of
Cassandra nodes. It is becoming more and more apparent that it is a rather
complex task. We thought that we should try to read and write to each
Cassandra node in a "monitoring" keyspace, using a unique value with a low
TTL. This helps to find an issue, but it also triggers flapping of
unaffected hosts, as the key of the value which is being inserted
sometimes belongs to an affected host and sometimes not. Now, we could
calculate the right value to insert so we can be sure it will hit the
host we are connecting to, but then you have the replication factor and
consistency level, so you cannot really be sure that it actually tests
the ability of the given host to write values.

So we ended up thinking that the best approach is to connect to each
individual host, read some system keyspace (which might be on a
different disk drive...), which should be local, and then check several
JMX values that could indicate an error, plus JVM statistics (full heap, GC
overhead). Moreover, we will monitor our applications that are
using Cassandra (mostly with the DataStax driver) more closely and try to get
failed-node information from them.
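A rough sketch of such a per-host probe (again untested, with a made-up host
name). The driver's WhiteListPolicy pins the request to a single coordinator,
and system.local is owned by the node itself, so this exercises its local
read path:

import com.datastax.driver.core.{Cluster, ConsistencyLevel, SimpleStatement}
import com.datastax.driver.core.policies.{RoundRobinPolicy, WhiteListPolicy}
import java.net.InetSocketAddress
import java.util.Collections

object PerHostProbe {
  // Returns true if the given node answers a node-local read itself.
  def probe(host: String, port: Int = 9042): Boolean = {
    // WhiteListPolicy restricts the driver to this single coordinator,
    // so the check cannot silently succeed via another node.
    val whiteList = Collections.singletonList(new InetSocketAddress(host, port))
    val cluster = Cluster.builder()
      .addContactPoint(host)
      .withLoadBalancingPolicy(new WhiteListPolicy(new RoundRobinPolicy(), whiteList))
      .build()
    try {
      val session = cluster.connect()
      val stmt = new SimpleStatement("SELECT release_version FROM system.local")
      stmt.setConsistencyLevel(ConsistencyLevel.ONE)
      session.execute(stmt).one() != null
    } catch {
      case _: Exception => false
    } finally {
      cluster.close()
    }
  }
}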

How do others do the testing?

Jirka H.


Re: How to speed up SELECT * query in Cassandra

2015-02-16 Thread Jiri Horky
Hi,

thanks for the reference, I really appreciate that you shared your
experience.

Could you please share how much data you store on the cluster and what
the HW configuration of the nodes is? I am really impressed that you are
able to read 100M records in ~4 minutes on 4 nodes. That is something like
100k reads per second per node, which is something we are quite far away from.

It leads me to the question whether reading from Spark goes through
Cassandra's JVM and thus through the normal read path, or whether it reads the
sstables directly from disk sequentially and possibly filters out
old/tombstoned values by itself? If it is the latter, then I understand
why it can perform that well.

Thank you.

Regards
Jirka H.

On 02/14/2015 09:17 PM, mck wrote:
> Jirka,
>
>> But I am really interested in how it can work well with Spark/Hadoop where
>> you basically need to read all the data as well (as far as I understand
>> that).
>
> I can't give you any benchmarking between technologies (nor am i
> particularly interested in getting involved in such a discussion) but i
> can share our experiences with Cassandra, Hadoop, and Spark, over the
> past 4+ years, and hopefully assure you that Cassandra+Spark is a smart
> choice.
>
> On a four-node cluster we were running 5000+ small hadoop jobs each day,
> each finishing within two minutes, often within one minute, resulting in
> (give or take) a billion records read and 150 million records written
> from and to c*.
> These small jobs are incrementally processing on limited partition key
> sets each time. These jobs are primarily reading data from a "raw events
> store" that has a ttl of 3 months and 22+Gb of tombstones a day (reads
> over old partition keys are rare).
>
> We also run full-table-scan jobs and have never come across any issues
> particular to that. There are hadoop map/reduce settings to increase
> durability if you have tables with troublesome partition keys.
>
> This is also a cluster that serves requests to web applications that
> need low latency.
>
> We recently wrote a spark job that does full table scans over 100
> million+ rows, involves a handful of stages (two tables, 9 maps, 4
> reduces, and 2 joins), and writes 5 million rows back to a new table.
> This job runs in ~260 seconds.
>
> Spark is becoming a natural complement to schema evolution for
> cassandra, something you'll want to do to keep your schema optimised
> against your read request patterns, even little things like switching
> cluster keys around. 
>
> With any new technology hitting some hurdles (especially if you go
> wandering outside recommended practices) will of course be part of the
> game, but that said I've only had positive experiences with this
> community's ability to help out (and do so quickly).
>
> Starting from scratch i'd use Spark (on scala) over Hadoop no questions
> asked. 
> Otherwise Cassandra has always been our 'big data' platform,
> hadoop/spark is just an extra tool on top.
> We've never kept data in hdfs and are very grateful for having made that
> choice.
>
> ~mck
>
> ref
> https://prezi.com/vt98oob9fvo4/cassandra-summit-cassandra-and-hadoop-at-finnno/



Re: High GC activity on node with 4TB on data

2015-02-12 Thread Jiri Horky
Number of cores: 2 x 6 cores x 2 (HT).

I do agree with you that the hardware is certainly oversized for
just one Cassandra instance, but we got a very good price since we ordered
several tens of the same nodes for a different project. That's why we use
it for multiple Cassandra instances.

Jirka H.

On 02/12/2015 04:18 PM, Eric Stevens wrote:
> > each node has 256G of memory, 24x1T drives, 2x Xeon CPU
>
> I don't have first hand experience running Cassandra on such massive
> hardware, but it strikes me that these machines are dramatically
> oversized to be good candidates for Cassandra (though I wonder how
> many cores are in those CPUs; I'm guessing closer to 18 than 2 based
> on the other hardware).
>
> A larger cluster of smaller hardware would be a much better shape for
> Cassandra.  Or several clusters of smaller hardware since you're
> running multiple instances on this hardware - best practices have one
> instance per host no matter the hardware size.
>
> On Thu, Feb 12, 2015 at 12:36 AM, Jiri Horky  <mailto:ho...@avast.com>> wrote:
>
> Hi Chris,
>
> On 02/09/2015 04:22 PM, Chris Lohfink wrote:
>>  - number of tombstones - how can I reliably find it out?
>> https://github.com/spotify/cassandra-opstools
>> https://github.com/cloudian/support-tools
> thanks.
>>
>> If not getting much compression it may be worth trying to disable
>> it, it may contribute but its very unlikely that its the cause of
>> the gc pressure itself.
>>
>> 7000 sstables but STCS? Sounds like compactions couldn't keep
>> up.  Do you have a lot of pending compactions (nodetool)?  You
>> may want to increase your compaction throughput (nodetool) to see
>> if you can catch up a little, it would cause a lot of heap
>> overhead to do reads with that many.  May even need to take more
>> drastic measures if it can't catch back up.
> I am sorry, I was wrong. We actually do use LCS (the switch was
> done recently). There are almost no pending compactions. We have
> increased the sstable size to 768M, so it should help as well.
>
>>
>> May also be good to check `nodetool cfstats` for very wide
>> partitions.  
> There are basically none, this is fine.
>
> It seems that the problem really comes from having so much data in
> so many sstables, so the
> org.apache.cassandra.io.compress.CompressedRandomAccessReader
> classes consume more memory than 0.75*HEAP_SIZE, which triggers
> the CMS over and over.
>
> We have turned off the compression and so far, the situation seems
> to be fine.
>
> Cheers
> Jirka H.
>
>
>>
>> Theres a good chance if under load and you have over 8gb heap
>> your GCs could use tuning.  The bigger the nodes the more manual
>> tweaking it will require to get the most out of
>> them https://issues.apache.org/jira/browse/CASSANDRA-8150 also
>> has some ideas.
>>
>> Chris
>>
>> On Mon, Feb 9, 2015 at 2:00 AM, Jiri Horky > <mailto:ho...@avast.com>> wrote:
>>
>> Hi all,
>>
>> thank you all for the info.
>>
>> To answer the questions:
>>  - we have 2 DCs with 5 nodes in each, each node has 256G of
>> memory, 24x1T drives, 2x Xeon CPUs - there are multiple
>> cassandra instances running for different projects. The node
>> itself is powerful enough.
>>  - there are 2 keyspaces, one with 3 replicas per DC, one with 1
>> replica per DC (because of the amount of data and because it
>> serves more or less like a cache)
>>  - there are about 4k/s Request-response, 3k/s Read and 2k/s
>> Mutation requests - numbers are a sum over all nodes
>>  - we use STCS (LCS would be quite IO-heavy for this amount of
>> data)
>>  - number of tombstones - how can I reliably find it out?
>>  - the biggest CF (3.6T per node) has 7000 sstables
>>
>> Now, I understand that the best practice for Cassandra is to
>> run "with the minimum size of heap which is enough", which in
>> this case we thought is about 12G - there is always 8G
>> consumed by the SSTable readers. Also, I thought that a high
>> number of tombstones creates pressure in the new space (which
>> can then cause pressure in the old space as well), but this is
>> not what we are seeing. We see continuous GC activity in the Old
>> generation only.
>>
>> Also, I noti

Re: High GC activity on node with 4TB on data

2015-02-11 Thread Jiri Horky
Hi Chris,

On 02/09/2015 04:22 PM, Chris Lohfink wrote:
>  - number of tombstones - how can I reliably find it out?
> https://github.com/spotify/cassandra-opstools
> https://github.com/cloudian/support-tools
thanks.
>
> If not getting much compression it may be worth trying to disable it,
> it may contribute but its very unlikely that its the cause of the gc
> pressure itself.
>
> 7000 sstables but STCS? Sounds like compactions couldn't keep up.  Do
> you have a lot of pending compactions (nodetool)?  You may want to
> increase your compaction throughput (nodetool) to see if you can catch
> up a little, it would cause a lot of heap overhead to do reads with
> that many.  May even need to take more drastic measures if it can't
> catch back up.
I am sorry, I was wrong. We actually do use LCS (the switch was done
recently). There are almost no pending compactions. We have increased
the sstable size to 768M, so it should help as well.

>
> May also be good to check `nodetool cfstats` for very wide partitions.  
There are basically none, this is fine.

It seems that the problem really comes from having so much data in so
many sstables, so the
org.apache.cassandra.io.compress.CompressedRandomAccessReader classes
consume more memory than 0.75*HEAP_SIZE, which triggers the CMS over
and over.

We have turned off the compression and so far, the situation seems to be
fine.

Cheers
Jirka H.

>
> Theres a good chance if under load and you have over 8gb heap your GCs
> could use tuning.  The bigger the nodes the more manual tweaking it
> will require to get the most out of
> them https://issues.apache.org/jira/browse/CASSANDRA-8150 also has
> some ideas.
>
> Chris
>
> On Mon, Feb 9, 2015 at 2:00 AM, Jiri Horky  <mailto:ho...@avast.com>> wrote:
>
> Hi all,
>
> thank you all for the info.
>
> To answer the questions:
>  - we have 2 DCs with 5 nodes in each, each node has 256G of
> memory, 24x1T drives, 2x Xeon CPUs - there are multiple cassandra
> instances running for different projects. The node itself is
> powerful enough.
>  - there are 2 keyspaces, one with 3 replicas per DC, one with 1
> replica per DC (because of the amount of data and because it serves
> more or less like a cache)
>  - there are about 4k/s Request-response, 3k/s Read and 2k/s
> Mutation requests - numbers are a sum over all nodes
>  - we use STCS (LCS would be quite IO-heavy for this amount of data)
>  - number of tombstones - how can I reliably find it out?
>  - the biggest CF (3.6T per node) has 7000 sstables
>
> Now, I understand that the best practice for Cassandra is to run
> "with the minimum size of heap which is enough", which in this
> case we thought is about 12G - there is always 8G consumed by the
> SSTable readers. Also, I thought that a high number of tombstones
> creates pressure in the new space (which can then cause pressure in
> the old space as well), but this is not what we are seeing. We see
> continuous GC activity in the Old generation only.
>
> Also, I noticed that the biggest CF has a compression factor of 0.99,
> which basically means that the data comes compressed already. Do
> you think that turning off the compression would help with memory
> consumption?
>
> Also, I think that tuning CMSInitiatingOccupancyFraction=75 might
> help here, as it seems that 8G is something that Cassandra needs
> for bookkeeping of this amount of data and that this was slightly
> above the 75% limit, which triggered the CMS again and again.
>
> I will definitely have a look at the presentation.
>
> Regards
> Jiri Horky
>
>
> On 02/08/2015 10:32 PM, Mark Reddy wrote:
>> Hey Jiri, 
>>
>> While I don't have any experience running 4TB nodes (yet), I
>> would recommend taking a look at a presentation by Aaron Morton
>> on large
>> nodes: 
>> http://planetcassandra.org/blog/cassandra-community-webinar-videoslides-large-nodes-with-cassandra-by-aaron-morton/
>> to see if you can glean anything from that.
>>
>> I would note that at the start of his talk he mentions that in
>> version 1.2 we can now talk about nodes around 1 - 3 TB in size,
>> so if you are storing anything more than that you are getting
>> into very specialised use cases.
>>
>> If you could provide us with some more information about your
>> cluster setup (No. of CFs, read/write patterns, do you delete /
>> update often, etc.) that may help in getting you to a better place.
>>
>>
>> Regards,
>> Mark
>>
>> On 8 February 2015 at 21:10, Kevin Burton 

Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jiri Horky
Hi,

here are some snippets of code in scala which should get you started.

Jirka H.

loop { lastRow =>
  val query = lastRow match {
    case Some(row) => nextPageQuery(row, upperLimit)
    case None      => initialQuery(lowerLimit)
  }
  session.execute(query).all
}

private def nextPageQuery(row: Row, upperLimit: String): String = {
  val tokenPart = "token(%s) > token(0x%s) and token(%s) < %s"
    .format(rowKeyName, hex(row.getBytes(rowKeyName)), rowKeyName, upperLimit)
  basicQuery.format(tokenPart)
}

private def initialQuery(lowerLimit: String): String = {
  val tokenPart = "token(%s) >= %s".format(rowKeyName, lowerLimit)
  basicQuery.format(tokenPart)
}

private def calculateRanges: (BigDecimal, BigDecimal, IndexedSeq[(BigDecimal, BigDecimal)]) = {
  tokenRange match {
    case Some((start, end)) =>
      Logger.info("Token range given: {}",
        "<" + start.underlying.toPlainString + ", " + end.underlying.toPlainString + ">")
      val tokenSpaceSize = end - start
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency)
        yield (start + (i * rangeSize), start + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
    case None =>
      val tokenSpaceSize = partitioner.max - partitioner.min
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency)
        yield (partitioner.min + (i * rangeSize), partitioner.min + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
  }
}

private val basicQuery = {
  "select %s, %s, %s, writetime(%s) from %s where %s%s limit %d%s".format(
    rowKeyName, columnKeyName, columnValueName, columnValueName, columnFamily,
    "%s", // template
    whereCondition, pageSize,
    if (cqlAllowFiltering) " allow filtering" else "")
}

case object Murmur3 extends Partitioner {
  override val min = BigDecimal(-2).pow(63)
  override val max = BigDecimal(2).pow(63) - 1
}

case object Random extends Partitioner {
  override val min = BigDecimal(0)
  override val max = BigDecimal(2).pow(127) - 1
}


On 02/11/2015 02:21 PM, Ja Sam wrote:
> Your answer looks very promising
>
>  How do you calculate start and stop?
>
> On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky  <mailto:ho...@avast.com>> wrote:
>
> The fastest way I am aware of is to do the queries in parallel to
> multiple cassandra nodes and make sure that you only ask them for keys
> they are responsible for. Otherwise, the node needs to resend your
> query
> which is much slower and creates unnecessary objects (and thus GC
> pressure).
>
> You can manually take advantage of the token range information, if the
> driver does not take this into account for you. Then, you can play with
> the concurrency and batch size of a single query against one node.
> Basically, what you/the driver should do is transform the query into a
> series of "SELECT * FROM table WHERE token(key) > start AND token(key) <= stop"
> queries.
>
> I will need to look up the actual code, but the idea should be
> clear :)
>
> Jirka H.
>
>
> On 02/11/2015 11:26 AM, Ja Sam wrote:
> > Is there a simple way (or even a complicated one) to speed up a
> > SELECT * FROM [table] query?
> > I need to get all rows from one table every day. I split tables, and
> > create one for each day, but still the query is quite slow (200 million
> > records)
> >
> > I was thinking about run this query in parallel, but I don't know if
> > it is possible
>
>



Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jiri Horky
Well, I always wondered how Cassandra can be used in a Hadoop-like
environment where you basically need to do full table scans.

I need to say that our experience is that Cassandra is perfect for
writing and for reading specific values by key, but definitely not for reading
all of the data out of it. Some of our projects found out that doing
that with a non-trivial amount of data in a timely manner is close to impossible in
many situations. We are slowly moving to storing the data in HDFS and
possibly reprocessing it on a daily basis for such use cases (statistics).

This is nothing against Cassandra; it cannot be perfect for everything.
But I am really interested in how it can work well with Spark/Hadoop, where
you basically need to read all the data as well (as far as I understand
it).

Jirka H.

On 02/11/2015 01:51 PM, DuyHai Doan wrote:
> "The very nature of cassandra's distributed nature vs partitioning
> data on hadoop makes spark on hdfs actually faster than on cassandra"
>
> Prove it. Did you ever have a look into the source code of the
> Spark/Cassandra connector to see how data locality is achieved before
> throwing out such statement ?
>
> On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON)
> mailto:mvallemil...@bloomberg.net>> wrote:
>
> > cassandra makes a very poor datawarehouse or long term time series store
>
> Really? This is not the impression I have... I think Cassandra is
> good to store large amounts of data and historical information,
> it's only not good to store temporary data.
> Netflix has a large amount of data and it's all stored in
> Cassandra, AFAIK.
>
> > The very nature of cassandra's distributed nature vs partitioning data 
> on hadoop makes spark on hdfs
> actually faster than on cassandra.
>
> I am not sure about the current state of Spark support for
> Cassandra, but I guess if you create a map reduce job, the
> intermediate map results will be still stored in HDFS, as it
> happens to hadoop, is this right? I think the problem with Spark +
> Cassandra or with Hadoop + Cassandra is that the hard part spark
> or hadoop does, the shuffling, could be done out of the box with
> Cassandra, but no one takes advantage of that. What if a map /
> reduce job used a temporary CF in Cassandra to store intermediate
> results?
>
> From: user@cassandra.apache.org 
> Subject: Re: How to speed up SELECT * query in Cassandra
>
> I use spark with cassandra, and you dont need DSE.
>
> I see a lot of people ask this same question below (how do I
> get a lot of data out of cassandra?), and my question is
> always, why aren't you updating both places at once?
>
> For example, we use hadoop and cassandra in conjunction with
> each other, we use a message bus to store every event in both,
> aggregrate in both, but only keep current data in cassandra
> (cassandra makes a very poor datawarehouse or long term time
> series store) and then use services to process queries that
> merge data from hadoop and cassandra.  
>
> Also, spark on hdfs gives more flexibility in terms of large
> datasets and performance.  The very nature of cassandra's
> distributed nature vs partitioning data on hadoop makes spark
> on hdfs actually faster than on cassandra
>
>
>
> -- 
> *Colin Clark* 
> +1 612 859 6129 
> Skype colin.p.clark
>
> On Feb 11, 2015, at 4:49 AM, Jens Rantil  > wrote:
>
>>
>> On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/
>> LONDON) > > wrote:
>>
>> If you use Cassandra enterprise, you can use hive, AFAIK.
>>
>>
>> Even better, you can use Spark/Shark with DSE.
>>
>> Cheers,
>> Jens
>>
>>
>> -- 
>> Jens Rantil
>> Backend engineer
>> Tink AB
>>
>> Email: jens.ran...@tink.se 
>> Phone: +46 708 84 18 32
>> Web: www.tink.se 
>>
>
>
>



Re: How to speed up SELECT * query in Cassandra

2015-02-11 Thread Jiri Horky
The fastest way I am aware of is to do the queries in parallel to
multiple cassandra nodes and make sure that you only ask them for keys
they are responsible for. Otherwise, the node needs to resend your query
which is much slower and creates unnecessary objects (and thus GC pressure).

You can manually take advantage of the token range information, if the
driver does not take this into account for you. Then, you can play with
the concurrency and batch size of a single query against one node.
Basically, what you/the driver should do is transform the query into a
series of "SELECT * FROM table WHERE token(key) > start AND token(key) <= stop"
queries.

I will need to look up the actual code, but the idea should be clear :)
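Roughly, the idea looks like this - an untested, self-contained sketch
assuming the Murmur3 partitioner, with a made-up keyspace, table and
partition key:

import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

object TokenRangeScan {
  def main(args: Array[String]): Unit = {
    // Hypothetical contact point, keyspace, table and partition key names.
    val cluster = Cluster.builder().addContactPoint("cass-node-1").build()
    val session = cluster.connect()

    // Murmur3Partitioner token space is [-2^63, 2^63 - 1].
    val min = BigInt(-2).pow(63)
    val max = BigInt(2).pow(63) - 1
    val concurrency = 16
    val step = (max - min) / concurrency

    // Build one query per sub-range; these can be run in parallel, ideally
    // each against a node that owns the range (run sequentially here for brevity).
    val queries = (0 until concurrency).map { i =>
      val start = min + step * i
      val end = if (i == concurrency - 1) max else min + step * (i + 1)
      val cond = if (i == 0) s"token(pk) >= $start" else s"token(pk) > $start"
      s"SELECT * FROM my_ks.my_table WHERE $cond AND token(pk) <= $end"
    }

    // For a real scan you would page through results instead of calling all().
    val rowCount = queries.map(q => session.execute(q).all().asScala.size).sum
    println(s"scanned $rowCount rows")
    cluster.close()
  }
}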

Jirka H.


On 02/11/2015 11:26 AM, Ja Sam wrote:
> Is there a simple way (or even a complicated one) to speed up a
> SELECT * FROM [table] query?
> I need to get all rows from one table every day. I split tables, and
> create one for each day, but still the query is quite slow (200 million
> records)
>
> I was thinking about run this query in parallel, but I don't know if
> it is possible



Re: High GC activity on node with 4TB on data

2015-02-09 Thread Jiri Horky
Hi all,

thank you all for the info.

To answer the questions:
 - we have 2 DCs with 5 nodes in each, each node has 256G of memory,
24x1T drives, 2x Xeon CPUs - there are multiple cassandra instances
running for different projects. The node itself is powerful enough.
 - there are 2 keyspaces, one with 3 replicas per DC, one with 1 replica per
DC (because of the amount of data and because it serves more or less like a
cache)
 - there are about 4k/s Request-response, 3k/s Read and 2k/s Mutation
requests - numbers are a sum over all nodes
 - we use STCS (LCS would be quite IO-heavy for this amount of data)
 - number of tombstones - how can I reliably find it out?
 - the biggest CF (3.6T per node) has 7000 sstables

Now, I understand that the best practice for Cassandra is to run "with
the minimum size of heap which is enough", which in this case we thought
is about 12G - there is always 8G consumed by the SSTable readers.
Also, I thought that a high number of tombstones creates pressure in the new
space (which can then cause pressure in the old space as well), but this is
not what we are seeing. We see continuous GC activity in the Old generation
only.

Also, I noticed that the biggest CF has a compression factor of 0.99, which
basically means that the data comes compressed already. Do you think that
turning off the compression would help with memory consumption?
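For reference, turning compression off for a CF is a simple ALTER TABLE; a
rough sketch with made-up keyspace/table names (existing sstables stay
compressed until they are rewritten, e.g. by compaction or upgradesstables):

import com.datastax.driver.core.Cluster

object DisableCompression {
  def main(args: Array[String]): Unit = {
    // Hypothetical contact point, keyspace and table names.
    val cluster = Cluster.builder().addContactPoint("cass-node-1").build()
    val session = cluster.connect()
    // An empty sstable_compression disables compression for newly written sstables.
    session.execute(
      "ALTER TABLE my_keyspace.my_cf WITH compression = {'sstable_compression': ''}")
    cluster.close()
  }
}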

Also, I think that tuning CMSInitiatingOccupancyFraction=75 might help
here, as it seems that 8G is something that Cassandra needs for
bookkeeping of this amount of data and that this was slightly above the 75%
limit, which triggered the CMS again and again.

I will definitely have a look at the presentation.

Regards
Jiri Horky

On 02/08/2015 10:32 PM, Mark Reddy wrote:
> Hey Jiri, 
>
> While I don't have any experience running 4TB nodes (yet), I would
> recommend taking a look at a presentation by Aaron Morton on large
> nodes: 
> http://planetcassandra.org/blog/cassandra-community-webinar-videoslides-large-nodes-with-cassandra-by-aaron-morton/
> to see if you can glean anything from that.
>
> I would note that at the start of his talk he mentions that in version
> 1.2 we can now talk about nodes around 1 - 3 TB in size, so if you are
> storing anything more than that you are getting into very specialised
> use cases.
>
> If you could provide us with some more information about your cluster
> setup (No. of CFs, read/write patterns, do you delete / update often,
> etc.) that may help in getting you to a better place.
>
>
> Regards,
> Mark
>
> On 8 February 2015 at 21:10, Kevin Burton  <mailto:bur...@spinn3r.com>> wrote:
>
> Do you have a lot of individual tables?  Or lots of small
> compactions?
>
> I think the general consensus is that (at least for Cassandra),
> 8GB heaps are ideal.  
>
> If you have lots of small tables it’s a known anti-pattern (I
> believe) because the Cassandra internals could do a better job on
> handling the in memory metadata representation.
>
> I think this has been improved in 2.0 and 2.1 though so the fact
> that you’re on 1.2.18 could exasperate the issue.  You might want
> to consider an upgrade (though that has its own issues as well).
>
> On Sun, Feb 8, 2015 at 12:44 PM, Jiri Horky  <mailto:ho...@avast.com>> wrote:
>
> Hi all,
>
> we are seeing quite high GC pressure (in the old space, by the CMS GC
> algorithm) on a node with 4TB of data. It runs C* 1.2.18 with 12G of
> heap memory (2G for the new space). The node runs fine for a couple of
> days until the GC activity starts to rise and reaches about 15% of the
> C* activity, which causes dropped messages and other problems.
>
> Taking a look at heap dump, there is about 8G used by
> SSTableReader
> classes in
> org.apache.cassandra.io.compress.CompressedRandomAccessReader.
>
> Is this something expected and have we just reached the limit
> of how much data a single Cassandra instance can handle, or is
> it possible to tune it better?
>
> Regards
> Jiri Horky
>
>
>
>
> -- 
> Founder/CEO Spinn3r.com <http://Spinn3r.com>
> Location: *San Francisco, CA*
> blog:* *http://burtonator.wordpress.com
> … or check out my Google+ profile
> <https://plus.google.com/102718274791889610666/posts>
> <http://spinn3r.com>
>
>



High GC activity on node with 4TB on data

2015-02-08 Thread Jiri Horky
Hi all,

we are seeing quite high GC pressure (in the old space, by the CMS GC algorithm)
on a node with 4TB of data. It runs C* 1.2.18 with 12G of heap memory
(2G for the new space). The node runs fine for a couple of days until the GC
activity starts to rise and reaches about 15% of the C* activity, which
causes dropped messages and other problems.

Taking a look at heap dump, there is about 8G used by SSTableReader
classes in org.apache.cassandra.io.compress.CompressedRandomAccessReader.

Is this something expected and have we just reached the limit of how
much data a single Cassandra instance can handle, or is it possible to
tune it better?

Regards
Jiri Horky


Re: Node down during move

2014-12-23 Thread Jiri Horky
Hi,

just a follow-up. We've seen this behavior multiple times now. It seems
that the receiving node loses connectivity to the cluster and thus
thinks that it is the sole online node, whereas the rest of the cluster
thinks that it is the only offline node, right after the streaming
is over. I am not sure what causes that, but it is reproducible. A restart
of the affected node helps.

We have 3 datacenters (RF=1 for each datacenter) where we are moving the
tokens. This happens only in one of them.

Regards
Jiri Horky


On 12/19/2014 08:20 PM, Jiri Horky wrote:
> Hi list,
>
> we added a new node to an existing 8-node cluster with C* 1.2.9 without
> vnodes and because we are almost totally out of space, we are shuffling
> the tokens of one node after another (not in parallel). During one of these
> move operations, the receiving node died and thus the streaming failed:
>
>  WARN [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,227
> StorageService.java (line 3703) Streaming to /X.Y.Z.18 failed
>  INFO [RMI TCP Connection(12940)-X.Y.Z.17] 2014-12-19 19:25:56,233
> ColumnFamilyStore.java (line 629) Enqueuing flush of
> Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
>  INFO [FlushWriter:3772] 2014-12-19 19:25:56,238 Memtable.java (line
> 461) Writing Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
> ERROR [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,246
> CassandraDaemon.java (line 192) Exception in thread Thread[Streaming to
> /X.Y.Z.18:2,5,RMI Runtime]
> java.lang.RuntimeException: java.io.IOException: Broken pipe
> at com.google.common.base.Throwables.propagate(Throwables.java:160)
> at
> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.io.IOException: Broken pipe
> at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
>
> After restart of the receiving node, we tried to perform the move again,
> but it failed with:
>
> Exception in thread "main" java.io.IOException: target token
> 113427455640312821154458202477256070486 is already owned by another node.
> at
> org.apache.cassandra.service.StorageService.move(StorageService.java:2930)
>
> So we tried to move it with a token just 1 higher, to trigger the
> movement. This didn't move anything, but finished successfully:
>
>  INFO [Thread-5520] 2014-12-19 20:00:24,689 StreamInSession.java (line
> 199) Finished streaming session 4974f3c0-87b1-11e4-bf1b-97d9ac6bd256
> from /X.Y.Z.18
>
> Now, it is quite improbable that the first streaming had finished and the node
> died just after copying everything, as the ERROR was the last message
> about streaming in the logs. Is there any way to make sure the data
> was really moved and thus that running nodetool cleanup is safe?
>
> Thank you.
> Jiri Horky



Node down during move

2014-12-19 Thread Jiri Horky
Hi list,

we added a new node to an existing 8-node cluster with C* 1.2.9 without
vnodes and because we are almost totally out of space, we are shuffling
the tokens of one node after another (not in parallel). During one of these
move operations, the receiving node died and thus the streaming failed:

 WARN [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,227
StorageService.java (line 3703) Streaming to /X.Y.Z.18 failed
 INFO [RMI TCP Connection(12940)-X.Y.Z.17] 2014-12-19 19:25:56,233
ColumnFamilyStore.java (line 629) Enqueuing flush of
Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
 INFO [FlushWriter:3772] 2014-12-19 19:25:56,238 Memtable.java (line
461) Writing Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
ERROR [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,246
CassandraDaemon.java (line 192) Exception in thread Thread[Streaming to
/X.Y.Z.18:2,5,RMI Runtime]
java.lang.RuntimeException: java.io.IOException: Broken pipe
at com.google.common.base.Throwables.propagate(Throwables.java:160)
at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)

After restart of the receiving node, we tried to perform the move again,
but it failed with:

Exception in thread "main" java.io.IOException: target token
113427455640312821154458202477256070486 is already owned by another node.
at
org.apache.cassandra.service.StorageService.move(StorageService.java:2930)

So we tried to move it with a token just 1 higher, to trigger the
movement. This didn't move anything, but finished successfully:

 INFO [Thread-5520] 2014-12-19 20:00:24,689 StreamInSession.java (line
199) Finished streaming session 4974f3c0-87b1-11e4-bf1b-97d9ac6bd256
from /X.Y.Z.18

Now, it is quite improbable that the first streaming had finished and the node
died just after copying everything, as the ERROR was the last message
about streaming in the logs. Is there any way to make sure the data
was really moved and thus that running nodetool cleanup is safe?
   
Thank you.
Jiri Horky


Re: Fail to reconnect to other nodes after intermittent network failure

2014-08-05 Thread Jiri Horky
OK, ticket 7696 [1] created.

Jiri Horky

[1] https://issues.apache.org/jira/browse/CASSANDRA-7696

On 08/05/2014 07:57 PM, Robert Coli wrote:
>
> On Tue, Aug 5, 2014 at 5:48 AM, Jiri Horky  <mailto:ho...@avast.com>> wrote:
>
> What puzzles me is the fact that the authentication apparently started
> to work after the network recovered but the exchange of data did not.
>
> I would like to understand what could have caused the problems and how to
> avoid them in the future.
>
>
> Very few people use SSL and very few people use auth, you have
> probably hit an edge case.
>
> I would file a JIRA with the details you described above.
>
> =Rob
>  



Fail to reconnect to other nodes after intermittent network failure

2014-08-05 Thread Jiri Horky
LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2337)
at
com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2252)
... 20 more
Caused by: org.apache.cassandra.exceptions.ReadTimeoutException:
Operation timed out - received only 0 responses.
at
org.apache.cassandra.service.ReadCallback.get(ReadCallback.java:105)
at
org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:930)
at
org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:815)
at
org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:140)
at org.apache.cassandra.auth.Auth.selectUser(Auth.java:245)
... 29 more

These exceptions stopped appearing at 2014-08-01 08:59:17,48 (and did not
reappear after that), and Cassandra seemed to be happy, as it marked the
other two nodes as up for a while. But after a few tens of seconds, it
marked them as DOWN again and kept them in this state until a restart three
days later, even though the network connectivity was stable:

 INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809)
InetAddress /77.234.42.20 is now UP
 INFO [HintedHandoff:2] 2014-08-01 09:02:37,305
HintedHandOffManager.java (line 296) Started hinted handoff for host:
9252f37c-1c9a-418b-a49f-6065511946e4 with IP: /77.234.42.20
 INFO [GossipStage:1] 2014-08-01 09:02:37,305 Gossiper.java (line 809)
InetAddress /77.234.44.20 is now UP
 INFO [HintedHandoff:1] 2014-08-01 09:02:37,308
HintedHandOffManager.java (line 296) Started hinted handoff for host:
97b1943a-3689-4e4a-a39d-d5a11c0cc309 with IP: /77.234.44.20
 INFO [BatchlogTasks:1] 2014-08-01 09:02:45,724 ColumnFamilyStore.java
(line 633) Enqueuing flush of Memtable-batchlog@1311733948(239547/247968
serialized/live bytes, 32 ops)
 INFO [FlushWriter:1221] 2014-08-01 09:02:45,725 Memtable.java (line
398) Writing Memtable-batchlog@1311733948(239547/247968 serialized/live
bytes, 32 ops)
 INFO [FlushWriter:1221] 2014-08-01 09:02:45,738 Memtable.java (line
443) Completed flushing; nothing needed to be retained.  Commitlog
position was ReplayPosition(segmentId=1403712545417, position=20482758)
 INFO [HintedHandoff:1] 2014-08-01 09:02:47,312
HintedHandOffManager.java (line 427) Timed out replaying hints to
/77.234.44.20; aborting (0 delivered)
 INFO [HintedHandoff:2] 2014-08-01 09:02:47,312
HintedHandOffManager.java (line 427) Timed out replaying hints to
/77.234.42.20; aborting (0 delivered)
 INFO [GossipTasks:1] 2014-08-01 09:02:59,690 Gossiper.java (line 823)
InetAddress /77.234.44.20 is now DOWN
 INFO [GossipTasks:1] 2014-08-01 09:03:08,693 Gossiper.java (line 823)
InetAddress mia10.ff.avast.com/77.234.42.20 is now DOWN

After one day, the node started to discard the hints it had saved for the
other nodes.

 WARN [OptionalTasks:1] 2014-08-02 09:09:10,277
HintedHandoffMetrics.java (line 78) /77.234.44.20 has 9039 dropped
hints, because node is down past configured hint window.
 WARN [OptionalTasks:1] 2014-08-02 09:09:10,290
HintedHandoffMetrics.java (line 78) /77.234.42.20 has 8816 dropped
hints, because node is down past configured hint window.


The "nodetool status" marked the other two nodes as down. After restart,
it reconnected to the cluster without any problem.

What puzzles me is the fact that the authentication apparently started
to work after the network recovered but the exchange of data did not.

I would like to understand what could have caused the problems and how to
avoid them in the future.

Any pointers would be appreciated.

Regards
Jiri Horky



Re: upgrade from cassandra 1.2.3 -> 1.2.13 + start using SSL

2014-01-11 Thread Jiri Horky
Thank you both for the answers!

Jiri Horky

On 01/10/2014 02:52 AM, Aaron Morton wrote:
> We avoid mixing versions for a long time, but we always upgrade one
> node and check the application is happy before proceeding. e.g. wait
> for 30 minutes before upgrading the others. 
>
> If you snapshot before upgrading, and have to roll back after 30
> minutes you can roll back to the snapshot and use repair to fix the
> data on disk. 
>
> Hope that helps. 
>
> -
> Aaron Morton
> New Zealand
> @aaronmorton
>
> Co-Founder & Principal Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> On 9/01/2014, at 7:24 am, Robert Coli  <mailto:rc...@eventbrite.com>> wrote:
>
>> On Wed, Jan 8, 2014 at 1:17 AM, Jiri Horky > <mailto:ho...@avast.com>> wrote:
>>
>> I am specifically interested in whether it is possible to upgrade just one
>> node and keep it running like that for some time, i.e. whether the gossip
>> protocol is compatible in both directions. We are a bit afraid to
>> upgrade all nodes to 1.2.13 at once in case we would need to
>> roll back.
>>
>>
>> This is not officially supported. It will probably work for these
>> particular versions, but it is not recommended.
>>
>> The most serious potential issue is an inability to replace the new
>> node if it fails. There's also the problem of not being able to
>> repair until you're back on the same versions. And other, similar,
>> undocumented edge cases...
>>
>> =Rob
>>
>



upgrade from cassandra 1.2.3 -> 1.2.13 + start using SSL

2014-01-08 Thread Jiri Horky
Hi all,

I would appreciate advice on whether it is a good idea to upgrade from
cassandra 1.2.3 to 1.2.13 and how to best proceed. The particular
cluster consists of 3 nodes (each one in a different DC, holding 1
replica) with relatively low traffic and a 10GB load per node.

I am specifically interested in whether it is possible to upgrade just one
node and keep it running like that for some time, i.e. whether the gossip
protocol is compatible in both directions. We are a bit afraid to
upgrade all nodes to 1.2.13 at once in case we would need to roll back.

I know that the sstable format changed in 1.2.5, so in case we need to
roll back, the newly written data would need to be synchronized from the
old servers.

Also, after the migration to 1.2.13, we would like to start using
node-to-node encryption. I imagine that you need to configure it on all
nodes at once, so it would require a small outage.

Thank you in advance
Jiri Horky



Re: Cass 2.0.0: Extensive memory allocation when row_cache enabled

2013-11-12 Thread Jiri Horky
Hi,
On 11/12/2013 05:29 AM, Aaron Morton wrote:
>>> Are you doing large slices or do could you have a lot of tombstones
>>> on the rows ? 
>> don't really know - how can I monitor that?
> For tombstones, do you do a lot of deletes ? 
> Also in v2.0.2 cfstats has this 
>
> Average live cells per slice (last five minutes): 0.0
> Average tombstones per slice (last five minutes): 0.0
>
> For large slices you need to check your code. e.g. do you anything
> that reads lots of columns or very large columns or lets the user
> select how many columns to read?
>
> The org.apache.cassandra.db.ArrayBackedSortedColumns in the trace back
> is used during reads (.e.g.
> org.apache.cassandra.db.filter.SliceQueryFilter)
thanks for the explanation; I will try to provide some figures (but
unfortunately not from 2.0.2).
>
>>> You probably want the heap to be 4G to 8G in size, 10G will
>>> encounter longer pauses. 
>>> Also the size of the new heap may be too big depending on the number
>>> of cores. I would recommend trying 800M
>> I tried to decrease it first to 384M and then to 128M with no change in
>> the behaviour. I don't really mind the extra memory overhead of the cache
>> - being able to actually point to it with objects - but I don't really
>> see the reason why it should create/delete that many objects so
>> quickly.
> Not sure what you changed to 384M.
Sorry for the confusion. I meant to say that I tried to decrease the row
cache size to 384M and then to 128M, and the GC times did not change at
all (still ~30% of the time).
>
>>> Shows the heap growing very quickly. This could be due to wide reads
>>> or a high write throughput. 
>> Well, both prg01 and prg02 receive the same load, which is about
>> ~150-250 (during peak) read requests per second and 100-160 write
>> requests per second. The only ones with the heap growing rapidly and GC
>> kicking in are the nodes with the row cache enabled.
>
> This sounds like on a row cache miss cassandra is reading the whole
> row, which happens to be a wide row. I would also guess some writes
> are going to the rows and they are getting invalidated out of the row
> cache. 
>
> The row cache is not great for rows the update frequently and/or wide
> rows. 
>
> How big are the rows ? use nodetool cfstats and nodetool cfhistograms.
I will get in touch with the developers and gather the data from the cf*
commands in a few days (I am out of the office for some days).

Thanks for the pointers, will get in touch.

Cheers
Jiri Horky



Re: Cass 2.0.0: Extensive memory allocation when row_cache enabled

2013-11-08 Thread Jiri Horky
Hi,

On 11/07/2013 05:18 AM, Aaron Morton wrote:
>> Class Name  
>>| Shallow Heap | Retained Heap
>> ---
>>  
>>   |  |  
>> java.nio.HeapByteBuffer @ 0x7806a0848
>>   |   48 |80
>> '- name org.apache.cassandra.db.Column @ 0x7806424e8
>>|   32 |   112
>>|- [338530] java.lang.Object[540217] @ 0x57d62f560 Unreachable
>>   |2,160,888 | 2,160,888
>>|- [338530] java.lang.Object[810325] @ 0x591546540
>>   |3,241,320 | 7,820,328
>>|  '- elementData java.util.ArrayList @ 0x75e8424c0  
>>|   24 | 7,820,352
>>| |- list
>> org.apache.cassandra.db.ArrayBackedSortedColumns$SlicesIterator @
>> 0x5940e0b18  |   48 |   128
>>| |  '- val$filteredIter
>> org.apache.cassandra.db.filter.SliceQueryFilter$1 @ 0x5940e0b48  
>>   |   32 | 7,820,568
>>| | '- val$iter
>> org.apache.cassandra.db.filter.QueryFilter$2 @ 0x5940e0b68
>> Unreachable   |   24 | 7,820,592
>>| |- this$0, parent java.util.ArrayList$SubList @ 0x5940e0bb8
>>|   40 |40
>>| |  '- this$1 java.util.ArrayList$SubList$1 @ 0x5940e0be0
>>   |   40 |80
>>| | '- currentSlice
>> org.apache.cassandra.db.ArrayBackedSortedColumns$SlicesIterator @
>> 0x5940e0b18|   48 |   128
>>| |'- val$filteredIter
>> org.apache.cassandra.db.filter.SliceQueryFilter$1 @ 0x5940e0b48  
>> |   32 | 7,820,568
>>| |   '- val$iter
>> org.apache.cassandra.db.filter.QueryFilter$2 @ 0x5940e0b68
>> Unreachable |   24 | 7,820,592
>>| |- columns org.apache.cassandra.db.ArrayBackedSortedColumns
>> @ 0x5b0a33488  |   32 |56
>>| |  '- val$cf
>> org.apache.cassandra.db.filter.SliceQueryFilter$1 @ 0x5940e0b48  
>> |   32 | 7,820,568
>>| | '- val$iter
>> org.apache.cassandra.db.filter.QueryFilter$2 @ 0x5940e0b68
>> Unreachable   |   24 | 7,820,592
>>| '- Total: 3 entries
>>|  |  
>>|- [338530] java.lang.Object[360145] @ 0x7736ce2f0 Unreachable
>>   |1,440,600 | 1,440,600
>>'- Total: 3 entries  
>>|  |  
>
> Are you doing large slices or do could you have a lot of tombstones on
> the rows ?
don't really know - how can I monitor that?
>
>> We have disabled row cache on one node to see  the  difference. Please
>> see attached plots from visual VM, I think that the effect is quite
>> visible.
> The default row cache is of the JVM heap, have you changed to
> the ConcurrentLinkedHashCacheProvider ?
Answered by Chris already :) No.
>
> One way the SerializingCacheProvider could impact GC is if the CF
> takes a lot of writes. The SerializingCacheProvider invalidates the
> row when it is written to and had to read the entire row and serialise
> it on a cache miss.
>
>>> -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms10G -Xmx10G
>>> -Xmn1024M -XX:+HeapDumpOnOutOfMemoryError
> You probably want the heap to be 4G to 8G in size, 10G will encounter
> longer pauses. 
> Also the size of the new heap may be too big depending on the number
> of cores. I would recommend trying 800M
I tried to decrease it first to 384M and then to 128M with no change in the
behaviour. I don't really mind the extra memory overhead of the cache - being
able to actually point to it with objects - but I don't really see the
reason why it should create/delete that many objects so quickly.
>
>
>> prg01.visual.vm.png
> Shows the heap growing very quickly. This could be due to wide reads
> or a high write throughput.
Well, both prg01 and prg02 receive the same load, which is about ~150-250
(during peak) read requests per second and 100-160 write requests per
second. The only ones with the heap growing rapidly and GC kicking in are the
nodes with the row cache enabled.

>
> Hope that helps.
Thank you!

Jiri Horky



Cassandra 1.2.9 -> 2.0.0 update can't

2013-11-05 Thread Jiri Horky
Hi,

I know it is "Not Supported and also Not Advised", but currently we have
half of our cassandra cluster on 1.2.9 and the other half on 2.0.0 as we
got stuck during upgrade. We know already that it was not very wise
decision but that's what we've got now.

I have basically two questions. First, it would really help us if there
was a way to make requests to the newer cassandra nodes work with data on the
older ones. We get
Caused by: com.datastax.driver.core.exceptions.UnavailableException: Not
enough replica available for query at consistency ONE (1 required but
only 0 alive),

which is surely caused by the fact that the newer nodes see the older ones as
DOWN in nodetool ring. We "solve" this problem by connecting to the
older nodes, which see the whole cluster as online, but it would help us
deploy the newer version of our application if we could connect to 2.0.0 as
well.

Second, to avoid any future problems when upgrading, is there any
documentation on what is incompatible between the 1.2.x line and 2.0.x, what
is not supposed to work together, and what the recommended steps are? I
don't see any warning in the official documentation ("Changes impacting
upgrade") about an incompatibility of the gossip (?) protocol between 2.0.0
and 1.2.9 that would cause the described problem.

Thank you
Jiri Horky


Re: Recompacting all sstables

2013-11-02 Thread Jiri Horky
Hi,

On 11/01/2013 09:15 PM, Robert Coli wrote:
> On Fri, Nov 1, 2013 at 12:47 PM, Jiri Horky  <mailto:ho...@avast.com>> wrote:
>
> since we upgraded half of our Cassandra cluster to 2.0.0 and we
> use LCS,
> we hit CASSANDRA-6284 bug.
>
>
> 1) Why upgrade a cluster to 2.0.0? Hopefully not a production cluster? [1]
I think you already guessed the answer :) It is a production cluster; we
needed some features (particularly compare-and-set) that are only present in 2.0
because of the applications. Besides, somebody had to discover the
regression, right? :) Thanks for the link.
>
> 3) What do you mean by "upgraded half of our Cassandra cluster"? That
> is Not Supported and also Not Advised... for example, before the
> streaming change in 2.x line, a cluster in such a state may be unable
> to have nodes added, removed or replaced.
We are in the middle of the migration from 1.2.9 to 2.0, during which we are
also upgrading our application, which can only be run against 2.0 due to
various technical details. It is rather hard to explain, but we hoped it
would last just a few days, and it is definitely not a state we wanted to
stay in. Since we hit the bug, we got stalled in the middle of the
migration.
>
> So the question. What is the best way to recompact all the sstables so
> that the data in one sstable within a level would contain more or less the
> right portion of the data
>
> ... 
>
> Based on the documentation, I can only think of switching to SizeTiered
> compaction, doing a major compaction and then switching back to LCS.
>
>
> That will work, though be aware of  the implication of CASSANDRA-6092
> [2]. Briefly, if the CF in question is not receiving write load, you
> will be unable to promote your One Big SSTable from L0 to L1. In that
> case, you might want to consider running sstable_split (and then
> restarting the node) in order to split your One Big SSTable into two
> or more smaller ones.
Hmm, thinking about it a bit more, I am unsure this will actually help.
If I understand things correctly, assuming a uniform distribution of newly
received keys in L0 (ensured by RandomPartitioner), in order for LCS to
work optimally, I need to:

a) get a uniform distribution of keys across sstables in one level, i.e.
in every level each sstable will cover more or less the same range of keys
b) have the sstables in each level cover almost the whole space of keys the
node is responsible for
c) propagate sstables to higher levels in a uniform fashion, e.g.
round-robin or random (over time, the probability of choosing an
sstable as a candidate should be the same for all sstables in the level)

By splitting the sorted Big SSTable, I will get a bunch of
non-overlapping sstables. So I will surely achieve a). Point c) is fixed
by the patch. But what about b)? It probably depends on the order of
compactions across levels, i.e. whether the compactions in the various levels
are run in parallel and interleaved or not. In case it compacts
all the tables from one level and only after that starts to compact
sstables in the higher level, etc., one will end up in a very similar situation
to the one caused by the referenced bug (because of the round-robin fashion of
choosing candidates), i.e. having the biggest keys in L1 and the smallest
keys in the highest level. So in that case, it would actually not help
at all.

Does it make sense or am I completely wrong? :)

BTW: Not a very thought-out idea, but wouldn't it actually be better to
select candidates completely randomly?

Cheers
Jiri Horky

>
> =Rob
>
> [1] https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/
> [2] https://issues.apache.org/jira/browse/CASSANDRA-6092



Recompacting all sstables

2013-11-01 Thread Jiri Horky
Hi all

since we upgraded half of our Cassandra cluster to 2.0.0 and we use LCS,
we hit the CASSANDRA-6284 bug. So basically all data in sstables created
after the upgrade is wrongly (non-uniformly within compaction levels)
distributed. This causes a huge overhead when compacting new sstables
(see the bug for the details).

After applying the patch, the distribution of the data within a level is
supposed to recover itself over time, but we would like not to wait a
month or so until it gets better.

So the question. What is the best way to recompact all the sstables so
that the data in one sstable within a level would contain more or less the
right portion of the data, in other words, so that keys would be uniformly
distributed across sstables within a level? (e.g.: assuming a total token
range for a node of 1..10,000, and given that L2 should contain 100
sstables, all sstables within L2 should cover a range of ~100 tokens).

Based on the documentation, I can only think of switching to SizeTiered
compaction, doing a major compaction, and then switching back to LCS.

Thanks in advance
Jiri Horky