Re: CQL Composite Key Seen After Table Creation

2016-01-15 Thread Chris Burroughs

On 01/06/2016 04:47 PM, Robert Coli wrote:

On Wed, Jan 6, 2016 at 12:54 PM, Chris Burroughs 
wrote:

The problem with that approach is that manually editing the local schema
tables in a live cluster is wildly dangerous. I *think* this would work:


  * Make triple sure no schema changes are happening on the cluster.

  * Update schema tables on each node --> drain --> restart


I think that would work too, and probably be lower risk than modifying on
one and trying to get the others to pull via resetlocalschema. But I agree
it seems "wildly dangerous".


We did this, and a day later it appears successful.

I am still fuzzy on how schema "changes" propagate when you edit the 
schema tables directly and am unsure if the drain/restart rain dance was 
strictly necessary, but it felt safer. (Obviously even if I was sure 
now, that would not be behavior to count on, and I hope not to need to 
do this again.)




Re: CQL Composite Key Seen After Table Creation

2016-01-06 Thread Chris Burroughs
I work with Amir, and after further experimentation I can shed a little more 
light on what exactly is going on under the hood.  For background, our goal is 
to take data that is currently being read and written via thrift, switch reads 
to CQL, and then switch writes to CQL.  This is an alternative to deleting all 
of our data and starting over, or being forever stuck on super old thrift 
clients (both of those options obviously suck).  The data models involved are 
absurdly simple (a single key with a handful of static columns).

TLDR: Metadata is complicated.  What is the least dangerous way to make direct 
changes to system.schema_columnfamilies and system.schema_columns?

Anyway, given some super simple Foo and Bar column families:

create keyspace Test with  placement_strategy = 
'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = 
{replication_factor:1};
use Test;
create column family Foo with comparator = UTF8Type and 
key_validation_class=UTF8Type and column_metadata = [ {column_name: title, 
validation_class: UTF8Type}];
create column family Bar with comparator = UTF8Type and 
key_validation_class=UTF8Type;
update column family Bar with column_metadata = [ {column_name: title, 
validation_class: UTF8Type}];

(The salient difference, as described by Amir, is when the column_metadata is 
set: at the same time as creation, or later.)

Now we can inject a little data and see that from thrift everything looks fine:

[default@Test] set Foo['testkey']['title']='mytitle';
Value inserted.
Elapsed time: 19 msec(s).
[default@Test] set Bar['testkey']['title']='mytitle';
Value inserted.
Elapsed time: 4.47 msec(s).

[default@Test] list Foo;
Using default limit of 100
Using default cell limit of 100
---
RowKey: testkey
=> (name=title, value=mytitle, timestamp=1452108082972000)

1 Row Returned.
Elapsed time: 268 msec(s).
[default@Test] list Bar;
Using default limit of 100
Using default cell limit of 100
---
RowKey: testkey
=> (name=title, value=mytitle, timestamp=1452108093739000)

1 Row Returned.
Elapsed time: 9.3 msec(s).

But from cql the Bar column does not look like the data we wrote:

cqlsh> select * from "Test"."Foo";

 key     | title
---------+---------
 testkey | mytitle

(1 rows)


cqlsh> select * from "Test"."Bar";

 key     | column1 | value            | title
---------+---------+------------------+---------
 testkey |   title | 0x6d797469746c65 | mytitle


It's not just that these phantom columns are ugly; CQL thinks column1 is part 
of a composite primary key.  Since there **is no column1**, that renders the 
data un-query-able with WHERE clauses.

Just to make sure it's not thrift that is doing something unexpected, the 
sstables show the expected structure:

$ ./tools/bin/sstable2json 
/data/sstables/data/Test/Foo-d3348860b4af11e5b456639406f48f1b/Test-Foo-ka-1-Data.db
 
[
{"key": "testkey",
 "cells": [["title","mytitle",1452110466924000]]}
]




So, what appeared as an innocent variation made years ago when the thrift 
schema was written causes very different results in CQL.

Digging into the schema tables shows what is going on in more detail:

> select keyspace_name, columnfamily_name, column_aliases, comparator, is_dense, key_aliases, value_alias from system.schema_columnfamilies where keyspace_name='Test';

 keyspace_name | columnfamily_name | column_aliases | comparator                               | is_dense | key_aliases | value_alias
---------------+-------------------+----------------+------------------------------------------+----------+-------------+-------------
          Test |               Bar |    ["column1"] | org.apache.cassandra.db.marshal.UTF8Type |     True |     ["key"] |       value
          Test |               Foo |             [] | org.apache.cassandra.db.marshal.UTF8Type |    False |     ["key"] |        null

> select keyspace_name, columnfamily_name, column_name, validator from system.schema_columns where keyspace_name='Test';

 keyspace_name | columnfamily_name | column_name | validator
---------------+-------------------+-------------+--------------------------------------------
          Test |               Bar |     column1 |  org.apache.cassandra.db.marshal.UTF8Type
          Test |               Bar |         key |  org.apache.cassandra.db.marshal.UTF8Type
          Test |               Bar |       title |  org.apache.cassandra.db.marshal.UTF8Type
          Test |               Bar |       value | org.apache.cassandra.db.marshal.BytesType
          Test |               Foo |         key |  org.apache.cassandra.db.marshal.UTF8Type
          Test |               Foo |       title |  org.apache.cassandra.db.marshal.UTF8Type


Now the interesting bit is that the metadata can be manually "fixed":

UPDATE system.schema_columnfamilies
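A sketch of what the full fix might look like, inferred from Foo's metadata in the tables above (hypothetical; as discussed elsewhere in this thread, editing the schema tables on a live cluster is wildly dangerous, so verify on a disposable cluster first):

```sql
-- Make Bar's schema metadata match Foo's (assumed values):
UPDATE system.schema_columnfamilies
   SET column_aliases = '[]', is_dense = false, value_alias = null
 WHERE keyspace_name = 'Test' AND columnfamily_name = 'Bar';

-- Remove the phantom column1/value entries:
DELETE FROM system.schema_columns
 WHERE keyspace_name = 'Test' AND columnfamily_name = 'Bar'
   AND column_name = 'column1';
DELETE FROM system.schema_columns
 WHERE keyspace_name = 'Test' AND columnfamily_name = 'Bar'
   AND column_name = 'value';
```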

Re: Migration 1.2.14 to 2.0.8 causes "Tried to create duplicate hard link" at startup

2014-06-10 Thread Chris Burroughs

Were you able to solve or work around this problem?

On 06/05/2014 11:47 AM, Tom van den Berge wrote:

Hi,

I'm trying to migrate a development cluster from 1.2.14 to 2.0.8. When
starting up 2.0.8, I'm seeing the following error in the logs:


  INFO 17:40:25,405 Snapshotting drillster, Account to
pre-sstablemetamigration
ERROR 17:40:25,407 Exception encountered during startup
java.lang.RuntimeException: Tried to create duplicate hard link to
/Users/tom/cassandra-data/data/drillster/Account/snapshots/pre-sstablemetamigration/drillster-Account-ic-65-Filter.db
 at
org.apache.cassandra.io.util.FileUtils.createHardLink(FileUtils.java:75)
 at
org.apache.cassandra.db.compaction.LegacyLeveledManifest.snapshotWithoutCFS(LegacyLeveledManifest.java:129)
 at
org.apache.cassandra.db.compaction.LegacyLeveledManifest.migrateManifests(LegacyLeveledManifest.java:91)
 at
org.apache.cassandra.db.compaction.LeveledManifest.maybeMigrateManifests(LeveledManifest.java:617)
 at
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:274)
 at
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
 at
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)


Does anyone have an idea how to solve this?


Thanks,
Tom





Re: Number of rows under one partition key

2014-06-04 Thread Chris Burroughs

https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/

Although by the simplistic version-count heuristic, the sheer quantity of 
releases in the 2.0.x line would now satisfy the constraint.


On 05/29/2014 08:08 PM, Paulo Ricardo Motta Gomes wrote:

Hey,

We are considering upgrading from 1.2 to 2.0, why don't you consider 2.0
ready for production yet, Robert? Have you wrote about this somewhere
already?

A bit off-topic in this discussion but it would be interesting to know,
your posts are generally very enlightening.

Cheers,


On Thu, May 29, 2014 at 8:51 PM, Robert Coli  wrote:


On Thu, May 15, 2014 at 6:10 AM, Vegard Berget  wrote:


I know this has been discussed before, and I know there are limitations
to how many rows one partition key in practice can handle.  But I am not
sure if number of rows or total data is the deciding factor.



Both. In terms of data size, partitions containing over a small number of
hundreds of Megabytes begin to see diminishing returns in some cases.
Partitions over 64 megabytes are compacted on disk, which should give you a
rough sense of what Cassandra considers a "large" partition.



Should we add another partition key to avoid 1 000 000 rows in the same
thrift-row (which is how I understand it is actually stored)?  Or is 1 000
000 rows okay?



Depending on row size and access patterns, 1Mn rows is not extremely
large. There are, however, some row sizes and operations where this order
of magnitude of columns might be slow.



Other considerations, for example compaction strategy and if we should do
an upgrade to 2.0 because of this (we will upgrade anyway, but if it is
recommended we will continue to use 2.0 in development and upgrade the
production environment sooner)



You should not upgrade to 2.0 in order to address this concern. You should
upgrade to 2.0 when it is stable enough to run in production, which IMO is
not yet. YMMV.



I have done some testing, inserting a million rows and selecting them
all, counting them and selecting individual rows (with both clientid and
id) and it seems fine, but I want to ask to be sure that I am on the right
track.



If the access patterns you are using perform the way you would like with
representative size data, sounds reasonable to me?

If you are able to select all million rows within a reasonable percentage
of the relevant timeout, I presume they cannot be too huge in terms of data
size! :D

=Rob









Re: alternative vnode upgrade strategy?

2014-06-04 Thread Chris Burroughs

On 05/28/2014 02:18 PM, William Oberman wrote:

1.) Upgrade all N nodes to vnodes in place
Start loop
2.) Boot a new node and let it bootstrap
3.) Decommission an old node
End loop


It's been a while since I had to think about the vnode migration, but 
I think this would fall prey to 
https://issues.apache.org/jira/browse/CASSANDRA-5525


Re: New node Unable to gossip with any seeds

2014-06-04 Thread Chris Burroughs
This generally means that the address you use to describe the seed node 
doesn't match how it is described in the second node's seed list.


CASSANDRA-6523 has some links that might be helpful.

On 05/26/2014 12:07 AM, Tim Dunphy wrote:

Hello,

  I am trying to spin up a new node using cassandra 2.0.7. Both nodes are at
Digital Ocean. The seed node is up and running and I can telnet to port
7000 on that host from the node I'm trying to start.

[root@cassandra02 apache-cassandra-2.0.7]# telnet 10.10.1.94 7000

Trying 10.10.1.94...

Connected to 10.10.1.94.

Escape character is '^]'.

But when I start cassandra on the new node I see the following exception:


INFO 00:01:34,744 Handshaking version with /10.10.1.94

ERROR 00:02:05,733 Exception encountered during startup

java.lang.RuntimeException: Unable to gossip with any seeds

 at
org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1193)

 at
org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:447)

 at
org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:656)

 at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:612)

 at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:505)

 at
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:362)

 at
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:480)

 at
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:569)

java.lang.RuntimeException: Unable to gossip with any seeds

 at
org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1193)

 at
org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:447)

 at
org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:656)

 at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:612)

 at
org.apache.cassandra.service.StorageService.initServer(StorageService.java:505)

 at
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:362)

 at
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:480)

 at
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:569)

Exception encountered during startup: Unable to gossip with any seeds

ERROR 00:02:05,742 Exception in thread
Thread[StorageServiceShutdownHook,5,main]

java.lang.NullPointerException

 at org.apache.cassandra.gms.Gossiper.stop(Gossiper.java:1270)

 at
org.apache.cassandra.service.StorageService$1.runMayThrow(StorageService.java:573)

 at
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)

 at java.lang.Thread.run(Thread.java:745)



I'm using the murmur3 partition on both nodes and I have the seed node's IP
listed in the cassandra.yaml of the new node. I'm just wondering what the
issue might be and how I can get around it.


Thanks

Tim








Re: Is the tarball for a given release in a Maven repository somewhere?

2014-05-22 Thread Chris Burroughs
Maven central has "bin.tar.gz" and "src.tar.gz" downloads for the 
'apache-cassandra' artifact.  Does that work for your use case?


http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22apache-cassandra%22

On 05/20/2014 05:30 PM, Clint Kelly wrote:

Hi all,

I am using the maven assembly plugin to build a project that contains
a development environment for a project that we've built at work on
top of Cassandra.  I'd like this development environment to include
the latest release of Cassandra.

Is there a maven repo anywhere that contains an artifact with the
Cassandra release in it?  I'd like to have the same Cassandra tarball
that you can download from the website be a dependency for my project.
  I can then have the assembly plugin untar it and customize some of
the conf files before taring up our entire development environment.
That way, anyone using our development environment would have access
to the various shell scripts and tools.

I poked around online and could not find what I was looking for.  Any
help would be appreciated!

Best regards,
Clint





Re: Backup procedure

2014-05-16 Thread Chris Burroughs
It's also good to note that only the Data files are already compressed. 
Depending on your data, the Index and other files may be a significant 
percentage of the total on-disk data.
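One quick way to check is to group on-disk bytes by sstable component. A sketch (the directory layout and component naming follow the Keyspace-CF-version-generation-Component.db pattern seen earlier in this digest; the sizes below are made up for illustration):

```python
import os
import tempfile
from collections import defaultdict

def component_sizes(sstable_dir):
    """Sum on-disk bytes per sstable component, keyed by the token
    before '.db' in names like Test-Foo-ka-1-Data.db."""
    sizes = defaultdict(int)
    for name in os.listdir(sstable_dir):
        if name.endswith('.db'):
            component = name.rsplit('-', 1)[-1][:-len('.db')]
            sizes[component] += os.path.getsize(os.path.join(sstable_dir, name))
    return dict(sizes)

# Demo with fake component files (sizes are illustrative):
d = tempfile.mkdtemp()
for comp, size in [('Data', 1000), ('Index', 400), ('Filter', 50)]:
    with open(os.path.join(d, 'Test-Foo-ka-1-%s.db' % comp), 'wb') as f:
        f.write(b'\0' * size)
print(sorted(component_sizes(d).items()))
# -> [('Data', 1000), ('Filter', 50), ('Index', 400)]
```

Here Index and Filter together are almost half the size of Data, so compressing only backups of the Data files would still leave a lot of uncompressed bytes.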


On 05/02/2014 01:14 PM, tommaso barbugli wrote:

In my tests compressing with lzop sstables (with cassandra compression
turned on) resulted in approx. 50% smaller files.
Thats probably because the chunks of data compressed by lzop are way bigger
than the average size of writes performed on Cassandra (not sure how data
is compressed but I guess it is done per single cell so unless one stores)


2014-05-02 19:01 GMT+02:00 Robert Coli :


On Fri, May 2, 2014 at 2:07 AM, tommaso barbugli wrote:


If you are thinking about using Amazon S3 storage I wrote a tool that
performs snapshots and backups on multiple nodes.
Backups are stored compressed on S3.
https://github.com/tbarbugli/cassandra_snapshotter



https://github.com/JeremyGrosser/tablesnap

SSTables in Cassandra are compressed by default, if you are re-compressing
them you may just be wasting CPU.. :)

=Rob








Re: What does the "rate" signify for latency in the JMX Metrics?

2014-05-16 Thread Chris Burroughs
They are exponential decaying moving averages (like Unix load averages) 
of the number of events per unit of time.


http://wiki.apache.org/cassandra/Metrics might help
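The mechanics can be sketched in Python. This assumes the tick scheme used by the Metrics library behind these MBeans: the meter counts events, ticks every 5 seconds, and folds each 5-second instantaneous rate into the moving average with a window-dependent alpha (like a Unix load average):

```python
import math

class EWMA:
    """Exponentially weighted moving average of an event rate,
    ticked every 5 seconds; alpha chosen so the average decays
    over the given window, like a Unix load average."""
    TICK_INTERVAL = 5.0  # seconds

    def __init__(self, window_minutes=1.0):
        self.alpha = 1.0 - math.exp(-self.TICK_INTERVAL / (window_minutes * 60.0))
        self.rate = 0.0        # events per second
        self.uncounted = 0     # events seen since the last tick
        self.initialized = False

    def update(self, n=1):
        self.uncounted += n

    def tick(self):
        instant_rate = self.uncounted / self.TICK_INTERVAL
        self.uncounted = 0
        if self.initialized:
            self.rate += self.alpha * (instant_rate - self.rate)
        else:
            self.rate = instant_rate
            self.initialized = True

# Feed a steady 675 events/second for 5 simulated minutes:
m = EWMA(window_minutes=1.0)
for _ in range(60):           # 60 ticks of 5 s = 5 minutes
    m.update(675 * 5)
    m.tick()
print(round(m.rate))          # prints 675
```

So a OneMinuteRate of ~676 with RateUnit = SECONDS means roughly 676 write calls per second, smoothed over about the last minute; it is not a count of metric submissions.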

On 04/17/2014 06:06 PM, Redmumba wrote:

Good afternoon,

I'm attempting to integrate the metrics generated via JMX into our internal
framework; however, the information for several of the metrics includes a
One/Five/Fifteen-minute "rate", with the RateUnit in "SECONDS".  For
example:

$>get -b

org.apache.cassandra.metrics:name=Latency,scope=Write,type=ClientRequest *
#mbean =
org.apache.cassandra.metrics:name=Latency,scope=Write,type=ClientRequest:
LatencyUnit = MICROSECONDS;

EventType = calls;

RateUnit = SECONDS;

MeanRate = 383.6944837362387;

FifteenMinuteRate = 868.8420188648543;

FiveMinuteRate = 817.5239450236011;

OneMinuteRate = 675.7673129014964;

Max = 498867.0;

Count = 31257426;

Min = 52.0;

50thPercentile = 926.0;

Mean = 1063.114029159023;

StdDev = 1638.1542477604232;

75thPercentile = 1064.75;

95thPercentile = 1304.55;

98thPercentile = 1504.39992;

99thPercentile = 2307.35104;

999thPercentile = 10491.8502;



What does the rate signify in this context?  For example, given the
OneMinuteRate of  675.7673129014964 and the unit of "seconds"--what is this
measuring?  Is this the rate of which metrics are submitted? i.e., there
were an average of (676 * 60 seconds) metrics submitted over the last
minute?

Thanks!





Re: row caching for frequently updated column

2014-05-14 Thread Chris Burroughs

You are close.

On 04/30/2014 12:41 AM, Jimmy Lin wrote:

thanks all for the pointers.

let' me see if I can put the sequences of event together 

1.2
people mis-understand/mis-use row cache, that cassandra cached the entire
row of data even if you are only looking for small subset of the row data.
e.g
select single_column from a_wide_row_table
will result in entire row cached even if you are only interested in one
single column of a row.



Yep!


2.0
and because of potential misuse of heap memory, Cassandra 2.0 remove heap
cache, and only support off-heap cache, which has a side effect that write
will invalidate the row cache(my original question)



"off-heap" is a common but misleading name for the 
SerializingCacheProvider.  It still stores several objects on heap per 
cached item and has to deser on read.



2.1
the coming 2.1 Cassandra will offer true cache by query, so the cached data
will be much more efficient even for wide rows(it cached what it needs).

do I get it right?
for the new 2.1 row caching, is it still true that a write or update to the
row will still invalidate the cached row ?



I don't think "true cache by query" is an accurate description of 
CASSANDRA-5357.  I think it's more like a "head of the row" cache.




Re: mixed nodes, some SSD some HD

2014-03-05 Thread Chris Burroughs
No.  If you have a heterogeneous clusters you should consider adjusting 
the number of vnodes per physical node.
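For example, in cassandra.yaml the per-node token count can be weighted by hardware capacity, since each node receives data roughly in proportion to its num_tokens (the values below are illustrative, assuming the SSD node has about twice the capacity of the others):

```yaml
# On the SSD node:
num_tokens: 512
```

```yaml
# On each spinning-disk node:
num_tokens: 256
```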


On 03/04/2014 10:47 PM, Elliot Finley wrote:

Using Cassandra 2.0.x

If I have a 3 node cluster and 2 of the nodes use spinning drives and 1 of
them uses SSD,  will the majority of the reads be routed to the SSD node
automatically because it has faster responses?

TIA,
Elliot





Re: Thrift Server Implementations

2014-03-05 Thread Chris Burroughs

On 02/13/2014 01:37 PM, Christopher Wirt wrote:

Anyway, today I moved the old HsHa implementation and the new
TThreadSelectorServer into a 2.0.5 checkout, hooked them in, built, did a
bit of testing and I'm now running live.



We found the TThreadSelectorServer performed the best getting us back under
our SLA.


Are you still running with the upstream TThreadSelectorServer?  Based on 
your experience is there any reason Cassandra should not adopt it?


Re: ring describe returns only public ips

2014-02-10 Thread Chris Burroughs
More generally, a thrift api or other mechanism for Astyanax to get the 
INTERNAL_IP seems necessary to use ConnectionPoolType.TOKEN_AWARE + 
NodeDiscoveryType.TOKEN_AWARE in a multi-dc setup.  Absent one I'm 
confused how that combination is possible.


On 02/06/2014 03:17 PM, Ted Pearson wrote:

We are using Cassandra 1.2.13 in a multi-datacenter setup. We are using 
Astyanax as the client, and we’d like to enable its token aware connection pool 
type and ring describe node discovery type. Unfortunately, I’ve found that both 
thrift’s describe_ring and `nodetool ring` only report the public IPs of the 
cassandra nodes. This means that Astyanax tries to reconnect to the public IPs 
of each node, which doesn’t work and just results in no hosts being available 
for queries according to Astyanax.

I know from `nodetool gossipinfo` (and the fact that the clusters work) that 
it's sharing the LOCAL_IP via gossip, but have no idea how or if it’s possible 
to get describe_ring to return local IPs, or if there is some alternative.

Thanks,

-Ted





Re: First SSTable file is not being compacted

2014-02-06 Thread Chris Burroughs

Sounds like you have done some solid test work.

I suggest reading https://issues.apache.org/jira/browse/CASSANDRA-6568 
and if you think your issue is the same adding your reproduction case 
there, otherwise create your own ticket.


On 02/06/2014 10:53 AM, Sameer Farooqui wrote:

Yeah, it's definitely repeatable. I have a lab environment set up where the
issue is occurring and I've recreated the lab environment 4 - 5 times and
it's occurred each time.

In my demodb.users CF I currently have 2 data SSTables on disk
(demodb-users-jb-1-Data.db and demodb-users-jb-6-Data.db). However, in
OpsCenter the CF: SSTable Count (demodb.users) graph shows only one SSTable.

The nodetool cfstats command also shows "SSTable count: 1" for this CF.


- SF


On Thu, Feb 6, 2014 at 8:54 AM, Chris Burroughs
wrote:


On 02/06/2014 01:17 AM, Sameer Farooqui wrote:


I'm running C* 2.0.4 and when I have a handful of SSTable files and
trigger
a manual compaction with 'nodetool compact' the first SSTable file doesn't
get compacted away.

Is there something special about the first SSTable that it remains even
after a SizedTierCompaction?




No, this is not expected behavior.  Do the number of live SSTables
reported match what is on disk?  Do you have a procedure that can repeat
this?








Re: First SSTable file is not being compacted

2014-02-06 Thread Chris Burroughs

On 02/06/2014 01:17 AM, Sameer Farooqui wrote:

I'm running C* 2.0.4 and when I have a handful of SSTable files and trigger
a manual compaction with 'nodetool compact' the first SSTable file doesn't
get compacted away.

Is there something special about the first SSTable that it remains even
after a SizedTierCompaction?



No, this is not expected behavior.  Do the number of live SSTables 
reported match what is on disk?  Do you have a procedure that can repeat 
this?




Re: what tool will create noncql columnfamilies in cassandra 3a

2014-02-06 Thread Chris Burroughs

On 02/05/2014 04:57 AM, Sylvain Lebresne wrote:

>How will users adjust the meta data of non cql column families

The rational for removing cassandra-cli is mainly that maintaining 2 fully
featured command line interface is a waste of the project resources in the
long
run. It's just a tool using the thrift interface however and you'll still be
able to adjust metadata through the thrift interface as before. As Patricia
mentioned, there is even some existing interactive options like pycassaShell
in the community.


It's also wasteful for the community to maintain multiple post-3.0 forks 
of cassandra-cli so they can continue using Cassandra.  It would be 
more efficient if they could pool their resources in a central place, 
like a code repo at Apache.


Re: Question about local reads with multiple data centers

2014-02-06 Thread Chris Burroughs

On 01/29/2014 08:07 PM, Donald Smith wrote:

My question: will the read process try to read first locally from the 
datacenter DC2 I specified in its connection string? I presume so.  (I 
doubt that it uses the client's IP address to decide which datacenter is 
closer. And I am unaware of another way to tell it to read locally.)



From the rest of this thread it looks like you were asking about how 
the client selected a Cassandra node to act as a coordinator.  Note 
however that if you are using a DC oblivious CL (ONE, QUORUM) then that 
Cassandra coordinator may send requests to the remote data center.




Also, will read repair happen between datacenters automatically 
("read_repair_chance=0.10")?  Or does that only happen within a single data 
center?


Yes, read_repair_chance is global.  There is a separate 
dclocal_read_repair_chance if you want to make local read repairs more common.
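For example (hypothetical table; option names as spelled in the 1.2/2.0 CQL grammar):

```sql
-- Shift read repair to the local DC only:
ALTER TABLE myks.mytable
  WITH read_repair_chance = 0.0
   AND dclocal_read_repair_chance = 0.1;
```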


Re: Question: ConsistencyLevel.ONE with multiple datacenters

2014-02-06 Thread Chris Burroughs
I think the scenario you outlined is correct.  The DES handles multiple 
DCs poorly and the LOCAL_ONE hammer is the best bet.


On 01/31/2014 12:40 PM, Paulo Ricardo Motta Gomes wrote:

Hey,

When adding a new data center to our production C* datacenter using the
procedure described in [1], some of our application requests were returning
null/empty values. Rebuild was not complete in the new datacenter, so my
guess is that some requests were being directed to the brand new datacenter
which still didn't have the data.

Our Hector client was connected only to the original nodes, with
autoDiscoverHosts=false and we use ConsistencyLevel.ONE for reads. The
keyspace schema was already configured to use both data centers.

My question is: is it possible that the dynamic snitch is choosing the
nodes in the new (empty) datacenter when CL=ONE? In this case, it's
mandatory to use CL=LOCAL_ONE during bootstrap/rebuild of a new datacenter,
otherwise empty data might be returned, correct?

Cheers,

[1]
http://www.datastax.com/documentation/cassandra/1.2/webhelp/cassandra/operations/ops_add_dc_to_cluster_t.html





Re: Row cache vs. OS buffer cache

2014-01-23 Thread Chris Burroughs

My experience has been that the row cache is much more effective.
However, reasonable row cache sizes are so small relative to RAM that I 
don't see it as a significant trade-off unless it's in a very memory 
constrained environment.  If you want to enable the row cache (a big if) 
you probably want it to be as big as it can be until you have reached 
the point of diminishing returns on the hit rate.


The "off-heap" cache still has many on-heap objects so it's doesn't 
really change that much conceptually, you will just end up with a 
different number for the "size".


On 01/23/2014 02:13 AM, Katriel Traum wrote:

Hello list,

I was wondering if anyone has any pointers or some advice regarding using the
row cache vs leaving it up to the OS buffer cache.

I run cassandra 1.1 and 1.2 with JNA, so off-heap row cache is an option.

Any input appreciated.
Katriel





Re: nodetool cleanup / TTL

2014-01-07 Thread Chris Burroughs

On 01/07/2014 01:38 PM, Tyler Hobbs wrote:

On Tue, Jan 7, 2014 at 7:49 AM, Chris Burroughs
wrote:


This has not reached a consensus in #cassandra in the past.  Does
`nodetool cleanup` also remove data that has expired from a TTL?



No, cleanup only removes rows that the node is not a replica for.



Is there some other mechanism for forcing expired data to be removed 
without also compacting? (major compaction having obvious problematic 
side effects, and user defined compaction being significant work to 
script up).




nodetool cleanup / TTL

2014-01-07 Thread Chris Burroughs
This has not reached a consensus in #cassandra in the past.  Does 
`nodetool cleanup` also remove data that has expired from a TTL?


Re: vnode in production

2014-01-06 Thread Chris Burroughs

On 01/06/2014 01:56 PM, Arindam Barua wrote:

Thanks for your responses. We are on 1.2.12 currently.
The fixes in 1.2.13 seem to help for clusters in the 500+ node range (like 
CASSANDRA-6409). Ours is below 50 now, so we plan to go ahead and enable vnodes 
with the 'add a new DC' procedure. We will try to upgrade to 1.2.13 or 1.2.14 
subsequently.


Your plan seems reasonable but in the interest of full disclosure 
CASSANDRA-6345 has been observed as a significant issue for clusters in 
the 50-75 node range.


Re: vnode in production

2014-01-06 Thread Chris Burroughs

On 01/02/2014 01:51 PM, Arindam Barua wrote:

1.   the stability of vnodes in production


I'm happily using vnodes in production now, but I would have trouble 
calling them stable for more than small clusters until very recently 
(1.2.13). CASSANDRA-6127 served as a master ticket for most of the 
issues if you are interested in the details.



2.   upgrading to vnodes in production


I am not aware of anyone who has succeeded with shuffle in production, 
but the 'add a new DC' procedure works.


Re: How to measure data transfer between data centers?

2013-12-04 Thread Chris Burroughs
https://wiki.apache.org/cassandra/Metrics has per-node Streaming metrics 
that include total bytes in/out.  That is only a small bit of what you 
want though.


For total DC bandwidth it might be more straightforward to measure this 
at the router/switch/fancy-network-gear level.


On 12/03/2013 06:25 AM, Tom van den Berge wrote:

Is there a way to know how much data is transferred between two nodes, or
more specifically, between two data centers?

I'm especially interested in how much data is being replicated from one
data center to another, to know how much of the available bandwidth is used.


Thanks,
Tom





MiscStage Backup

2013-11-26 Thread Chris Burroughs
I'm trying to debug a node that has a backup in MiscStage.  Starting a 
bit under 24 hours ago the number of Pending tasks jumped to a bit under 
400 and hovered around there.  It looks like these are repair requests from 
other nodes (tpstats on this node shows AntiEntropySessions: 0, 0, 0, which I 
think indicates it did not originate the repair).  After each MiscStage 
task completes, a series of streams is kicked off.


I am confused why MiscStage is backing up:
 (A) This node has only been down a few hours over the past week so it 
should not be wildly out of sync
 (B) no other node in this cluster has had a comparable backup of 
pending Misc stages.


Repairs are run on all nodes once a week.  Physical resources on this 
node are not particularity saturated compared to the rest of the 
cluster; reads are slower but I can't tell cause from effect in that case.


Graph of MiscStage pending tasks: http://imgur.com/sHqHTvt

This is with a 1.2.11-ish dual-DC vnode cluster.
"MiscStage:1" daemon prio=10 tid=0x7f84e8598800 nid=0x43b2 waiting on 
condition [0x7f83c3734000]
   java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x00069d23c700> (a 
java.util.concurrent.FutureTask$Sync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:969)
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1281)
at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:218)
at java.util.concurrent.FutureTask.get(FutureTask.java:83)
at 
org.apache.cassandra.utils.FBUtilities.waitOnFuture(FBUtilities.java:375)
at 
org.apache.cassandra.utils.FBUtilities.waitOnFutures(FBUtilities.java:368)
at 
org.apache.cassandra.streaming.StreamOut.flushSSTables(StreamOut.java:108)
at 
org.apache.cassandra.streaming.StreamOut.transferRanges(StreamOut.java:136)
at 
org.apache.cassandra.streaming.StreamOut.transferRanges(StreamOut.java:116)
at 
org.apache.cassandra.streaming.StreamRequestVerbHandler.doVerb(StreamRequestVerbHandler.java:44)
at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:56)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)



Re: Cassandra 1.1.6 - New node bootstrap not completing

2013-11-08 Thread Chris Burroughs

On 11/01/2013 03:03 PM, Robert Coli wrote:

On Fri, Nov 1, 2013 at 9:36 AM, Narendra Sharma
wrote:


I was successfully able to bootstrap the node. The issue was RF > 2.
Thanks again Robert.



For the record, I'm not entirely clear why bootstrapping two nodes into the
same range should have caused your specific bootstrap problem, but I am
glad to hear that bootstrapping one node at a time was a usable workaround.

=Rob



(A) If it can't work, shouldn't a node refuse to bootstrap if it sees 
another node already in that state?


(B) It would be nice if nodes in independent DCs could at least be 
bootstrapped at the same time.


Re: Why truncate previous hints when upgrade from 1.1.9 to 1.2.6?

2013-11-08 Thread Chris Burroughs

NEWS.txt has some details and suggested procedures

- The hints schema was changed from 1.1 to 1.2. Cassandra automatically
  snapshots and then truncates the hints column family as part of
  starting up 1.2 for the first time.  Additionally, upgraded nodes
  will not store new hints destined for older (pre-1.2) nodes. It is
  therefore recommended that you perform a cluster upgrade when all
  nodes are up. Because hints will be lost, a cluster-wide repair (with
  -pr) is recommended after upgrade of all nodes.

On 11/07/2013 07:33 AM, Boole.Z.Guo (mis.cnsh04.Newegg) 41442 wrote:

Hi all,
When I upgrade C* from 1.1.9 to 1.2.6, I notice that the previous 
hints column family would be directly truncated.
Can you tell me why?
Because consistency is important to my services.


Best Regards,
Boole Guo





Re: Endless loop LCS compaction

2013-11-08 Thread Chris Burroughs

On 11/07/2013 06:48 AM, Desimpel, Ignace wrote:

Total data size is only 3.5GB. Column family was created with SSTableSize: 10 MB


You may want to try a significantly larger size.

https://issues.apache.org/jira/browse/CASSANDRA-5727


Re: Cass 2.0.0: Extensive memory allocation when row_cache enabled

2013-11-07 Thread Chris Burroughs

On 11/06/2013 11:18 PM, Aaron Morton wrote:

The default row cache is off the JVM heap, have you changed to the 
ConcurrentLinkedHashCacheProvider ?


ConcurrentLinkedHashCacheProvider was removed in 2.0.x.


Re: Cass 2.0.0: Extensive memory allocation when row_cache enabled

2013-11-06 Thread Chris Burroughs
Both caches involve several objects per entry (What do we want?  Packed 
objects.  When do we want them? Now!).  The "size" is an estimate of the 
off heap values only and not the total size nor number of entries.


An acceptable size will depend on your data and access patterns.  In one 
case we had a cluster that at 512mb would go into a GC death spiral 
despite plenty of free heap (presumably just due to the number of 
objects) while empirically the cluster runs smoothly at 384mb.


Your caches appear on the larger size, I suggest trying smaller values 
and only increase when it produces measurable sustained gains.


On 11/05/2013 04:04 AM, Jiri Horky wrote:

Hi there,

we are seeing extensive memory allocation leading to quite long and
frequent GC pauses when using row cache. This is on cassandra 2.0.0
cluster with JNA 4.0 library with following settings:

key_cache_size_in_mb: 300
key_cache_save_period: 14400
row_cache_size_in_mb: 1024
row_cache_save_period: 14400
commitlog_sync: periodic
commitlog_sync_period_in_ms: 1
commitlog_segment_size_in_mb: 32

-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms10G -Xmx10G
-Xmn1024M -XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/data2/cassandra-work/instance-1/cassandra-1383566283-pid1893.hprof
-Xss180k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly -XX:+UseTLAB -XX:+UseCondCardMark

We have disabled row cache on one node to see the difference. Please
see attached plots from visual VM, I think that the effect is quite
visible. I have also taken 10x "jmap -histo" after 5s on an affected
server and plotted the result, attached as well.

I have taken a dump of the application when the heap size was 10GB, most
of the memory was unreachable, which was expected. The majority was used
by 55-59M objects of HeapByteBuffer, byte[] and
org.apache.cassandra.db.Column classes. I also include a list of inbound
references to the HeapByteBuffer objects from which it should be visible
where they are being allocated. This was acquired using Eclipse MAT.

Here is the comparison of GC times when row cache enabled and disabled:

prg01 - row cache enabled
   - uptime 20h45m
   - ConcurrentMarkSweep - 11494686ms
   - ParNew - 14690885 ms
   - time spent in GC: 35%
prg02 - row cache disabled
   - uptime 23h45m
   - ConcurrentMarkSweep - 251ms
   - ParNew - 230791 ms
   - time spent in GC: 0.27%

I would be grateful for any hints. Please let me know if you need any
further information. For now, we are going to disable the row cache.

Regards
Jiri Horky
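The reported "time spent in GC" percentages are consistent with the uptimes and collector totals quoted above; a quick sanity check of the arithmetic (numbers copied verbatim from the report, Python used only as a calculator):

```python
# Sanity-check the reported "time spent in GC" figures for the two nodes.
def gc_fraction(uptime_ms, *collector_ms):
    """Fraction of wall-clock uptime spent in GC pauses."""
    return sum(collector_ms) / uptime_ms

# prg01: 20h45m uptime, row cache enabled (CMS + ParNew totals in ms)
prg01 = gc_fraction((20 * 60 + 45) * 60 * 1000, 11494686, 14690885)
# prg02: 23h45m uptime, row cache disabled
prg02 = gc_fraction((23 * 60 + 45) * 60 * 1000, 251, 230791)

print(f"prg01: {prg01:.0%}, prg02: {prg02:.2%}")  # prg01: 35%, prg02: 0.27%
```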





Re: [RELEASE] Apache Cassandra 2.0.2 released

2013-10-29 Thread Chris Burroughs

On 10/28/2013 06:20 AM, Sylvain Lebresne wrote:

[2]:http://goo.gl/uEtkmb  (NEWS.txt)


https://wiki.apache.org/cassandra/Metrics has been updated with a 
reference to the new Configurable metrics reporting.


Re: How to use Cassandra on-node storage engine only?

2013-10-23 Thread Chris Burroughs
As far as I know this had not been done before.  I would be interested 
in hearing how it turned out.


On 10/23/2013 09:47 AM, Yasin Celik wrote:

I am developing an application for data storage. All the replication,
routing and data retrieving types of business are handled in my
application. Up to now, the data is stored in memory. Now, I want to use
Cassandra storage engine to flush data from memory into hard drive. I am
not sure if that is a correct approach.

My question: Can I use the Cassandra data storage engine only? I do not
want to use Cassandra as a whole standalone product (In this case, I should
run one independent Cassandra per node and my application act as if it is
client of Cassandra. This idea will put a lot of burden on node since it
puts unnecessary levels between my application and storage engine).

I have my own replication, ring and routing code. I only need the on-node
storage facilities of Cassandra. I want to embed cassandra in my
application as a library.





Re: Huge multi-data center latencies

2013-10-23 Thread Chris Burroughs

On 10/21/2013 07:03 PM, Hobin Yoon wrote:

Another question is how do you get the local DC name?



Have a look at org.apache.cassandra.db.EndpointSnitchInfo.getDatacenter



Re: nodetool status reporting dead node as UN

2013-10-23 Thread Chris Burroughs
When debugging gossip related problems (is this node really 
down/dead/in-some-weird-state) you might have better luck looking at 
`nodetool gossipinfo`.  The "UN even though everything is bad" thing 
might be https://issues.apache.org/jira/browse/CASSANDRA-5913


I'm not sure what exactly what happened in your case.  I'm also confused 
why an IP changed on restart.


On 10/17/2013 06:12 PM, Philip Persad wrote:

Hello,

I seem to have gotten my cluster into a bit of a strange state.
Pardon the rather verbose email, but there is a fair amount of
background.  I'm running a 3 node Cassandra 2.0.1 cluster.  This
particular cluster is used only rather intermittently for dev/testing
and does not see particularly heavy use, it's mostly a catch-all
cluster for environments which don't have a dedicated cluster to
themselves.  I noticed today that one of the nodes had died because
nodetool repair was failing due to a down replica.  I run nodetool
status and sure enough, one of my nodes shows up as down.

When I looked on the actual box, the cassandra process was up and
running and everything in the logs looked sensible.  The most
controversial thing I saw was 1 CMS Garbage Collection per hour, each
taking ~250 ms.  None the less, the node was not responding, so I
restarted it.  So far so good, everything is starting up, my ~30
column families across ~6 key spaces are all initializing.  The node
then handshakes with my other two nodes and reports them both as up.
Here is where things get strange.  According to the logs on the other
two nodes, the third node has come back up and all is well.  However
in the third node, I see a wall of the following in the logs (IP
addresses masked):

  INFO [GossipTasks:1] 2013-10-17 20:22:25,652 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
  INFO [GossipTasks:1] 2013-10-17 20:22:25,653 Gossiper.java (line 806)
InetAddress /x.x.x.221 is now DOWN
  INFO [HANDSHAKE-/10.21.5.222] 2013-10-17 20:22:25,655
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.222
  INFO [RequestResponseStage:3] 2013-10-17 20:22:25,658 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
  INFO [GossipTasks:1] 2013-10-17 20:22:26,654 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
  INFO [HANDSHAKE-/10.21.5.222] 2013-10-17 20:22:26,657
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.222
  INFO [RequestResponseStage:4] 2013-10-17 20:22:26,660 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
  INFO [RequestResponseStage:3] 2013-10-17 20:22:26,660 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
  INFO [GossipTasks:1] 2013-10-17 20:22:27,655 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
  INFO [HANDSHAKE-/10.21.5.222] 2013-10-17 20:22:27,660
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.222
  INFO [RequestResponseStage:4] 2013-10-17 20:22:27,662 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
  INFO [RequestResponseStage:3] 2013-10-17 20:22:27,662 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
  INFO [HANDSHAKE-/10.21.5.221] 2013-10-17 20:22:28,254
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.221
  INFO [GossipTasks:1] 2013-10-17 20:22:28,657 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
  INFO [RequestResponseStage:4] 2013-10-17 20:22:28,660 Gossiper.java
(line 789) InetAddress /x.x.x.221 is now UP
  INFO [RequestResponseStage:3] 2013-10-17 20:22:28,660 Gossiper.java
(line 789) InetAddress /x.x.x.221 is now UP
  INFO [HANDSHAKE-/10.21.5.222] 2013-10-17 20:22:28,661
OutboundTcpConnection.java (line 386) Handshaking version with
/x.x.x.222
  INFO [RequestResponseStage:4] 2013-10-17 20:22:28,663 Gossiper.java
(line 789) InetAddress /x.x.x.222 is now UP
  INFO [GossipTasks:1] 2013-10-17 20:22:29,658 Gossiper.java (line 806)
InetAddress /x.x.x.222 is now DOWN
  INFO [GossipTasks:1] 2013-10-17 20:22:29,660 Gossiper.java (line 806)
InetAddress /x.x.x.221 is now DOWN

Additional, client requests to the cluster at consistency QUORUM start
failing (saying 2 responses were required but only 1 replica
responded).  According to nodetool status, all the nodes are up.

This is clearly not good.  I take down the problem node.  Nodetool
reports it down and QUORUM client reads/writes start working again.
In an attempt to get the cluster back into a good state, I delete all
the data on the problem node and then bring it back up.  The other two
nodes log a changed host ID for the IP of the node I wiped and then
handshake with it.  The problem node also comes up, but reads/writes
start failing again with the same error.

I decide to take the problem node down again.  However this time, even
after the process is dead, nodetool and the other two nodes report
that my third node is still up and requests to the cluster continue to
fail.  Running nodetool status against either of the live nodes shows
that all nodes are up.  Running nodetool status against 

Re: The performance difference of online bulk insertion and the file-based bulk loading

2013-10-23 Thread Chris Burroughs

On 10/15/2013 08:41 AM, José Elias Queiroga da Costa Araújo wrote:

- is that is there a way that we can warm-up the cache, after the
file-based bulk loading, so that we can allow the data to be cached first
in the memory, and then afterwards, when we issue the bulk retrieval, the
performance can be closer to what is provided by the online-bulk-insertion.


Somewhat hacky, but you can at least warm up the OS page cache with `cat 
FILES > /dev/null`
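For completeness, the same trick done programmatically: read every data file once and discard the bytes, pulling their pages into the OS cache. A minimal sketch; the sstable glob in the comment is an assumption about a typical layout, not a path from this thread.

```python
# Warm the OS page cache by reading files once and discarding the bytes --
# the programmatic equivalent of `cat FILES > /dev/null`.
import glob

def warm(paths, chunk=1 << 20):
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            while True:
                buf = f.read(chunk)  # each read pulls pages into the cache
                if not buf:
                    break
                total += len(buf)
    return total  # bytes touched

# e.g. warm(glob.glob("/var/lib/cassandra/data/Keyspace1/*-Data.db"))
```

Note this only warms the kernel page cache, not Cassandra's key/row caches.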


vnode + multi dc migration

2013-10-11 Thread Chris Burroughs
I know there is a good deal of interest [1] on feasible methods for 
enabling vnodes on clusters that did not start with them.


We recently completed a migration from a production cluster not using 
vnodes and in a single DC to one using vnodes in two DCs.  We used the 
"just spin up a new DC and rebuild" strategy instead of shuffle and it 
worked.  The checklist was long but it really wasn't more complicated 
than that.  Thanks to several people in #cassandra for suggesting the 
technique and reviewing procedures.


One oddity we noticed is that when nodes in a new DC join 
(auto_bootstrap:false) CL.ONE performance tanked [2].  The spike is when 
the nodes came online, and the drop is when reads were switched to 
CL.LOCAL_QUORUM.  This only happened when the new DC was cross-continent 
(not a logical DC in the same colo).


[1] 
http://mail-archives.apache.org/mod_mbox/cassandra-user/201308.mbox/%3CCAEDUwd12vhRJbPZpVJ6QzTOx3pwU=11hhgkkipghhgvosbj...@mail.gmail.com%3E


[2] http://i.imgur.com/ZW5Ob8V.png


Re: Multi-dc restart impact

2013-10-10 Thread Chris Burroughs

Thanks, double checked; reads are CL.ONE.

On 10/10/2013 11:15 AM, J. Ryan Earl wrote:

Are you doing QUORUM reads instead of LOCAL_QUORUM reads?


On Wed, Oct 9, 2013 at 7:41 PM, Chris Burroughs
wrote:


I have not been able to do the test with the 2nd cluster, but have been
given a disturbing data point.  We had a disk slowly fail causing a
significant performance degradation that was only resolved when the "sick"
node was killed.
  * Perf in DC w/ sick disk: http://i.imgur.com/W1I5ymL.png?1
  * perf in other DC: http://i.imgur.com/gEMrLyF.png?1

Not only was a single slow node able to cause an order of magnitude
performance hit in a dc, but the other dc fared *worse*.


On 09/18/2013 08:50 AM, Chris Burroughs wrote:


On 09/17/2013 04:44 PM, Robert Coli wrote:


On Thu, Sep 5, 2013 at 6:14 AM, Chris Burroughs
wrote:

  We have a 2 DC cluster running cassandra 1.2.9.  They are in actual

physically separate DCs on opposite coasts of the US, not just logical
ones.  The primary use of this cluster is CL.ONE reads out of a single
column family.  My expectation was that in such a scenario restarts
would
have minimal impact in the DC where the restart occurred, and no
impact in
the remote DC.

We are seeing instead that restarts in one DC have a dramatic impact on
performance in the other (let's call them DCs "A" and "B").



Did you end up filing a JIRA on this, or some other outcome?

=Rob




No.  I am currently in the process of taking a 2nd cluster from being
single to dual DC.  Once that is done I was going to repeat the test
with each cluster and gather as much information as reasonable.










Re: Multi-dc restart impact

2013-10-09 Thread Chris Burroughs
I have not been able to do the test with the 2nd cluster, but have been 
given a disturbing data point.  We had a disk slowly fail causing a 
significant performance degradation that was only resolved when the 
"sick" node was killed.

 * Perf in DC w/ sick disk: http://i.imgur.com/W1I5ymL.png?1
 * perf in other DC: http://i.imgur.com/gEMrLyF.png?1

Not only was a single slow node able to cause an order of magnitude 
performance hit in a dc, but the other dc fared *worse*.



On 09/18/2013 08:50 AM, Chris Burroughs wrote:

On 09/17/2013 04:44 PM, Robert Coli wrote:

On Thu, Sep 5, 2013 at 6:14 AM, Chris Burroughs
wrote:


We have a 2 DC cluster running cassandra 1.2.9.  They are in actual
physically separate DCs on opposite coasts of the US, not just logical
ones.  The primary use of this cluster is CL.ONE reads out of a single
column family.  My expectation was that in such a scenario restarts
would
have minimal impact in the DC where the restart occurred, and no
impact in
the remote DC.

We are seeing instead that restarts in one DC have a dramatic impact on
performance in the other (let's call them DCs "A" and "B").



Did you end up filing a JIRA on this, or some other outcome?

=Rob




No.  I am currently in the process of taking a 2nd cluster from being
single to dual DC.  Once that is done I was going to repeat the test
with each cluster and gather as much information as reasonable.




gossip settling and bootstrap problems

2013-10-07 Thread Chris Burroughs
I've been running into a variety of tricky to diagnose problems recently 
that could be summarized as "bootstrap & related tasks fail without 
extra hacky sleep time".


This is a sample edited log file for bootstrapping a node that captures 
the general dynamics: http://pastebin.com/yeN9USLt  This build has been 
modified (from 1.2.10) to sleep 4*RING_DELAY in 
StorageService.bootstrap().  A few notes:

 * At 30s nodes are still flapping UP and DOWN
 * handshaking is still going strong at 90s
 * Things do stabilize; they don't flap indefinitely
 * Bootstrap succeeds once it starts.  In this particular cluster a 
default RING_DELAY/build (30s) fails every time.


Ping times, TCP retransmit, and other general network stuff look fine. 
There are several different tickets (some from me) that reference what 
seemed to me to be possibly similar or at least correlated issues:
 * CASSANDRA-4288 : prevent thrift server from starting before gossip 
has settled

 * CASSANDRA-5815 : NPE from migration manager
 * CASSANDRA-5915 : node flapping prevents replace_node from succeeding 
consistently
 * CASSANDRA-6156 : Poor resilience and recovery for bootstrapping node 
- "unable to fetch range"

 * CASSANDRA-6127 : vnodes don't scale to hundreds of nodes

I suspect that a combination of factors is causing gossip to take longer 
to stabilize:

 * vnodes
 * (cross country or greater) multi-dc
 * bigger than a test cluster (> 50 nodes)
 * reconnecting snitch

What are other people seeing in their clusters?  Does anyone routinely 
change RING_DELAY (google finds precious few references)?


Re: Nodes separating from the ring

2013-09-23 Thread Chris Burroughs
I have observed one problem with an inconsistent ring that is 
superficially similar (node thinks it's up but peers disagree) and noted 
details in CASSANDRA-6082.  However, it does not sound like the details 
of either the symptoms, or the resolution match what you describe.


If you have not already, running `nodetool gossipinfo` might give you 
more clues than `status`.


On 09/13/2013 10:48 AM, Dave Cowen wrote:

Hi, all -

We've been running Cassandra 1.1.12 in production since February, and have
experienced a vexing problem with an arbitrary node "falling out of" or
separating from the ring on occasion.

When a node "falls out" of the ring, running nodetool ring on the
misbehaving node shows that the misbehaving node believes that it is Up, but
that the rest of the ring is Down, and the rest of the ring has question
marks listed for load. nodetool ring on any of the other nodes, however,
shows the misbehaving node as Down but everything else is up.

Shutting down and restarting the misbehaving node does not result in
changed behavior. We can only get the misbehaving node to rejoin the ring
by shutting it down, running nodetool removetoken 
and nodetool removetoken force elsewhere in the ring. After the node's
token has been removed from the ring, it will rejoin and behave normally
when it is restarted.

This is not a frequent occurrence - we can go months between this
happening. It most commonly occurs when a different node is brought down
and then back up, but it can happen spontaneously. This is also not
associated with a network connectivity event; we've seen no interruption in
the nodes being able to communicate over the network. As above, it's also
not isolated to a single node; we've seen this behavior on multiple nodes.

This has occurred with both the identical seeds specified in cassandra.yaml
on each node, and also when we remove the node from its own seed list (so
any seed won't try to auto-bootstrap from itself). Seeds have always been
up and available.

Has anyone else seen similar behavior? For obvious reasons, we hate seeing
one of the nodes suddenly "fall out" and require intervention when we flap
another node, or for no reason at all.

Thanks,

Dave





Re: I don't understand shuffle progress

2013-09-18 Thread Chris Burroughs

http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/operations/ops_add_dc_to_cluster_t.html

This is a basic outline.


On 09/18/2013 10:32 AM, Juan Manuel Formoso wrote:

I really like this idea. I can create a new cluster and have it replicate
the old one, after it finishes I can remove the original.

Any good resource that explains how to add a new datacenter to a live
single dc cluster that anybody can recommend?


On Wed, Sep 18, 2013 at 9:58 AM, Chris Burroughs
wrote:


On 09/17/2013 09:41 PM, Paulo Motta wrote:


So you're saying the only feasible way of enabling VNodes on an upgraded
C*
1.2 is by doing fork writes to a brand new cluster + bulk load of sstables
from the old cluster? Or is it possible to succeed on shuffling, even if
that means waiting some weeks for the shuffle to complete?



In a multi "DC" cluster situation you *should* be able to bring up a new
DC with vnodes, bootstrap it, and then decommission the old cluster.









Re: I don't understand shuffle progress

2013-09-18 Thread Chris Burroughs

On 09/17/2013 09:41 PM, Paulo Motta wrote:

So you're saying the only feasible way of enabling VNodes on an upgraded C*
1.2 is by doing fork writes to a brand new cluster + bulk load of sstables
from the old cluster? Or is it possible to succeed on shuffling, even if
that means waiting some weeks for the shuffle to complete?


In a multi "DC" cluster situation you *should* be able to bring up a new 
DC with vnodes, bootstrap it, and then decommission the old cluster.


Re: Multi-dc restart impact

2013-09-18 Thread Chris Burroughs

On 09/17/2013 04:44 PM, Robert Coli wrote:

On Thu, Sep 5, 2013 at 6:14 AM, Chris Burroughs
wrote:


We have a 2 DC cluster running cassandra 1.2.9.  They are in actual
physically separate DCs on opposite coasts of the US, not just logical
ones.  The primary use of this cluster is CL.ONE reads out of a single
column family.  My expectation was that in such a scenario restarts would
have minimal impact in the DC where the restart occurred, and no impact in
the remote DC.

We are seeing instead that restarts in one DC have a dramatic impact on
performance in the other (let's call them DCs "A" and "B").



Did you end up filing a JIRA on this, or some other outcome?

=Rob




No.  I am currently in the process of taking a 2nd cluster from being 
single to dual DC.  Once that is done I was going to repeat the test 
with each cluster and gather as much information as reasonable.


Multi-dc restart impact

2013-09-05 Thread Chris Burroughs
We have a 2 DC cluster running cassandra 1.2.9.  They are in actual 
physically separate DCs on opposite coasts of the US, not just logical 
ones.  The primary use of this cluster is CL.ONE reads out of a single 
column family.  My expectation was that in such a scenario restarts 
would have minimal impact in the DC where the restart occurred, and no 
impact in the remote DC.


We are seeing instead that restarts in one DC have a dramatic impact on 
performance in the other (let's call them DCs "A" and "B").


Test scenario on a node in DC "A":
 * disablegossip: no change
 * drain: no change
 * stop node: no change
 * start node again: Large increase in latency in both DCs A *and* B

This is a graph showing the increase in latency 
(org.apache.cassandra.metrics.ClientRequest.Latency.Read.95percentile) 
from DC *B* http://i.imgur.com/OkIQyXI.png  (Actual clients report 
similar numbers that agree with this server side measurement).  Latency 
jumps by over an order of magnitude and out of SLAs.  (I would prefer 
restarting to not cause a latency spike in either DC, but the one 
induced in the remote DC is particularly concerning.)


However, the node that was restarted reports only a minor increase in 
latency http://i.imgur.com/KnGEJrE.png  This is confusing from several 
different angles:

 * I would not expect any cross-dc reads to normally be occurring
 * If there were cross DC reads, they would take 50+ ms instead of < 5 
ms normally reported
 * If the node that was restarted was still somehow involved it reads, 
it's reporting shows it can only account for a small amount of the 
latency increase.


Some possible relevant configurations:
 * GossipingPropertyFileSnitch
 * dynamic_snitch_update_interval_in_ms: 100
 * dynamic_snitch_reset_interval_in_ms: 60
 * dynamic_snitch_badness_threshold: 0.1
 * read_repair_chance=0.01 and dclocal_read_repair_chance=0.1 (same 
type of behavior was observed with just read_repair_chance=0.1)


Has anyone else observed similar behavior and found a way to limit it? 
This seems like something that ought not to happen but without knowing 
why it is occurring I'm not sure how to stop it.




Re: row cache

2013-09-03 Thread Chris Burroughs

On 09/01/2013 03:06 PM, Faraaz Sareshwala wrote:

Yes, that is correct.

The SerializingCacheProvider stores row cache contents off heap. I believe you
need JNA enabled for this though. Someone please correct me if I am wrong here.

The ConcurrentLinkedHashCacheProvider stores row cache contents on the java heap
itself.



Naming things is hard.  Both caches are in memory and are backed by a 
ConcurrentLinkedHashMap.  In the case of the SerializingCacheProvider 
the *values* are stored in off heap buffers.  Both must store a half 
dozen or so objects (on heap) per entry 
(org.apache.cassandra.cache.RowCacheKey, 
com.googlecode.concurrentlinkedhashmap.ConcurrentLinkedHashMap$WeightedValue, 
java.util.concurrent.ConcurrentHashMap$HashEntry, etc).  It would 
probably be better to call this a "mixed-heap" rather than off-heap 
cache.  You may find the number of entries you can hold without gc 
problems to be surprisingly low (relative to say memcached, or physical 
memory on modern hardware).
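The per-entry overhead point can be illustrated with a rough analogy in another runtime (Python object sizes differ from the JVM's; this is only to show that with small cached values, per-entry bookkeeping rather than payload size dominates memory use):

```python
# Rough illustration: even a cache whose *values* live off heap still pays
# several objects of per-entry bookkeeping. Compare a map's payload bytes
# against its structural cost for many small entries.
import sys

def payload_bytes(d):
    # Shallow sizes of keys and values only (no container overhead).
    return sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in d.items())

cache = {f"row:{i}": b"v" * 16 for i in range(10_000)}  # tiny 16-byte values
container = sys.getsizeof(cache)   # hash table itself, excluding entries
payload = payload_bytes(cache)

# The raw values are 10_000 * 16 = 160 KB, but the measured footprint is
# several times larger -- entry *count*, not value size, drives the cost.
print(container, payload)
```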


Invalidating a column with SerializingCacheProvider invalidates the 
entire row while with ConcurrentLinkedHashCacheProvider it does not. 
SerializingCacheProvider does not require JNA.


Both also use memory estimation of the size (of the values only) to 
determine the total number of entries retained.  Estimating the size of 
the totally on-heap ConcurrentLinkedHashCacheProvider has historically 
been dicey since we switched from sizing in entries, and it has been 
removed in 2.0.0.


As said elsewhere in this thread the utility of the row cache varies 
from "absolutely essential" to "source of numerous problems" depending 
on the specifics of the data model and request distribution.





multi-dc clusters with 'local' ips and no vpn

2013-06-17 Thread Chris Burroughs
Cassandra makes the totally reasonable assumption that the entire
cluster is in one routable address space.  We unfortunately had a
situation where:
 * nodes can talk to each other in the same dc on an internal address,
but not talk to each other over their external 1:1 NAT address.
 * nodes can talk to nodes in the other dc over the external address,
but there is no usable shared internal address space they can talk over

In case anyone else finds themselves in the same situation we have what
we think is a working solution in pre-production.  CASSANDRA-5630
handles the "reconnect trick" to prefer the local ip when in the same
DC.  And some iptables rules allow the local nodes to do the initial
gossiping with each other before that switch.

for each node in same dc:
'iptables -t nat -A OUTPUT -j DNAT -p tcp --dst %s --dport 7000 -o
eth0  --to-destination %s' % (ext_ip, local_ip)
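Expanding the format string above into a runnable sketch, assuming a hypothetical list of (external, internal) address pairs for the nodes in the same DC (the example IPs are placeholders, not addresses from this deployment):

```python
# Generate one DNAT rule per same-DC peer: rewrite the peer's external 1:1
# NAT address to its internal address so initial gossip on port 7000 stays
# on the local network, matching the rule template from the post.
RULE = ("iptables -t nat -A OUTPUT -j DNAT -p tcp --dst %s --dport 7000"
        " -o eth0  --to-destination %s")

def dnat_rules(peers):
    return [RULE % (ext_ip, local_ip) for ext_ip, local_ip in peers]

for rule in dnat_rules([("203.0.113.10", "10.0.0.10"),
                        ("203.0.113.11", "10.0.0.11")]):
    print(rule)
```

The generated lines would be run (or templated into config management) on each node for its same-DC peers.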


Cassandra DC Meetup: Cassandra on flash storage

2013-02-19 Thread Chris Burroughs
http://www.meetup.com/Cassandra-DC-Meetup/events/104345302/

This month we will have a presentation by our very own Matt Kennedy
about running Cassandra on super fancy flash. If you are in the DC are
we would love to see you stop by.


SurgeCon 2012

2012-09-05 Thread Chris Burroughs
Surge [1] is scalability focused conference in late September hosted in
Baltimore.  It's a pretty cool conference with a good mix of
operationally minded people interested in scalability, distributed
systems, systems level performance and good stuff like that.  You should
go! [2]

For those of you who like historical trivia, Mike Malone gave a
well-received Cassandra talk at the first SurgeCon in 2010 [3].

This year there is organised room for BoFs, with several one-hour slots
on Wednesday and Thursday evenings between 9 p.m. and midnight.  Last
year a few of us got together informally around lunch time [4].

Interested in getting together again this year?  Think we have critical
mass for a BoF?

[1] http://omniti.com/surge/2012

[2] http://omniti.com/surge/2012/register

[3] http://omniti.com/surge/2010/speakers/mike-malone

[4]
http://mail-archives.apache.org/mod_mbox/cassandra-user/201109.mbox/%3c4e82140a.5070...@gmail.com%3E


Re: Distinct Counter Proposal for Cassandra

2012-06-29 Thread Chris Burroughs
Well I obviously think it would be handy.  If this gets proposed and
ends up using stream-lib don't be shy about asking for help.

On a more general note, it would be great to see the special case
Counter code become more general atomic operation code.

On 06/13/2012 01:15 PM, Utku Can Topçu wrote:
> Hi Yuki,
> 
> I think I should have used the word discussion instead of proposal for the
> mailing subject. I have quite some of a design in my mind but I think it's
> not yet ripe enough to formalize. I'll try to simplify it and open a Jira
> ticket.
> But first I'm wondering if there would be any excitement in the community
> for such a feature.
> 
> Regards,
> Utku
> 
> On Wed, Jun 13, 2012 at 7:00 PM, Yuki Morishita  wrote:
> 
>> You can open JIRA ticket at
>> https://issues.apache.org/jira/browse/CASSANDRA with your proposal.
>>
>> Just for the input:
>>
>> I had once implemented HyperLogLog counter to use internally in Cassandra,
>> but it turned out I didn't need it so I just put it to gist. You can find
>> it here: https://gist.github.com/2597943
>>
>> The above implementation and most of the other ones (including stream-lib)
>> implement the optimized version of the algorithm which counts up to 10^9,
>> so may need some work.
>>
>> Other alternative is self-learning bitmap (
>> http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my
>> understanding, is more memory efficient when counting small values.
>>
>> Yuki
>>
>> On Wednesday, June 13, 2012 at 11:28 AM, Utku Can Topçu wrote:
>>
>> Hi All,
>>
>> Let's assume we have a use case where we need to count the number of
>> columns for a given key. Let's say the key is the URL and the column-name
>> is the IP address or any cardinality identifier.
>>
>> The straight forward implementation seems to be simple, just inserting the
>> IP Adresses as columns under the key defined by the URL and using get_count
>> to count them back. However the problem here is in case of large rows
>> (where too many IP addresses are in); the get_count method has to
>> de-serialize the whole row and calculate the count. As also defined in the
>> user guides, it's not an O(1) operation and it's quite costly.
>>
>> However, this problem seems to have better solutions if you don't have a
>> strict requirement for the count to be exact. There are streaming
>> algorithms that will provide good cardinality estimations within a
>> predefined failure rate, I think the most popular one seems to be the
>> (Hyper)LogLog algorithm, also there's an optimal one developed recently,
>> please check http://dl.acm.org/citation.cfm?doid=1807085.1807094
>>
>> If you want to take a look at the Java implementation for LogLog,
>> Clearspring has both LogLog and space optimized HyperLogLog available at
>> https://github.com/clearspring/stream-lib
>>
>> I don't see a reason why this can't be implemented in Cassandra. The
>> distributed nature of all these algorithms can easily be adapted to
>> Cassandra's model. I think most of us would love to see some cardinality
>> estimating columns in Cassandra.
>>
>> Regards,
>> Utku
>>
>>
>>
> 



Re: Distinct Counter Proposal for Cassandra

2012-06-29 Thread Chris Burroughs
On 06/13/2012 01:00 PM, Yuki Morishita wrote:
> The above implementation and most of the other ones (including stream-lib) 
> implement the optimized version of the algorithm which counts up to 10^9, so 
> may need some work.
> 
> Other alternative is self-learning bitmap 
> (http://ect.bell-labs.com/who/aychen/sbitmap4p.pdf) which, in my 
> understanding, is more memory efficient when counting small values.

The closest we could get to a one-size fits all would probably be an
adaptive counting scheme that uses linear counting (or self-learning
bitmap, didn't know about that one!) for small expected cardinalities
and a LogLog variant for higher ones.  It's more choices to make, but
choosing between "not too big" and "really really big" doesn't seem like
an unreasonable burden to me.
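For illustration, a minimal Python sketch of the two estimators being discussed: linear counting for small cardinalities and a HyperLogLog-style register array for large ones. This is a toy, not the stream-lib or paper implementations; the hash function is a stand-in.

```python
import hashlib
import math

def h64(item):
    """Stable 64-bit hash (stand-in for a proper murmur-style hash)."""
    return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")

class LinearCounter:
    """Linear counting: accurate while the bitmap is sparsely filled."""
    def __init__(self, m=1024):
        self.m = m
        self.bitmap = [False] * m

    def add(self, item):
        self.bitmap[h64(item) % self.m] = True

    def estimate(self):
        zeros = self.bitmap.count(False)
        if zeros == 0:
            return float("inf")  # saturated: time to switch estimators
        return self.m * math.log(self.m / zeros)

class HyperLogLog:
    """HyperLogLog: fixed memory, roughly 1.04/sqrt(m) relative error."""
    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        x = h64(item)
        idx = x >> (64 - self.p)                      # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)         # remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading-zero rank
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        return alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
```

An adaptive scheme would start with the linear counter and hand off to the LogLog variant once the bitmap's fill fraction gets high.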


Re: Row caching in Cassandra 1.1 by column family

2012-06-18 Thread Chris Burroughs
Check out the "rows_cached" CF attribute.

On 06/18/2012 06:01 PM, Oleg Dulin wrote:
> Dear distinguished colleagues:
> 
> I don't want all of my CFs cached, but one in particular I do.
> 
> How can I configure that ?
> 
> Thanks,
> Oleg
> 



Re: 1.0.3 CLI oddities

2011-12-11 Thread Chris Burroughs
Sounds like https://issues.apache.org/jira/browse/CASSANDRA-3558 and the
other tickets reference there.

On 11/28/2011 05:05 AM, Janne Jalkanen wrote:
> Hi!
> 
> (Asked this on IRC too, but didn't get anyone to respond, so here goes...)
> 
> Is it just me, or are these real bugs? 
> 
> On 1.0.3, from CLI: "update column family XXX with gc_grace = 36000;" just 
> says "null" with nothing logged.  Previous value is the default.
> 
> Also, on 1.0.3, "update column family XXX with 
> compression_options={sstable_compression:SnappyCompressor,chunk_length_kb:64};"
>  returns "Internal error processing system_update_column_family" and log says 
> "Invalid negative or null chunk_length_kb" (stack trace below)
> 
> Setting the compression options worked on 1.0.0 when I was testing (though my 
> 64 kB became 64 MB, but I believe this was fixed in 1.0.3.)
> 
> Did the syntax change between 1.0.0 and 1.0.3? Or am I doing something wrong? 
> 
> The database was upgraded from 0.6.13 to 1.0.0, then scrubbed, then 
> compression options set to some CFs, then upgraded to 1.0.3 and trying to set 
> compression on other CFs.
> 
> Stack trace:
> 
> ERROR [pool-2-thread-68] 2011-11-28 09:59:26,434 Cassandra.java (line 4038) 
> Internal error processing system_update_column_family
> java.lang.RuntimeException: java.util.concurrent.ExecutionException: 
> java.io.IOException: org.apache.cassandra.config.ConfigurationException: 
> Invalid negative or null chunk_length_kb
>   at 
> org.apache.cassandra.thrift.CassandraServer.applyMigrationOnStage(CassandraServer.java:898)
>   at 
> org.apache.cassandra.thrift.CassandraServer.system_update_column_family(CassandraServer.java:1089)
>   at 
> org.apache.cassandra.thrift.Cassandra$Processor$system_update_column_family.process(Cassandra.java:4032)
>   at 
> org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
>   at 
> org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:680)
> Caused by: java.util.concurrent.ExecutionException: java.io.IOException: 
> org.apache.cassandra.config.ConfigurationException: Invalid negative or null 
> chunk_length_kb
>   at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:83)
>   at 
> org.apache.cassandra.thrift.CassandraServer.applyMigrationOnStage(CassandraServer.java:890)
>   ... 7 more
> Caused by: java.io.IOException: 
> org.apache.cassandra.config.ConfigurationException: Invalid negative or null 
> chunk_length_kb
>   at 
> org.apache.cassandra.db.migration.UpdateColumnFamily.applyModels(UpdateColumnFamily.java:78)
>   at org.apache.cassandra.db.migration.Migration.apply(Migration.java:156)
>   at 
> org.apache.cassandra.thrift.CassandraServer$2.call(CassandraServer.java:883)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>   ... 3 more
> Caused by: org.apache.cassandra.config.ConfigurationException: Invalid 
> negative or null chunk_length_kb
>   at 
> org.apache.cassandra.io.compress.CompressionParameters.validateChunkLength(CompressionParameters.java:167)
>   at 
> org.apache.cassandra.io.compress.CompressionParameters.create(CompressionParameters.java:52)
>   at org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:796)
>   at 
> org.apache.cassandra.db.migration.UpdateColumnFamily.applyModels(UpdateColumnFamily.java:74)
>   ... 7 more
> ERROR [MigrationStage:1] 2011-11-28 09:59:26,434 AbstractCassandraDaemon.java 
> (line 133) Fatal exception in thread Thread[MigrationStage:1,5,main]
> java.io.IOException: org.apache.cassandra.config.ConfigurationException: 
> Invalid negative or null chunk_length_kb
>   at 
> org.apache.cassandra.db.migration.UpdateColumnFamily.applyModels(UpdateColumnFamily.java:78)
>   at org.apache.cassandra.db.migration.Migration.apply(Migration.java:156)
>   at 
> org.apache.cassandra.thrift.CassandraServer$2.call(CassandraServer.java:883)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:680)
> Caused by: org.apache.cassandra.config.ConfigurationException: Invalid 
> negative or null chunk_length_kb
>   at 
> org.apache.cassandra.io.compress.CompressionParameters.validateChunkLength(CompressionParameters.java:167)
>   at 

Re: Second Cassandra users survey

2011-11-14 Thread Chris Burroughs
 - It would be super cool if all of that counter work made it possible
to support other atomic data types (sets? CAS?  just pass an
associative/commutative Function to apply).
 - Again with types, pluggable type specific compression.
 - Wishy washy wish: simpler "elasticity".  I would like to go from
6-->8-->7 nodes without each of those being an annoying fight with tokens.
 - Gossip as library.  Gossip/failure detection is something C* seems to
have gotten particularly right (or at least it's something that has not
needed to change much).  It would be cool to use Cassandra's gossip
protocol as a distributed systems building tool a la ZooKeeper.

On 11/01/2011 06:59 PM, Jonathan Ellis wrote:
> Hi all,
> 
> Two years ago I asked for Cassandra use cases and feature requests.
> [1]  The results [2] have been extremely useful in setting and
> prioritizing goals for Cassandra development.  But with the release of
> 1.0 we've accomplished basically everything from our original wish
> list. [3]
> 
> I'd love to hear from modern Cassandra users again, especially if
> you're usually a quiet lurker.  What does Cassandra do well?  What are
> your pain points?  What's your feature wish list?
> 
> As before, if you're in stealth mode or don't want to say anything in
> public, feel free to reply to me privately and I will keep it off the
> record.
> 
> [1] 
> http://www.mail-archive.com/cassandra-dev@incubator.apache.org/msg01148.html
> [2] 
> http://www.mail-archive.com/cassandra-user@incubator.apache.org/msg01446.html
> [3] http://www.mail-archive.com/dev@cassandra.apache.org/msg01524.html
> 



Re: CMS GC initial-mark taking 6 seconds , bad?

2011-10-20 Thread Chris Burroughs
On 10/20/2011 09:38 AM, Maxim Potekhin wrote:
> I happen to have 48GB on each machines I use in the cluster. Can I
> assume that I can't really use all of this memory productively? Do you
> have any suggestion related to that? Can I run more than one instance on
> Cassandra on the same box (using different ports) to take advantage of
> this memory, assuming the disk has enough bandwidth?

You are likely to not have good luck with a JVM heap that large.  But
you can:
 - Leave all that memory to the OS page cache.
 - mmap index files
 - use an off heap cache

All of those are productive uses.
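In cassandra-env.sh terms, that advice for a 48 GB box might look like the following (values are illustrative examples, not a recommendation):

```shell
# Illustrative cassandra-env.sh settings for a 48 GB box: cap the JVM
# heap well below physical RAM and leave the rest to the OS page cache.
MAX_HEAP_SIZE="8G"
HEAP_NEWSIZE="800M"

# cassandra.yaml: mmap only the (small) index files, standard io for data:
#   disk_access_mode: mmap_index_only
```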


Re: ApacheCon meetup?

2011-10-12 Thread Chris Burroughs
On 10/11/2011 12:05 PM, Eric Evans wrote:
> Let's do it.  We can organize an official one, and still grab food
> together if that's not enough. :)

Great!  Thanks for putting this together.


ApacheCon meetup?

2011-10-04 Thread Chris Burroughs
ApacheCon NA is coming up next month.  I suspect there will be at least
a few Cassandra users there (yeah new release!).  Would anyone be
interested in getting together and sharing some stories?  This could
either be a "official" [1] meetup.  Or grabbing food together sometime.

[1] http://wiki.apache.org/apachecon/ApacheMeetupsNa11


Re: Surgecon Meetup?

2011-09-27 Thread Chris Burroughs
So it sounds like there are about a half dozen of us, some coming
Wednesday, others Thursday.  I'll have some Cassandra eye logos out
around lunch both of those days.  If that herds us together then
success!  If not I'll try something more formal.

Looking forward to meeting everyone.

On 09/25/2011 07:27 PM, Chris Burroughs wrote:
> Surge [1] is scalability focused conference in late September hosted in
> Baltimore.  It's a pretty cool conference with a good mix of
> operationally minded people interested in scalability, distributed
> systems, systems level performance and good stuff like that.  You should
> go! [2]
> 
> Anyway, I'll be there, and if any other Cassandra users are
> coming, I'm happy to help herd us towards meeting up, lunch, hacking,
> etc.  I *think* there might be some time for structured BoF type
> sessions as well.
> 
> 
> [1] http://omniti.com/surge/2011
> 
> [2] Actually tickets recently sold out, you should go in 2012!



Surgecon Meetup?

2011-09-25 Thread Chris Burroughs
Surge [1] is scalability focused conference in late September hosted in
Baltimore.  It's a pretty cool conference with a good mix of
operationally minded people interested in scalability, distributed
systems, systems level performance and good stuff like that.  You should
go! [2]

Anyway, I'll be there, and if any other Cassandra users are
coming, I'm happy to help herd us towards meeting up, lunch, hacking,
etc.  I *think* there might be some time for structured BoF type
sessions as well.


[1] http://omniti.com/surge/2011

[2] Actually tickets recently sold out, you should go in 2012!


Re: Survey: Cassandra/JVM Resident Set Size increase

2011-07-29 Thread Chris Burroughs
Thanks to everyone who responded (I think I learned a few new tricks
from seeing what you tried and how you monitor).  I didn't see any
patterns in JVM, OS, cassandra versions etc.

At this time I'm confident in saying CASSANDRA-2868 (and thus really
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=7066129) is the culprit.

On 07/12/2011 09:28 AM, Chris Burroughs wrote:
> ### Preamble
> 
> There have been several reports on the mailing list of the JVM running
> Cassandra using "too much" memory.  That is, the resident set size is
>>> (max java heap size + mmaped segments) and continues to grow until the
> process swaps, kernel oom killer comes along, or performance just
> degrades too far due to the lack of space for the page cache.  It has
> been unclear from these reports if there is a pattern.  My hope here is
> that by comparing JVM versions, OS versions, JVM configuration etc., we
> will find something.  Thank you everyone for your time.
> 
> 
> Some example reports:
>  - http://www.mail-archive.com/user@cassandra.apache.org/msg09279.html
>  -
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Very-high-memory-utilization-not-caused-by-mmap-on-sstables-td5840777.html
>  - https://issues.apache.org/jira/browse/CASSANDRA-2868
>  -
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/OOM-or-what-settings-to-use-on-AWS-large-td6504060.html
>  -
> http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-memory-problem-td6545642.html
> 
> For reference theories include (in no particular order):
>  - memory fragmentation
>  - JVM bug
>  - OS/glibc bug
>  - direct memory
>  - swap induced fragmentation
>  - some other bad interaction of cassandra/jdk/jvm/os/nio-insanity.
> 
> ### Survey
> 
> 1. Do you think you are experiencing this problem?
> 
> 2.  Why? (This is a good time to share a graph like
> http://www.twitpic.com/5fdabn or
> http://img24.imageshack.us/img24/1754/cassandrarss.png)
> 
> 3. Are you using mmap? (If yes be sure to have read
> http://wiki.apache.org/cassandra/FAQ#mmap , and explain how you have
> used pmap [or another tool] to rule out mmap and top deceiving you.)
> 
> 4. Are you using JNA?  Was mlockall successful (it's in the logs on startup)?
> 
> 5. Is swap enabled? Are you swapping?
> 
> 6. What version of Apache Cassandra are you using?
> 
> 7. What is the earliest version of Apache Cassandra you recall seeing
> this problem with?
> 
> 8. Have you tried the patch from CASSANDRA-2654 ?
> 
> 9. What jvm and version are you using?
> 
> 10. What OS and version are you using?
> 
> 11. What are your jvm flags?
> 
> 12. Have you tried limiting direct memory (-XX:MaxDirectMemorySize)?
> 
> 13. Can you characterise how much GC your cluster is doing?
> 
> 14. Approximately how many read/writes per unit time is your cluster
> doing (per node or the whole cluster)?
> 
> 15. How are your column families configured (key cache size, row cache
> size, etc.)?
> 



Re: cassandra server disk full

2011-07-29 Thread Chris Burroughs
On 07/25/2011 01:53 PM, Ryan King wrote:
> Actually I was wrong– our patch will disable gossip and thrift but
> leave the process running:
> 
> https://issues.apache.org/jira/browse/CASSANDRA-2118
> 
> If people are interested in that I can make sure its up to date with
> our latest version.

Thanks Ryan.

/me expresses interest.

Zombie nodes when the file system does something "interesting" are not fun.


Re: JNA to avoid swap but physical memory increase

2011-07-15 Thread Chris Burroughs
On 07/15/2011 07:24 AM, Daniel Doubleday wrote:
> Also our experience shows that the jna call does not prevent swapping so the 
> general advice is disable swap.

Can you confirm you don't get the (paraphrasing) "whoops we tried
mlockall but ulimits denied us" message on startup?
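A quick way to check whether mlockall could have succeeded (the user name and file paths below are assumptions):

```shell
# "max locked memory" must be large enough for mlockall to succeed;
# otherwise JNA's mlockall fails and Cassandra warns at startup.
ulimit -l

# Raise it in /etc/security/limits.conf (user name is an assumption):
#   cassandra  soft  memlock  unlimited
#   cassandra  hard  memlock  unlimited
```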


Re: Storing counters in the standard column families along with non-counter columns ?

2011-07-14 Thread Chris Burroughs
On 07/13/2011 03:57 PM, Aaron Morton wrote:
> You can always use a dedicated CF for the counters, and use the same row key.

Of course one could do this.  The problem is you are now spending ~2x
disk space on row keys, and app specific client code just became more
complicated.


Survey: Cassandra/JVM Resident Set Size increase

2011-07-12 Thread Chris Burroughs
### Preamble

There have been several reports on the mailing list of the JVM running
Cassandra using "too much" memory.  That is, the resident set size is
>>(max java heap size + mmaped segments) and continues to grow until the
process swaps, kernel oom killer comes along, or performance just
degrades too far due to the lack of space for the page cache.  It has
been unclear from these reports if there is a pattern.  My hope here is
that by comparing JVM versions, OS versions, JVM configuration etc., we
will find something.  Thank you everyone for your time.


Some example reports:
 - http://www.mail-archive.com/user@cassandra.apache.org/msg09279.html
 -
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Very-high-memory-utilization-not-caused-by-mmap-on-sstables-td5840777.html
 - https://issues.apache.org/jira/browse/CASSANDRA-2868
 -
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/OOM-or-what-settings-to-use-on-AWS-large-td6504060.html
 -
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-memory-problem-td6545642.html

For reference theories include (in no particular order):
 - memory fragmentation
 - JVM bug
 - OS/glibc bug
 - direct memory
 - swap induced fragmentation
 - some other bad interaction of cassandra/jdk/jvm/os/nio-insanity.

### Survey

1. Do you think you are experiencing this problem?

2.  Why? (This is a good time to share a graph like
http://www.twitpic.com/5fdabn or
http://img24.imageshack.us/img24/1754/cassandrarss.png)

3. Are you using mmap? (If yes be sure to have read
http://wiki.apache.org/cassandra/FAQ#mmap , and explain how you have
used pmap [or another tool] to rule out mmap and top deceiving you.)

4. Are you using JNA?  Was mlockall successful (it's in the logs on startup)?

5. Is swap enabled? Are you swapping?

6. What version of Apache Cassandra are you using?

7. What is the earliest version of Apache Cassandra you recall seeing
this problem with?

8. Have you tried the patch from CASSANDRA-2654 ?

9. What jvm and version are you using?

10. What OS and version are you using?

11. What are your jvm flags?

12. Have you tried limiting direct memory (-XX:MaxDirectMemorySize)?

13. Can you characterise how much GC your cluster is doing?

14. Approximately how many read/writes per unit time is your cluster
doing (per node or the whole cluster)?

15. How are your column families configured (key cache size, row cache
size, etc.)?



Re: Storing counters in the standard column families along with non-counter columns ?

2011-07-11 Thread Chris Burroughs
On 07/10/2011 01:09 PM, Aditya Narayan wrote:
> Is there any target version in near future for which this has been promised
> ?

The ticket is problematic in that it would -- unless someone has a
clever new idea -- require breaking thrift compatibility to add it to
the api.  Which is unfortunate, since it would be so useful.

If it's in the 0.8.x series it will only be through CQL.


Re: Cassandra DC Upcoming Meetup

2011-07-05 Thread Chris Burroughs
On 06/15/2011 08:57 AM, Chris Burroughs wrote:
> Cassandra DC's first meetup of the pizza and talks variety will be on
> July 6th. There will be an introductory sort of presentation and a
> totally cool one on Pig integration.
> 
> If you are in the DC area it would be great to see you there.
> 
> http://www.meetup.com/Cassandra-DC-Meetup/events/22145481/

My totally anecdotal impression from going to several "Big
Data"/Hadoop/JUG meetups in the DC area  is that there is a reasonable
amount of interest, but not a large amount of production use.  In other
words, this is a great time to bring along your Cassandra Curious
friends and co-workers! Hope to see some of you tomorrow.

Chris Burroughs


Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Chris Burroughs
On 06/23/2011 01:56 PM, Les Hazlewood wrote:
> Is there a roadmap or time to 1.0?  Even a ballpark time (e.g next year 3rd
> quarter, end of year, etc) would be great as it would help me understand
> where it may lie in relation to my production rollout.


The C* devs are rather strongly inclined against putting too much
meaning in version numbers.  The next major release might be called 1.0.
Or maybe it won't.  Either way it won't be different code or support
from something called 0.9 or 10.0.

September 8th is the feature freeze for the next major release.


Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Chris Burroughs
On 06/23/2011 02:00 PM, Les Hazlewood wrote:
> This leads me to believe that Cassandra may not be a good idea for a primary
> OLTP data store.  For example "only create a user object if email foo is not
> already in use" or, more generally, "you can't create object X because one
> with an existing constraint already exists".
> 
> Is that a fair assumption?

I think so.  Lacking the built-in T(ransactions) of OLTP, the amount of
hard thinking you will have to do increases as you want to maintain more
constraints.  The obvious trade-off is that instead of transactions you
get that distributed horizontal scalability stuff with Cassandra.
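A toy illustration of why a constraint like "only create a user if the email is unused" is hard without transactions: check-then-insert races. All names here are invented for the example, and a barrier stands in for unlucky timing.

```python
import threading

# A toy store: plain dict writes with no compare-and-set.  The
# check-then-insert below is the anti-pattern: two clients can both
# pass the check before either write lands.
store = {}
created = []

def create_user(email, user_id, barrier):
    if email not in store:          # read: "is this email free?"
        barrier.wait()              # both clients pass the check...
        store[email] = user_id      # ...then both write (last one wins)
        created.append(user_id)

barrier = threading.Barrier(2)
t1 = threading.Thread(target=create_user, args=("a@example.com", 1, barrier))
t2 = threading.Thread(target=create_user, args=("a@example.com", 2, barrier))
t1.start(); t2.start(); t1.join(); t2.join()

# Both creations "succeeded" even though the constraint allows only one.
print(len(created))   # 2
```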


Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Chris Burroughs
On 06/22/2011 07:12 PM, Les Hazlewood wrote:
> Telling me to read the mailing lists and follow the issue tracker and use
> monitoring software is all great and fine - and I do all of these things
> today already - but this is a philosophical recommendation that does not
> actually address my question.  So I chalk this up as an error on my side in
> not being clear in my question - my apologies.  Let me reformulate it :)

For what it's worth that was intended as a concrete suggestion.  We
adopted Cassandra a year ago, when (IMHO) it would have been a mistake to
do so without the willingness to develop sufficient in-house expertise
to internally patch/fork/debug if needed.  Things are more mature now, best
practices more widespread etc., but you should judge that yourself.

In the spirit of your re-formulated questions:
 - Read-before-write is a Cassandra anti-pattern, avoid it if at all
possible.
 - Those optional lines in the env script about GC logging?  Uncomment
them on at least some of your boxes.
 - use MLOCKALL+mmap, or standard io, but not mmap without MLOCKALL.
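The env-script lines referred to above looked roughly like the following in cassandra-env.sh of this era; they ship commented out. The exact flag set and log path vary by version, so treat these as illustrative:

```shell
# GC logging options from cassandra-env.sh -- uncomment to enable
# (log path is version/package dependent):
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
JVM_OPTS="$JVM_OPTS -XX:+PrintTenuringDistribution"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
```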


Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Chris Burroughs
On 06/22/2011 10:03 PM, Edward Capriolo wrote:
> I have not read the original thread concerning the problem you mentioned.
> One way to avoid OOM is large amounts of RAM :) On a more serious note most
> OOM's are caused by setting caches or memtables too large. If the OOM was
> caused by a software bug, the cassandra devs are on the ball and move fast.
> I still suggest not jumping into a release right away. 

For what it's worth, that particular thread was about the kernel oom
killer, which is a good example of the kind of gotcha that has caused
several people to chime in with the importance of monitoring both
Cassandra and the OS.


Re: OOM (or, what settings to use on AWS large?)

2011-06-22 Thread Chris Burroughs
Do all of the reductions in Used on that graph correspond to node restarts?

My Zabbix for reference: http://img194.imageshack.us/img194/383/2weekmem.png


On 06/22/2011 06:35 PM, Sasha Dolgy wrote:
> http://www.twitpic.com/5fdabn
> http://www.twitpic.com/5fdbdg
> 
> i do love a good graph.  two of the weekly memory utilization graphs
> for 2 of the 4 servers from this ring... week 21 was a nice week ...
> the week before 0.8.0 went out proper.  since then, bumped up to 0.8
> and have seen a steady increase in the memory consumption (used) but
> have not seen the swap do what it did ...and the buffered/cached seems
> much better
> 
> -sd
> 
> On Thu, Jun 23, 2011 at 12:09 AM, Chris Burroughs
>  wrote:
>>
>> In `free` terms, by pegged do you mean that free "Mem" was 0, or "-/+
>> buffers/cache" as 0?



Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Chris Burroughs
On 06/22/2011 05:33 PM, Les Hazlewood wrote:
> Just to be clear:
> 
> I understand that resources like [1] and [2] exist, and I've read them.  I'm
> just wondering if there are any 'gotchas' that might be missing from that
> documentation that should be considered and if there are any recommendations
> in addition to these documents.
> 
> Thanks,
> 
> Les
> 
> [1] http://www.datastax.com/docs/0.8/operations/index
> [2] http://wiki.apache.org/cassandra/Operations
> 

Well, if they knew some secret gotcha, the dutiful cassandra operators of
the world would update the wiki.

The closest thing to a 'gotcha' is that neither Cassandra nor any other
technology is going to get you those nines.  Humans will need to commit
to reading the mailing lists, following JIRA, and understanding what the
code is doing.  And humans will need to commit to combine that
understanding with monitoring and alerting to figure out all of the "it
depends" for your particular case.


Re: OOM (or, what settings to use on AWS large?)

2011-06-22 Thread Chris Burroughs
On 06/22/2011 08:53 AM, Sasha Dolgy wrote:
> Yes ... this is because it was the OS that killed the process, and
> wasn't related to Cassandra "crashing".  Reviewing our monitoring, we
> saw that memory utilization was pegged at 100% for days and days
> before it was finally killed because 'apt' was fighting for resource.
> At least, that's as far as I got in my investigation before giving up,
> moving to 0.8.0 and implementing 24hr nodetool repair on each node via
> cronjobso far ... no problems.

In `free` terms, by pegged do you mean that free "Mem" was 0, or "-/+
buffers/cache" as 0?
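To make the distinction concrete: "Mem" free near zero is normal (the kernel uses spare RAM for page cache), while "-/+ buffers/cache" free near zero is real memory pressure. A small parser over /proc/meminfo-style fields (sample numbers invented) shows the difference:

```python
def truly_free_kb(meminfo_text):
    """Memory the kernel could hand to applications: MemFree plus
    reclaimable page cache and buffers (pre-MemAvailable kernels)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, rest = line.split(":", 1)
        fields[key] = int(rest.strip().split()[0])   # value in kB
    return fields["MemFree"] + fields["Buffers"] + fields["Cached"]

sample = """MemFree: 102400 kB
Buffers: 204800 kB
Cached: 4096000 kB"""

# "Mem free" alone (100 MB) looks pegged, but ~4.2 GB is reclaimable cache.
print(truly_free_kb(sample))   # 4403200
```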


Re: BloomFilterFalsePositives equals 1.0

2011-06-22 Thread Chris Burroughs
To be precise, you made n requests for non-existent keys, got n negative
responses, and BloomFilterFalsePositives also went up by n?
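As a refresher on what healthy looks like, a toy bloom filter (not Cassandra's implementation): probing keys that were never inserted should come back "definitely absent" almost every time, so a false-positive ratio pinned at 1.0 means essentially every negative lookup is falling through to disk.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=8192, k=3):
        self.m, self.k, self.bits = m, k, [False] * m

    def _positions(self, key):
        # k independent hash positions derived from a salted hash
        for i in range(self.k):
            d = hashlib.sha1(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = True

    def might_contain(self, key):
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
for i in range(500):
    bf.add(f"key-{i}")

# Probe keys that were never inserted; a healthy filter rejects
# nearly all of them without touching disk.
false_positives = sum(bf.might_contain(f"missing-{i}") for i in range(1000))
ratio = false_positives / 1000
print(ratio < 0.1)   # True for these parameters
```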

On 06/21/2011 11:06 PM, Preston Chang wrote:
> Hi,all:
>  I have a problem with bloom filter. When made a test which tried to get
> some nonexistent keys, it seemed that the bloom filter does not work. The
> 'BloomFilterFalseRatio' was 1.0 and the 'BloomFilterFalsePositives' was
> rising and the disk I/O utils reached 100% according to 'iostat'.
> 
> I found the patch in
> https://issues.apache.org/jira/browse/CASSANDRA-2637 , but in my cluster key
> cache had been enabled already.  My Cassandra version is 0.7.3. There are 3
> nodes and RF is 3.
> 
> Thanks for your help.
> 



Cassandra DC Upcoming Meetup

2011-06-15 Thread Chris Burroughs
Cassandra DC's first meetup of the pizza and talks variety will be on
July 6th. There will be an introductory sort of presentation and a
totally cool one on Pig integration.

If you are in the DC area it would be great to see you there.

http://www.meetup.com/Cassandra-DC-Meetup/events/22145481/


Re: Data directories

2011-06-09 Thread Chris Burroughs
On 06/08/2011 05:54 AM, Héctor Izquierdo Seliva wrote:
> Is there a way to control what sstables go to what data directory? I
> have a fast but space limited ssd, and a way slower raid, and i'd like
> to put latency sensitive data into the ssd and leave the other data in
> the raid. Is this possible? If not, how well does cassandra play with
> symlinks?
> 

Another option would be to use the ssd as a block level cache with
something like flashcache.


Re: Index interval tuning

2011-05-11 Thread Chris Burroughs
On 05/10/2011 10:24 PM, aaron morton wrote:
> What version and what were the values for RecentBloomFilterFalsePositives and 
> BloomFilterFalsePositives ?
> 
> The bloom filter metrics are updated in SSTableReader.getPosition() the only 
> slightly odd thing I can see is that we do not count a key cache hit a a true 
> positive for the bloom filter. If there were a lot of key cache hits and a 
> few false positives the ratio would be wrong. I'll ask around, does not seem 
> to apply to Hectors case though. 

0.7.1  No key cache.

BloomFilterFalsePositives: 48130
Read Count: 153973494
RecentBloomFilterFalsePositives: 4, 1, 2, 0, 0, 1


Re: Index interval tuning

2011-05-10 Thread Chris Burroughs
On 05/10/2011 02:12 PM, Peter Schuller wrote:
>> That reminds me, my false positive ratio is stuck at 1.0, so I guess
>> bloom filters aren't doing a lot for me.
> 
> That sounds unlikely unless you're hitting some edge case like reading
> a particular row that happened to be a collision, and only that row.
> This is from JMX stats on the column family store?
> 

(From jmx)  I also see BloomFilterFalseRatio stuck at 1.0 on my
production nodes.  The only values that RecentBloomFilterFalseRatio had
over the past several minutes were 0.0 and 1.0.  While I can't prove
that isn't accurate, it is very suspicious.

The code looked reasonable until I got to SSTableReader, which was too
complicated to just glance through.


Re: Native heap leaks?

2011-05-05 Thread Chris Burroughs
On 2011-05-05 06:30, Hannes Schmidt wrote:
> This was my first thought, too. We switched to mmap_index_only and
> didn't see any change in behavior. Looking at the smaps file attached
> to my original post, one can see that the mmapped index files take up
> only a minuscule part of RSS.

I have not looked into smaps before.  But it actually seems odd that the
mmaped Index files are taking up so *little memory*.  Are they only a
few kb on disk?  Is this a snapshot taken shortly after the process
started, or before the OOM killer is presumably about to come along?  How
long does it take to go from 1.1 G to 2.1 G resident?  Either way, it
would be worthwhile to set one node to standard io to make sure it's
really not mmap causing the problem.

Anyway, assuming it's not mmap, here are the other similar threads on
the topic.  Unfortunately none of them claim an obvious solution:

http://www.mail-archive.com/user@cassandra.apache.org/msg09279.html
http://www.mail-archive.com/user@cassandra.apache.org/msg08063.html
http://www.mail-archive.com/user@cassandra.apache.org/msg12036.html
http://mail.openjdk.java.net/pipermail/hotspot-dev/2011-April/004091.html


Cassandra Meetup in DC

2011-05-02 Thread Chris Burroughs
http://www.meetup.com/Cassandra-DC-Meetup/


*What*: First Cassandra DC Meetup

*When*: Thursday, May 12, 2011 at 6:30 PM

*Where*: Northside Social Coffee & Wine - 3211 Wilson Blvd Arlington, VA


I'm pleased to announce the first Cassandra DC Meetup.  Come have
a drink, meet your fellow members, talk about Apache Cassandra, discuss
Greek mythological prophets, and what you want out of the group.


flashcache experimentation

2011-04-18 Thread Chris Burroughs
https://github.com/facebook/flashcache/

"FlashCache is a general purpose writeback block cache for Linux."

We have a case where:
 - Access to data is not uniformly random (let's say Zipfian).
 - The "hot" set > RAM.
 - Size of disk is such that buying enough SSDs, fast drives, multiple
drives, etc would be undesirable.

This seems like a good case for flashcache.  However, as far as I can
tell from searching no one has tried this and posted any results.  I was
wondering if anyone has tried flashcache in a similar situation with
Cassandra and if so how the experience went.
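A back-of-the-envelope sketch of why this case suits a block cache: under Zipf-like access, a cache holding only the hottest slice of the keyspace absorbs most reads. The parameters are invented; this is the idealized math, not a flashcache simulation.

```python
def zipf_hot_mass(n_items, cache_fraction, s=1.0):
    """Fraction of requests served by an idealized cache that holds the
    hottest cache_fraction of items, under Zipf(s)-distributed access."""
    weights = [1.0 / (rank ** s) for rank in range(1, n_items + 1)]
    cutoff = int(n_items * cache_fraction)   # hottest items fit in cache
    return sum(weights[:cutoff]) / sum(weights)

# With 100k items, an SSD cache holding the hottest 10% absorbs the
# large majority of reads:
print(round(zipf_hot_mass(100_000, 0.10), 2))   # 0.81
```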


Re: quick repair tool question

2011-04-12 Thread Chris Burroughs
On 04/12/2011 11:11 AM, Jonathan Colby wrote:
> I'm not sure if this is the "kosher" way to rebuild the sstable data, but it 
> seemed to work.  

http://wiki.apache.org/cassandra/Operations#Handling_failure

Option #3.



Re: CL.ONE reads / RR / badness_threshold interaction

2011-04-12 Thread Chris Burroughs
On 04/12/2011 06:27 PM, Peter Schuller wrote:
>> So to increase pinny-ness I'll further reduce RR chance and set a
>> badness threshold.  Thanks all.
> 
> Just be aware that, assuming I am not missing something, while this
> will indeed give you better cache locality under normal circumstances
> - once that "closest" node does go down, traffic will then go to a
> node which will have potentially zero cache hit rate on that data
> since all reads up to that point were taken by the node that just went
> down.
> 
> So it's not an obvious win depending.


Yeah, there is less than great behaviour when nodes are restarted or
otherwise go down with this configuration.  Probably still preferable
for my current situation.  Others' mileage may vary.


http://img27.imageshack.us/img27/85/cacherestart.png


Analysing hotspot gc logs

2011-04-11 Thread Chris Burroughs
To avoid taking my own thread [1] off on a tangent: does anyone have a
recommendation for a tool for graphical analysis (i.e. making useful
graphs) of hotspot gc logs?  Google searches have turned up several
results along the lines of "go try this zip file" [2].

[1] http://www.mail-archive.com/user@cassandra.apache.org/msg12134.html

[2]
http://mail.openjdk.java.net/pipermail/hotspot-gc-use/2009-August/000420.html


Re: Minor Follow-up: reduced cached mem; resident set size growth

2011-04-08 Thread Chris Burroughs
On 04/05/2011 03:04 PM, Chris Burroughs wrote:

> I have gc logs if anyone is interested.

This is from a node with standard io, jna enabled, but limits were not
set for mlockall to succeed.  One can see -/+ buffers/cache free
shrinking and the C* pid's RSS growing.


Includes several days of:
gc log
free -s
/proc/$PID/status

http://www.filefactory.com/file/ca94892/n/04-08.tar.gz

Please enjoy!  (If there is a preferred way to share the tarball let me
know.)
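For the record, the collection loop was nothing fancier than something like
this sketch (PID, SAMPLES, and INTERVAL are placeholders; the real capture
pointed at the Cassandra pid and ran for days):

```shell
# Sample timestamp, free memory, and the process RSS into one log file.
PID=$$          # stand-in: use the Cassandra pid instead
SAMPLES=2
INTERVAL=1
OUT=memwatch.log

: > "$OUT"
for i in $(seq "$SAMPLES"); do
  {
    date +%s
    # free's output format varies between procps versions; this grabs the Mem line.
    free 2>/dev/null | awk '/^Mem:/ {print "mem_free_kb:", $4}'
    # VmRSS from /proc is the resident set size in kB.
    awk '/^VmRSS/ {print "rss_kb:", $2}' "/proc/$PID/status"
  } >> "$OUT"
  sleep "$INTERVAL"
done
```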


Re: CL.ONE reads / RR / badness_threshold interaction

2011-04-07 Thread Chris Burroughs
Peter, thank you for the extremely detailed reply.

To now answer my own question, the critical points that are different
from what I said earlier are: that CL.ONE does prefer *one* node (which
one depending on snitch) and that RR uses digests (which are not
mentioned on the wiki page [1]) instead of comparing raw requests.
Totally tangential, but in the case of CL.ONE with narrow rows, making
the full request and taking the fastest response would probably be better;
having things work both ways depending on row size, though, sounds painfully
complicated.  (As Aaron points out, this is not how things work now.)

I am assuming that RR digests save on bandwidth, but to generate the
digest with a row cache miss the same number of disk seeks are required
(my nemesis is disk io).

So to increase pinny-ness I'll further reduce RR chance and set a
badness threshold.  Thanks all.


[1] http://wiki.apache.org/cassandra/ReadRepair


CL.ONE reads / RR / badness_threshold interaction

2011-04-06 Thread Chris Burroughs
My understanding for CL.ONE, for the node that receives the request:

(A) If RR is enabled and this node contains the needed row --> return
immediately and do RR to remaining replicas in background.
(B) If RR is off and this node contains the needed row --> return the
needed data immediately.
(C) If this node does not have the needed row --> regardless of RR ask
all replicas and return the first result.
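A toy sketch of my reading of those three cases (illustrative only, not
Cassandra's actual code path; `read_one` is a made-up name):

```shell
# read_one HAS_ROW RR_ENABLED -- echoes which of cases (A)-(C) applies.
read_one() {
  local has_row=$1 rr_enabled=$2
  if [ "$has_row" = yes ] && [ "$rr_enabled" = yes ]; then
    echo "A: return local row, read-repair replicas in background"
  elif [ "$has_row" = yes ]; then
    echo "B: return local row"
  else
    echo "C: ask all replicas, return first response"
  fi
}

read_one no yes   # -> C: ask all replicas, return first response
```

Note that in case (C) the RR setting is irrelevant, which is exactly what
makes pinning impossible under my description.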


However case (C) as I have described it does not allow for any notion of
'pinning' as mentioned for dynamic_snitch_badness_threshold:

# if set greater than zero and read_repair_chance is < 1.0, this will allow
# 'pinning' of replicas to hosts in order to increase cache capacity.
# The badness threshold will control how much worse the pinned host has
to be
# before the dynamic snitch will prefer other replicas over it.  This is
# expressed as a double which represents a percentage.  Thus, a value of
# 0.2 means Cassandra would continue to prefer the static snitch values
# until the pinned host was 20% worse than the fastest.
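To make the percentage concrete, a throwaway check (the scores here are
made-up numbers; the dynamic snitch's real scoring is more involved):

```shell
# With threshold 0.2, stay pinned until pinned_score > fastest_score * 1.2.
pinned=12.0     # made-up latency score for the currently pinned host
fastest=9.0     # made-up score for the fastest replica
threshold=0.2   # dynamic_snitch_badness_threshold

decision=$(awk -v p="$pinned" -v f="$fastest" -v t="$threshold" \
  'BEGIN { if (p > f * (1 + t)) print "switch replicas"; else print "stay pinned" }')
echo "$decision"   # -> switch replicas (12 > 9 * 1.2)
```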


The wiki states CL.ONE "Will return the record returned by the first
replica to respond" [1] implying that the request goes to multiple
replicas, but datastax's docs state that only one node will receive the
request ("Returns the response from *the* closest replica, as determined
by the snitch configured for the cluster" [2]).

Could someone clarify how CL.ONE reads with RR off work?


[1] http://wiki.apache.org/cassandra/API
[2]
http://www.datastax.com/docs/0.7/consistency/index#choosing-consistency-levels
 emphasis added


Re: Minor Follow-up: reduced cached mem; resident set size growth

2011-04-06 Thread Chris Burroughs
On 04/05/2011 04:38 PM, Peter Schuller wrote:
>> - Different collectors: -XX:+UseParallelGC -XX:+UseParallelOldGC
> 
> Unless you also removed the -XX:+UseConcMarkSweepGC I *think* it takes
> precedence, so that the above options would have no effect. I didn't
> test. In either case, did you definitely confirm CMS was no longer
> being used? (Should be pretty obvious if you ran with
> -XX:+PrintGCDetails which looks plenty different w/o CMS)
> 

More precisely, I did this:

# GC tuning options
#JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
#JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
#JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
#JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
#JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
#JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
#JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"
JVM_OPTS="$JVM_OPTS -XX:+UseParallelGC"
JVM_OPTS="$JVM_OPTS -XX:+UseParallelOldGC"


>> I have gc logs if anyone is interested.
> 
> Yes :)
>


By "have gc logs" I meant "had them until I accidentally blew them away
while restarting a server".  Will post them in a day or two, when there
is a reasonable amount of data or the quantum state collapses and the
problem vanishes when it is observed.


>> [1] http://img194.imageshack.us/img194/383/2weekmem.png
> 
> I did go back and revisit the old thread... maybe I'm missing
> something, but just to be real sure:
> 
> What does the "no color"/white mean on this graph? Is that application
> memory (resident set)?
> 
> I'm not really sure what I'm looking for since you already said you
> tested with 'standard' which rules out the
> resident-set-memory-as-a-result-of-mmap being counted towards the
> leak. But still.
> 

I will be the first to admit that Zabbix's graphs are not the... easiest
to read.  My interpretation is that no "color" is "none of the above",
and by being unavailable is thus in use by applications.  This fits with
what I see with free and with measurements of the jvm's RSS from /proc/.
I'll leave free -s going for a few days while waiting on the gc logs as
an extra sanity test.  That's probably easier to reason about anyway.


Minor Follow-up: reduced cached mem; resident set size growth

2011-04-05 Thread Chris Burroughs
This is a minor followup to this thread which includes required context:

http://www.mail-archive.com/user@cassandra.apache.org/msg09279.html

I haven't solved the problem, but since negative results can also be
useful I thought I would share them.  Things I tried unsuccessfully (on
individual nodes except for the upgrade):

- Upgrade from Cassandra 0.6 to 0.7
- Different collectors: -XX:+UseParallelGC -XX:+UseParallelOldGC
- JNA (but not mlockall)
- Switch disk_access_mode from standard to mmap_index_only (obviously in
this case RSS is less than useful, but overall memory graph still was
bad looking like this [1]).


On #cassandra there was speculation that a large (200k) row cache may be
inducing heap fragmentation.  I have not ruled this out, but have been
unable to reproduce it in stand-alone ConcurrentLinkedHashMap stress testing.
 Since turning off the row cache would be a cure worse than the disease,
I have not tried that yet with a real cluster.

Future possibilities would be to get the limits set right for mlockall,
trying combinations of the above, and running without caches.

I have gc logs if anyone is interested.

[1] http://img194.imageshack.us/img194/383/2weekmem.png


Re: IndexInterval Tuning

2011-04-05 Thread Chris Burroughs
On 04/05/2011 09:57 AM, Jonathan Ellis wrote:
> On Tue, Apr 5, 2011 at 8:54 AM, Jonathan Ellis  wrote:
>> Adjusting indexinterval is unlikely to be useful on very narrow rows.
>> (Its purpose is to make random access to _large_ rows doable.)
> 
> Whoops, that's column_index_size_in_kb.
> 
> I'd play w/ keycache before index_interval personally.  (If you can
> get 100% key cache hit rate it doesn't really matter what index
> interval is, as long as you can still build the cache effectively.)


I've already tried a key cache equal to and larger (up to what I have
heap space for) than my current row cache.  But for very narrow rows the
row cache is empirically and theoretically better.

I realise changing IndexInterval is an unusual proposed configuration,
but such is the burden of high cardinality narrow rows.


IndexInterval Tuning

2011-04-04 Thread Chris Burroughs
I have a case with very narrow rows.  As such I have a large row cache
that nicely handles > 50% of requests.  I think it's likely that
the current tradeoff between page cache and row cache is reasonable.
Using a key cache doesn't make sense in this instance.  However, a third
option is to adjust the IndexInterval [1] [2].

It would theoretically be reasonable to:
 - Increase the IndexInterval to make more memory available for row or
page cache.
 - Decrease the IndexInterval for more effective sampling.
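Either way, the memory at stake is easy to ballpark (the key count and
per-sample cost below are made-up round numbers, not measurements):

```shell
rows=10000000        # hypothetical key count for a high-cardinality CF
interval=128         # IndexInterval default at the time, as I understand it
bytes_per_sample=32  # rough guess at per-sample overhead in memory

samples=$(( rows / interval ))
echo "index samples: $samples (~$(( samples * bytes_per_sample / 1024 )) KiB)"
# -> index samples: 78125 (~2441 KiB)
```

Doubling the interval halves the sample count; whether the reclaimed memory
does more good in the row cache or page cache is the open question.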

When the knob could be turned either direction it's hard to know which
way to start.

Does anyone have any experience successfully adjusting the IndexInterval
for improved performance with narrow rows?


[1] https://issues.apache.org/jira/browse/CASSANDRA-1488

[2] http://www.datastax.com/dev/blog/whats-new-cassandra-066


Re: How to determine if repair need to be run

2011-03-30 Thread Chris Burroughs
On 03/29/2011 01:18 PM, Peter Schuller wrote:
> (What *would* be useful perhaps is to be able to ask a node for the
> time of its most recently started repair, to facilitate easier
> comparison with GCGraceSeconds for monitoring purposes.)

I concur.  JIRA time?

(Perhaps keeping track of the same thing for major compactions would
also be useful?)


Re: On 0.6.6 to 0.7.3 migration, DC-aware traffic and minimising data transfer

2011-03-14 Thread Chris Burroughs
On 03/11/2011 03:46 PM, Jonathan Ellis wrote:
> Repairs is not yet WAN-optimized but is still cheap if your replicas
> are close to consistent since only merkle trees + inconsistent ranges
> are sent over the network.
> 

What is the ticket number for WAN optimized repair?


Re: memory utilization

2011-03-11 Thread Chris Burroughs
On 03/10/2011 09:26 PM, Bill Hastings wrote:
> Hi All
> 
> Memory utilization reported by JCOnsole for Cassandra seems to be much
> lesser than that reported by top ("RES" memory). Can someone explain this?
> Maybe off topic but would appreciate a response.
> 

Is there an more or less constant amount of resident memory, or is it
growing over a period of days?


Re: Reducing memory footprint

2011-03-07 Thread Chris Burroughs
On 03/04/2011 03:51 PM, Casey Deccio wrote:
> Are you saying: that you want a smaller heap and what settings to change
>> to accommodate that, or that you have already set a small heap of x and
>> Cassandra is using significantly more than that?
>>
> 
> Based on my observation above, the latter.
> 
> Casey
> 

As  Aaron said then the first things to look at are your jvm settings,
jvm version, and io configuration (standard v mmap).


You may also wish to read this thread:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/reduced-cached-mem-resident-set-size-growth-td5967110.html


Re: cassandra in-production experiences with .7 series

2011-03-07 Thread Chris Burroughs
On 03/05/2011 05:27 PM, Paul Pak wrote:
> Hello all,
> 
> I was wondering if people could share their overall experiences with
> using .7 series of Cassandra in production?  Is anyone using it?
> 

For what it's worth, we are using a dozen-node 0.7.x cluster and have not had
any major problems (our use cases dodged most of the less pleasant
bugs).  This replaced a smaller 0.6.x cluster that we were not happy with.

Whether the new code really helped or not we didn't have time to
experimentally determine (the main feature we wanted was mx4j, due to
idiosyncratic features of our monitoring system).


Re: OOM exceptions

2011-03-04 Thread Chris Burroughs
- Are you using a key cache?  How many keys do you have?  Across how
many column families?

Your configuration is unusual, both in terms of not setting min heap ==
max heap and the percentage of available RAM used for the heap.  Did you
change the heap size in response to errors, or for another reason?

On 03/04/2011 03:25 PM, Mark wrote:
> This happens during compaction and we are not using the RowsCached
> attribute.
> 
> Our initial/max heap are 2 and 6 respectively and we have 8 gigs in
> these machines.
> 
> Thanks
> 
> On 3/4/11 12:05 PM, Chris Burroughs wrote:
>> - Does this occur only during compaction or at seemingly random times?
>> - How large is your heap?  What jvm settings are you using? How much
>> physical RAM do you have?
>> - Do you have the row and/or key cache enabled?  How are they
>> configured?  How large are they when the OOM is thrown?
>>
>> On 03/04/2011 02:38 PM, Mark Miller wrote:
>>> Other than adding more memory to the machine is there a way to solve
>>> this? Please help. Thanks
>>>
>>> ERROR [COMPACTION-POOL:1] 2011-03-04 11:11:44,891 CassandraDaemon.java
>>> (line org.apache.cassandra.thrift.CassandraDaemon$1) Uncaught exception
>>> in thread Thread[COMPACTION-POOL:1,5,main]
>>> java.lang.OutOfMemoryError: Java heap space
>>>  at java.util.Arrays.copyOf(Arrays.java:2798)
>>>  at
>>> java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:111)
>>>  at java.io.DataOutputStream.write(DataOutputStream.java:107)
>>>  at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>>>  at
>>> org.apache.cassandra.utils.FBUtilities.writeByteArray(FBUtilities.java:298)
>>>
>>>  at
>>> org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:66)
>>>
>>>
>>>  at
>>> org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:311)
>>>
>>>
>>>  at
>>> org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:284)
>>>
>>>
>>>  at
>>> org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)
>>>
>>>
>>>  at
>>> org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:99)
>>>
>>>
>>>  at
>>> org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:140)
>>>
>>>
>>>  at
>>> org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43)
>>>
>>>
>>>  at
>>> org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
>>>
>>>
>>>  at
>>> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>>>
>>>
>>>  at
>>> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>>>
>>>
>>>  at
>>> org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
>>>
>>>
>>>  at
>>> org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
>>>
>>>
>>>  at
>>> org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:294)
>>>
>>>
>>>  at
>>> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:101)
>>>
>>>
>>>  at
>>> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:82)
>>>
>>>  at
>>> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>>>  at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>>>  at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>>
>>>
>>>  at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>>
>>>
>>>  at java.lang.Thread.run(Thread.java:636)
>>>
> 



Re: OOM exceptions

2011-03-04 Thread Chris Burroughs
See also:
http://www.datastax.com/docs/0.7/troubleshooting/index#nodes-are-dying-with-oom-errors

On 03/04/2011 03:05 PM, Chris Burroughs wrote:
> - Does this occur only during compaction or at seemingly random times?
> - How large is your heap?  What jvm settings are you using? How much
> physical RAM do you have?
> - Do you have the row and/or key cache enabled?  How are they
> configured?  How large are they when the OOM is thrown?
> 
> On 03/04/2011 02:38 PM, Mark Miller wrote:
>> Other than adding more memory to the machine is there a way to solve
>> this? Please help. Thanks
>>
>> ERROR [COMPACTION-POOL:1] 2011-03-04 11:11:44,891 CassandraDaemon.java
>> (line org.apache.cassandra.thrift.CassandraDaemon$1) Uncaught exception
>> in thread Thread[COMPACTION-POOL:1,5,main]
>> java.lang.OutOfMemoryError: Java heap space
>> at java.util.Arrays.copyOf(Arrays.java:2798)
>> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:111)
>> at java.io.DataOutputStream.write(DataOutputStream.java:107)
>> at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>> at
>> org.apache.cassandra.utils.FBUtilities.writeByteArray(FBUtilities.java:298)
>> at
>> org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:66)
>>
>> at
>> org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:311)
>>
>> at
>> org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:284)
>>
>> at
>> org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)
>>
>> at
>> org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:99)
>>
>> at
>> org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:140)
>>
>> at
>> org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43)
>>
>> at
>> org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
>>
>> at
>> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
>>
>> at
>> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
>>
>> at
>> org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
>>
>> at
>> org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
>>
>> at
>> org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:294)
>>
>> at
>> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:101)
>>
>> at
>> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:82)
>> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
>> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
>>
>> at java.lang.Thread.run(Thread.java:636)
>>
> 



Re: OOM exceptions

2011-03-04 Thread Chris Burroughs
- Does this occur only during compaction or at seemingly random times?
- How large is your heap?  What jvm settings are you using? How much
physical RAM do you have?
- Do you have the row and/or key cache enabled?  How are they
configured?  How large are they when the OOM is thrown?

On 03/04/2011 02:38 PM, Mark Miller wrote:
> Other than adding more memory to the machine is there a way to solve
> this? Please help. Thanks
> 
> ERROR [COMPACTION-POOL:1] 2011-03-04 11:11:44,891 CassandraDaemon.java
> (line org.apache.cassandra.thrift.CassandraDaemon$1) Uncaught exception
> in thread Thread[COMPACTION-POOL:1,5,main]
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:2798)
> at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:111)
> at java.io.DataOutputStream.write(DataOutputStream.java:107)
> at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
> at
> org.apache.cassandra.utils.FBUtilities.writeByteArray(FBUtilities.java:298)
> at
> org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:66)
> 
> at
> org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:311)
> 
> at
> org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:284)
> 
> at
> org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)
> 
> at
> org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:99)
> 
> at
> org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:140)
> 
> at
> org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:43)
> 
> at
> org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73)
> 
> at
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
> 
> at
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
> 
> at
> org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183)
> 
> at
> org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94)
> 
> at
> org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:294)
> 
> at
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:101)
> 
> at
> org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:82)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
> at java.util.concurrent.FutureTask.run(FutureTask.java:166)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> 
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> 
> at java.lang.Thread.run(Thread.java:636)
> 



Re: Reducing memory footprint

2011-03-04 Thread Chris Burroughs
On 03/04/2011 01:53 PM, Casey Deccio wrote:
> I have a small ring of cassandra nodes that have somewhat limited memory
> capacity for the moment.  Cassandra is eating up all the memory on these
> nodes.  I'm not sure where to look first in terms of reducing the foot
> print.  Keys cached?  Compaction?
> 
> Any hints would be greatly appreciated.
> 
> Regards,
> Casey
> 

What do you mean by "eating up the memory"?  Resident set size, low
memory available to page cache, excessive gc of the jvm's heap?

Are you saying: that you want a smaller heap and what settings to change
to accommodate that, or that you have already set a small heap of x and
Cassandra is using significantly more than that?


Re: Column name size

2011-02-11 Thread Chris Burroughs
On 02/11/2011 05:06 AM, Patrik Modesto wrote:
> Hi all!
> 
> I'm thinking if size of a column name could matter for a large dataset
> in Cassandra  (I mean lots of rows). For example what if I have a row
> with 10 columns each has 10 bytes value and 10 bytes name. Do I have
> half the row size just of the column names and the other half of the
> data (not counting storage overhead)?  What if I have 10M of these
> rows? Is there a difference? Should I use some 3bytes codes for a
> column name to save memory/bandwidth?
> 
> Thanks,
> Patrik

You are correct that for small row/column values the key or column name
itself can represent a large proportion of the total size.  I think you
will find the consensus on this list is that trying to be clever with
names is usually not worth the additional complexity.
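For your concrete numbers, the arithmetic checks out (storage overhead
ignored, as in your example):

```shell
rows=10000000    # 10M rows
cols=10          # 10 columns per row
name_bytes=10    # 10-byte column names
value_bytes=10   # 10-byte values

names=$(( rows * cols * name_bytes ))
total=$(( rows * cols * (name_bytes + value_bytes) ))
echo "names: $(( 100 * names / total ))% of raw payload, $(( names / 1024 / 1024 )) MiB"
# -> names: 50% of raw payload, 953 MiB
```

So yes, half the raw payload is names, and the fraction is the same at 1
row or 10M rows; only the absolute number of bytes grows.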

The right solution to this is
https://issues.apache.org/jira/browse/CASSANDRA-47.

