Re: Losing keyspace on cassandra upgrade

2012-09-19 Thread Edward Sargisson
mmitLogReplayer.java (line 103) Skipped 
7506 mutations from unknown (probably removed) CF with id 1183
  INFO [main] 2012-09-19 15:15:50,325 CommitLog.java (line 131) Log replay 
complete, 0 replayed mutations

This is the first obvious indication something is wrong. Going further up in 
the log file I discover that the SSTableReader logs only system keyspace files.

Currently my cluster is in the following state:

node 1 runs cassandra 1.1.5, and doesn't know my keyspace
node 2 runs cassandra 1.1.1, and still knows my keyspace.

nodetool ring confirms this: node 1 has a load of 29 KB, node 2 of roughly 1 GB. 
The cluster itself is still intact, i.e. nodetool ring shows both nodes.

I tried a nodetool resetlocalschema, and nodetool repair, but that didn't 
change anything.

Any idea what I have been doing wrong (the preferred solution), or whether I 
stumbled over a cassandra bug (not so nice)?


   TIA, Thomas





--

Edward Sargisson

senior java developer
Global Relay

edward.sargis...@globalrelay.net <mailto:edward.sargis...@globalrelay.net>


*866.484.6630*
New York | Chicago | Vancouver | London (+44.0800.032.9829) | Singapore 
(+65.3158.1301)


Global Relay Archive supports email, instant messaging, BlackBerry, 
Bloomberg, Thomson Reuters, Pivot, YellowJacket, LinkedIn, Twitter, 
Facebook and more.



Ask about *Global Relay Message* 
<http://www.globalrelay.com/services/message>*--- *The Future of 
Collaboration in the Financial Services World


*
*All email sent to or from this address will be retained by Global 
Relay's email archiving system. This message is intended only for the 
use of the individual or entity to which it is addressed, and may 
contain information that is privileged, confidential, and exempt from 
disclosure under applicable law.  Global Relay will not be liable for 
any compliance or technical information provided herein. All trademarks 
are the property of their respective owners.




Re: Losing keyspace on cassandra upgrade

2012-09-19 Thread Edward Sargisson

https://issues.apache.org/jira/browse/CASSANDRA-4583

On 12-09-19 08:30 AM, Michael Kjellman wrote:

@Edward Do you have a bug number for that by chance?

On Sep 19, 2012, at 8:25 AM, "Edward Sargisson" <edward.sargis...@globalrelay.net> wrote:

We've seen that before too - supposedly it was fixed in 1.1.5. Your experience 
casts some doubt on that.

Our workaround, thus far, is to shut down the entire ring and then bring each 
node back up starting with known good.
Then you do nodetool resetlocalschema on the node that's confused and make sure 
it gets the schema linked up properly.
Then nodetool repair.

I see you've done that but we found a complete ring restart was necessary. This 
was on Cass 1.1.1.
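
The workaround above can be written down as an ordered plan. This is only a sketch: the `ssh` and `service cassandra stop/start` commands are assumptions about the environment (they are not from this thread), and the host names are placeholders. Only the `nodetool resetlocalschema` and `nodetool repair` steps come from the description above.

```python
def ring_restart_plan(known_good, confused, others=()):
    """Return the ordered commands for the full-ring-restart workaround.

    The stop/start commands are hypothetical; substitute whatever stops and
    starts Cassandra on your hosts.
    """
    nodes = [known_good, *others, confused]
    # 1. Shut down the entire ring.
    plan = [f"ssh {n} sudo service cassandra stop" for n in nodes]
    # 2. Bring nodes back one by one, known-good node first.
    plan += [f"ssh {n} sudo service cassandra start" for n in nodes]
    # 3. Reset the schema on the confused node, then repair it.
    plan.append(f"nodetool -h {confused} resetlocalschema")
    plan.append(f"nodetool -h {confused} repair")
    return plan
```

Printing the plan and running each line by hand leaves room to check that the schema has linked up properly before the repair step.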

Cheers,
Edward

On 12-09-19 08:12 AM, Michael Kjellman wrote:

Sounds like you are losing your system keyspace. When you say nothing 
important changed between yaml files do you mean with or without your changes?

Did your data directories change in the migration? Permissions okay?

I've done a 1.1.1 to 1.1.5 upgrade on many of my nodes without issue.

On Sep 19, 2012, at 7:44 AM, "Thomas Stets" <thomas.st...@gmail.com> wrote:



I consistently keep losing my keyspace on upgrading from cassandra 1.1.1 to 
1.1.5

I have the same cassandra keyspace on all our staging systems:

development:  a 3-node cluster
integration: a 3-node cluster
QS: a 2-node cluster
(production will be a 4-node cluster, which is not yet active)

All clusters were running cassandra 1.1.1. Before going into production I wanted to
upgrade to the latest production version of cassandra.

In all cases my keyspace disappeared when I started the cluster with cassandra 1.1.5.
On the development system I didn't realize at first what was happening. I just
wondered why nodetool showed such a small amount of data. On integration I saw the
problem quickly, but could not recover the data. I re-installed the cassandra cluster
from scratch and populated it with our test data, so our developers could work.

I am currently using the QS system to recreate the problem, to find out what I am
doing wrong and how I can avoid losing production data once we are live.

Basically I was doing the following:

1. create a snapshot on every node
2. create a tar.gz of my data directory, just to be safe
3. shut down and re-start cassandra 1.1.1 (just to see that it is not the 
re-start that is creating the problem)
4. verify that the keyspace is still known, and the data present.
5. shut down cassandra 1.1.1
6. copy the config to cassandra 1.1.5 (doing a diff of cassandra.yaml to the 
new one first, to see whether anything important has changed)
7. start cassandra 1.1.5
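
Step 6 (diffing cassandra.yaml) is easier to judge when comment-only changes are filtered out. A minimal sketch using Python's difflib; the function name is made up for illustration, and it compares plain text lines, so a setting that merely moved will also show up as a change.

```python
import difflib

def changed_settings(old_yaml, new_yaml):
    """Return ('-' or '+', line) pairs for non-comment lines that differ."""
    changes = []
    for entry in difflib.ndiff(old_yaml.splitlines(), new_yaml.splitlines()):
        marker, text = entry[:2], entry[2:].strip()
        # Keep only removed/added lines; skip blanks and comment-only lines.
        if marker in ("- ", "+ ") and text and not text.startswith("#"):
            changes.append((marker[0], text))
    return changes
```

Feeding it the contents of the 1.1.1 and 1.1.5 files gives a short list of settings to review by hand before the first start.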

In the log file, after the "Replaying ..." messages I find the following:

  INFO [main] 2012-09-19 15:15:50,323 CommitLogReplayer.java (line 103) Skipped 
759 mutations from unknown (probably removed) CF with id 1187
  INFO [main] 2012-09-19 15:15:50,323 CommitLogReplayer.java (line 103) Skipped 
606 mutations from unknown (probably removed) CF with id 1186
  INFO [main] 2012-09-19 15:15:50,323 CommitLogReplayer.java (line 103) Skipped 
53 mutations from unknown (probably removed) CF with id 1185
  INFO [main] 2012-09-19 15:15:50,323 CommitLogReplayer.java (line 103) Skipped 
1945 mutations from unknown (probably removed) CF with id 1184
  INFO [main] 2012-09-19 15:15:50,323 CommitLogReplayer.java (line 103) Skipped 
1945 mutations from unknown (probably removed) CF with id 1191
  INFO [main] 2012-09-19 15:15:50,323 CommitLogReplayer.java (line 103) Skipped 
7506 mutations from unknown (probably removed) CF with id 1190
  INFO [main] 2012-09-19 15:15:50,324 CommitLogReplayer.java (line 103) Skipped 
88 mutations from unknown (probably removed) CF with id 1189
  INFO [main] 2012-09-19 15:15:50,324 CommitLogReplayer.java (line 103) Skipped 
87 mutations from unknown (probably removed) CF with id 1188
  INFO [main] 2012-09-19 15:15:50,324 CommitLogReplayer.java (line 103) Skipped 
354 mutations from unknown (probably removed) CF with id 1195
  INFO [main] 2012-09-19 15:15:50,324 CommitLogReplayer.java (line 103) Skipped 
87 mutations from unknown (probably removed) CF with id 1194
  INFO [main] 2012-09-19 15:15:50,324 CommitLogReplayer.java (line 103) Skipped 
45 mutations from unknown (probably removed) CF with id 1192
  INFO [main] 2012-09-19 15:15:50,324 CommitLogReplayer.java (line 103) Skipped 
82 mutations from unknown (probably removed) CF with id 1197
  INFO [main] 2012-09-19 15:15:50,324 CommitLogReplayer.java (line 103) Skipped 
46386 mutations from unknown (probably removed) CF with id 1177
  INFO [main] 2012-09-19 15:15:50,324 CommitLogReplayer.java (line 103) Skipped 
69 mutations from unknown (probably removed) CF with id 1178
  INFO [main] 2012-09-19 15:15:50,325 CommitLogReplayer.java (line 103) Skipped 
73 mutations from unknown (probably removed) CF with id 1179
  INFO [main] 2012-09-19 1

Re: Nodetool repair, exit code/status?

2012-10-09 Thread Edward Sargisson

This is a problem for us as well.
Our current planned approach is to parse the logs for repair errors.
Having nodetool repair return an exit code for some of these failures 
would be *very* useful.
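
A rough sketch of the log-parsing approach, combined with the exit code for whatever it is worth. Assumptions: the script runs on the node being repaired, the log uses the "LEVEL [thread] date ..." layout seen in log excerpts on this list, and any ERROR line mentioning "repair" counts as a failure. The exact messages Cassandra writes are not confirmed here, and log rotation during the repair is ignored.

```python
import os
import subprocess

def error_lines(log_lines, keyword="repair"):
    """Return ERROR-level lines that mention keyword (case-insensitive)."""
    hits = []
    for line in log_lines:
        if line.lstrip().startswith("ERROR [") and keyword.lower() in line.lower():
            hits.append(line.rstrip("\n"))
    return hits

def repair_succeeded(host, log_path, run=subprocess.run):
    """Run `nodetool -h host repair` and check both exit code and new log lines."""
    start = os.path.getsize(log_path)  # only inspect lines written during this run
    result = run(["nodetool", "-h", host, "repair"])
    with open(log_path, "rb") as log:
        log.seek(start)
        fresh = log.read().decode("utf-8", errors="replace").splitlines()
    return result.returncode == 0 and not error_lines(fresh)
```

The `run` parameter exists only so the check can be exercised without a live cluster.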


Cheers,
Edward

On 12-10-08 06:49 PM, David Daeschler wrote:

Hello.

In the process of trying to streamline and provide better reporting
for various data storage systems, I've realized that although we're
verifying that nodetool repair runs, we're not verifying that it is
successful.

I found a bug relating to the exit code for nodetool repair, where, in
some situations, there is no way to verify the repair has completed
successfully: https://issues.apache.org/jira/browse/CASSANDRA-2666

Is this still a problem? What is the best way to monitor the final
status of the repair command to make sure all is well?


Thank you ahead of time for any info.
- David






Java 7 support?

2012-10-16 Thread Edward Sargisson

Hi all,
The Datastax documentation says that Java 7 is not recommended[1]. 
However, Java 6 is due to EOL in Feb 2013 so what is the reasoning 
behind that comment?


Is it something we should be still concerned about?

Cheers,
Edward

Links:
[1] http://www.datastax.com/docs/1.1/install/install_deb, step 1




Re: cannot parse 'name' as hex bytes

2012-12-05 Thread Edward Sargisson

You're not casting the types.
Cassandra stores everything as bytes. You either need to set the 
key_validation_class to UTF8Type or use the utf8() function to convert.


http://www.datastax.com/docs/1.1/dml/using_cli
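
To illustrate why the error mentions hex: when no type is declared, the text typed at the prompt is apparently read as hex-encoded bytes. The Python below is only an analogy, not Cassandra's code. It shows that "1234" happens to be valid hex while "name" is not, and what the UTF-8 bytes of "name" look like.

```python
def parse_as_hex(text):
    """Return the bytes that hex-decoding text yields, or None if it is not hex."""
    try:
        return bytes.fromhex(text)
    except ValueError:
        return None

def utf8_hex(text):
    """Hex form of the UTF-8 encoding of text."""
    return text.encode("utf-8").hex()
```

Here `parse_as_hex("1234")` yields two bytes and `parse_as_hex("name")` yields None, which matches the key being accepted and the column name being rejected. `utf8_hex("name")` gives `6e616d65`.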


On 12-12-05 03:14 AM, Yogesh Dhari wrote:


Hi all,

I am very new to Cassandra,


I am using version 1.1.7 and followed the steps for a single-node machine 
mentioned in GETTING STARTED.


I have created a keyspace named Demo and then tried to create a column 
family named Work as follows:



[default@DEMO] create column family Work ;
5c85706f-87fe-38f1-b23f-c6180e45d178
Waiting for schema agreement...
... schemas agree across the cluster

Now if I do..

[default@DEMO] set Work[1234][name] = scott ;

I got this error.

org.apache.cassandra.db.marshal.MarshalException: cannot parse 'name' 
as hex bytes



Please help and suggest.

Thanks & Regards
Yogesh Kumar









Re: Force data to a specific node

2013-01-02 Thread Edward Sargisson
Why would you want to?


From: Everton Lima 
To: Cassandra-User 
Sent: Wed Jan 02 18:03:49 2013
Subject: Force data to a specific node

Is it possible to force data to stay on a specific node?

--
Everton Lima Aleixo
Bacharel em Ciência da Computação pela UFG
Mestrando em Ciência da Computação pela UFG
Programador no LUPA



Limitations on secondary indexes

2013-03-07 Thread Edward Sargisson
Hi,
Please correct me if this statement is wrong.

Secondary indexes are limited to indexing 2 billion rows - because they
turn a row into a column and C* has a limit of 2 billion columns.

Cheers,
Edward


Performance impact of static vs dynamic columns and mixing the two in the same CF

2012-06-05 Thread Edward Sargisson

Hi all,
A question has come up in our team about the performance impact of 
static vs dynamic columns. We'd like to ask two questions:


Quick background: We are using a custom app to write to Cassandra using 
Hector. Production is Solaris and pre-prod is generally Centos. We're 
currently on 0.7 but will be moving to 1.1 very shortly.


1. Does specifying the type of a column affect performance other than 
the cost of validating data as it is stored?

e.g. does it help compaction, etc?
From my reading of the docs the advantage is that the data will be 
validated on write and that the various dev tools can deserialize into a 
human readable form easily.


2. Is there any impact to mixing static and dynamic columns in the same 
column family? (Follow-up question: is this far outside of the 
designers' intentions and thus unsafe?)
The docs seem to indicate that the designers think of static column 
families and dynamic column families and *not* a mixture of the two.


My mental model is that a column is just a column. It's possible to 
specify some metadata about columns for validation and display but 
that's about it. Is there something to change this model?


Thanks in advance for any comments.

Cheers,
Edward






Change of behaviour in multiget_slice query for unknown keys between 0.7 and 1.1?

2012-06-18 Thread Edward Sargisson

Hi all,
Was there a change of behaviour in multiget_slice query in Cassandra or 
Hector between 0.7 and 1.1 when dealing with a key that doesn't exist?


We've just upgraded and our in memory unit test is failing (although 
just on my machine). The test code is looking for a key that doesn't 
exist and expects to get null. Instead it gets a ColumnSlice with a 
single column called val. If there were something there then we'd expect 
columns with names like bytes, int or string. Other rows in the column 
family have those columns as well as val.


Is there a reason for this behaviour?
I'd like to see if there was an explanation before I change the unit 
test for it.


Many thanks in advance,
Edward





Node doesn't rejoin ring after restart

2012-08-03 Thread Edward Sargisson

Hi all,
I'm testing our procedures for handling some Cassandra failure scenarios 
and I'm not understanding something.


I'm testing on a 3 node cluster with a replication_factor of 3.
I stopped one of the nodes for 5 or so minutes and ran some application 
tests. Everything was fine.


Then I started cassandra on that node again and it refuses to re-join 
the ring. It can see itself as up but not the other nodes. The other 
nodes can see themselves but don't see it as up.


I deliberately haven't followed any of the token replacement methods 
outlined in the docs. I'm working on the assumption that a small outage 
on one node shouldn't cause extraordinary action.


Nor do I want to have to stop every node before bringing them up one by one.

What am I missing? Am I forced into those time consuming methods every 
time I want to restart?


Thoughts?

Cheers,
Edward





Node forgets about most of its column families

2012-08-23 Thread Edward Sargisson

Hi all,
I was wondering if anybody had seen the following behaviour before and 
how we might detect it and keep the application running.


We have a 6 node cluster. It seems that one of these nodes forgot about 
all but one of the application column families - possibly after a 
restart. Then, when our application connects using Hector, it can't find 
any data so gives back an exception.


I'm currently running nodetool repair on one of the *other* nodes, which 
is taking a very long time to complete (35 minutes and counting, for a load of <9 MB).


The logs from the failing node say:
 INFO [MemoryMeter:1] 2012-08-23 14:59:14,807 Memtable.java (line 213) CFS(Keyspace='system', ColumnFamily='HintsColumnFamily') liveRatio is 1.1219167666485013 (just-counted was 1.0).  calculation took 28ms for 1252 columns
 INFO [main] 2012-08-23 14:59:14,949 CommitLogReplayer.java (line 272) Finished reading /var/lib/cassandra/commitlog/CommitLog-2265496912225824.log
 INFO [main] 2012-08-23 14:59:14,950 CommitLogReplayer.java (line 103) Skipped 8216 mutations from unknown (probably removed) CF with id 1016
 INFO [main] 2012-08-23 14:59:14,950 CommitLogReplayer.java (line 103) Skipped 3013 mutations from unknown (probably removed) CF with id 1017


... and so on.
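
One way to detect this condition after a restart is to scan the log for the "Skipped N mutations" lines. A minimal sketch, assuming the message wording stays as quoted above (it may differ between Cassandra versions); the function name is made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the CommitLogReplayer lines quoted above.
SKIPPED = re.compile(
    r"Skipped (\d+) mutations from unknown \(probably removed\) CF with id (\d+)"
)

def skipped_mutations(log_text):
    """Return {cf_id: total skipped mutations} found in log_text."""
    # Mail clients wrap long log lines, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    totals = Counter()
    for count, cf_id in SKIPPED.findall(flat):
        totals[int(cf_id)] += int(count)
    return dict(totals)
```

A non-empty result right after startup, for column families that were never dropped, would be a reason to alert before the application starts seeing "unconfigured columnfamily" errors.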

Hector is saying:
InvalidRequestException(why:unconfigured columnfamily user_conversations)


Thanks for any comments or advice,
Edward





Re: Node forgets about most of its column families

2012-08-23 Thread Edward Sargisson

Ah, yes, I forgot that bit thanks!

1.1.2 running on Centos.

Running nodetool resetlocalschema then nodetool repair fixed the problem 
but not understanding what happened is a concern.


Cheers,
Edward


On 12-08-23 12:40 PM, Rob Coli wrote:

On Thu, Aug 23, 2012 at 11:47 AM, Edward Sargisson wrote:

I was wondering if anybody had seen the following behaviour before and how
we might detect it and keep the application running.

I don't know the answer to your problem, but anyone who does will want
to know in what version of Cassandra you are encountering this issue.
:)

=Rob







Re: Node forgets about most of its column families

2012-08-24 Thread Edward Sargisson

Sadly, I don't think we can get much.

All I know about the repro is that it was around a node restart. I've 
just tried that and everything's fine. I see no ERROR level messages in 
the logs.


Clearly, some other conditions are required but we don't know them as yet.

Many thanks,
Edward


On 12-08-24 03:29 AM, aaron morton wrote:
If this is still a test environment can you try to reproduce the fault 
? Or provide some more details on the sequence of events?


If you still have the logs around can you see if any ERROR level 
messages were logged?


Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/08/2012, at 8:33 AM, Edward Sargisson <edward.sargis...@globalrelay.net> wrote:



Ah, yes, I forgot that bit thanks!

1.1.2 running on Centos.

Running nodetool resetlocalschema then nodetool repair fixed the 
problem but not understanding what happened is a concern.


Cheers,
Edward


On 12-08-23 12:40 PM, Rob Coli wrote:

On Thu, Aug 23, 2012 at 11:47 AM, Edward Sargisson wrote:

I was wondering if anybody had seen the following behaviour before and how
we might detect it and keep the application running.

I don't know the answer to your problem, but anyone who does will want
to know in what version of Cassandra you are encountering this issue.
:)

=Rob













Automating nodetool repair

2012-08-27 Thread Edward Sargisson

Hi all,
So nodetool repair has to be run regularly on all nodes. Does anybody 
have any interesting strategies or tools for doing this or is everybody 
just setting up cron to do it?


For example, one could write some Puppet code to splay the cron times 
around so that only one should be running at once.
Or, perhaps, a central orchestrator that is given some known quiet time 
and works its way through the list, running nodetool repair one at a 
time (using RPC?) until it runs out of time.
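
The central-orchestrator idea could look roughly like this. It is a sketch under assumptions: nodetool is invoked as `nodetool -h <node> repair` from one box, its exit status is used as the success signal (which may not catch every failure), and a repair that starts just before the window closes is allowed to run past it.

```python
import subprocess
import time

def nodetool_repair(node):
    """Run a blocking repair against one node; True if nodetool exited with 0."""
    return subprocess.run(["nodetool", "-h", node, "repair"]).returncode == 0

def repair_in_window(nodes, window_seconds, repair=nodetool_repair, clock=time.monotonic):
    """Repair nodes one at a time until the quiet window closes.

    Returns (done, failed, not_started) so the next window can resume
    from not_started.
    """
    deadline = clock() + window_seconds
    done, failed = [], []
    pending = list(nodes)
    while pending and clock() < deadline:
        node = pending.pop(0)
        (done if repair(node) else failed).append(node)
    return done, failed, pending
```

Persisting `not_started` between runs would let the next quiet period pick up where this one stopped, so every node is still covered over time.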


Cheers,
Edward




Re: Node forgets about most of its column families

2012-08-28 Thread Edward Sargisson

For the record, we just had a recurrence of this.
This time, when the node (#5) came back it didn't properly rejoin the ring.
We stopped every node and brought them back one by one to get the ring 
to link up correctly.

Then, all the even nodes (#2, #4, #6) had out-of-date schemas.

nodetool resetlocalschema works.
But the following nodetool repair crashes. It has to be stopped and then 
re-started.


Are there any suggestions for logging or similar so that we can get a 
clue next time this happens?


Cheers,
Edward


On 12-08-24 11:18 AM, Edward Sargisson wrote:

Sadly, I don't think we can get much.

All I know about the repro is that it was around a node restart. I've 
just tried that and everything's fine. I see no ERROR level messages 
in the logs.


Clearly, some other conditions are required but we don't know them as yet.

Many thanks,
Edward


On 12-08-24 03:29 AM, aaron morton wrote:
If this is still a test environment can you try to reproduce the 
fault ? Or provide some more details on the sequence of events?


If you still have the logs around can you see if any ERROR level 
messages were logged?


Cheers

-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 24/08/2012, at 8:33 AM, Edward Sargisson <edward.sargis...@globalrelay.net> wrote:



Ah, yes, I forgot that bit thanks!

1.1.2 running on Centos.

Running nodetool resetlocalschema then nodetool repair fixed the 
problem but not understanding what happened is a concern.


Cheers,
Edward


On 12-08-23 12:40 PM, Rob Coli wrote:

On Thu, Aug 23, 2012 at 11:47 AM, Edward Sargisson wrote:

I was wondering if anybody had seen the following behaviour before and how
we might detect it and keep the application running.

I don't know the answer to your problem, but anyone who does will want
to know in what version of Cassandra you are encountering this issue.
:)

=Rob

















Re: Automating nodetool repair

2012-08-28 Thread Edward Sargisson

Thanks, a very nice approach.

If every nodetool repair uses -pr, does that satisfy the requirement to 
run a repair before GCGraceSeconds expires? In other words, will we get a 
correct result using -pr everywhere?


Secondly, what's the need for sleep 120?

Cheers,
Edward

On 12-08-28 07:03 AM, Edward Capriolo wrote:

You can consider adding -pr when iterating through all your hosts
like this. -pr means primary range, and will do less duplicated work.

On Mon, Aug 27, 2012 at 8:05 PM, Aaron Turner  wrote:

I use cron.  On one box I just do:

for n in node1 node2 node3 node4 ; do
    nodetool -h $n repair
    sleep 120
done

A lot easier than managing a bunch of individual crontabs IMHO,
although I suppose I could have done it with puppet, but then you always
have to keep an eye out that your repairs don't overlap over time.
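
A minimal sketch of the same loop with the -pr suggestion from the reply above folded in. It is written in Python so that failures can be collected; the function name, the parameters and the injectable `run`/`sleep` hooks are illustrative assumptions, not anything from the thread. Because -pr repairs only each node's primary range, the loop has to visit every node in the ring.

```python
import subprocess
import time


def repair_all(hosts, pause_seconds=120, run=subprocess.call, sleep=time.sleep):
    """Run `nodetool -h <host> repair -pr` on each host, one at a time.

    Returns the list of hosts whose repair exited with a non-zero status.
    `run` and `sleep` are injectable so the loop can be exercised without
    a real cluster.
    """
    failed = []
    for host in hosts:
        # -pr restricts the repair to the node's primary range.
        status = run(["nodetool", "-h", host, "repair", "-pr"])
        if status != 0:
            failed.append(host)
        # Same fixed pause between nodes as the shell loop above.
        sleep(pause_seconds)
    return failed
```

Called from a single cron entry, for example `repair_all(["node1", "node2", "node3", "node4"])`, this keeps the repairs strictly sequential, so they cannot overlap.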

On Mon, Aug 27, 2012 at 4:52 PM, Edward Sargisson wrote:

Hi all,
So nodetool repair has to be run regularly on all nodes. Does anybody have
any interesting strategies or tools for doing this or is everybody just
setting up cron to do it?

For example, one could write some Puppet code to splay the cron times around
so that only one should be running at once.
Or, perhaps, a central orchestrator that is given some known quiet time and
works its way through the list, running nodetool repair one at a time (using
RPC?) until it runs out of time.
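
The central-orchestrator idea could be sketched roughly as follows. This is only an illustration under stated assumptions: it shells out to nodetool rather than using RPC, the function and parameter names are hypothetical, and a repair that is already running is not interrupted when the window closes, so the last one may overrun.

```python
import subprocess
import time


def repair_within_window(hosts, window_seconds, run=subprocess.call,
                         clock=time.monotonic):
    """Repair hosts one at a time until the quiet window is used up.

    Returns the hosts that were not started, so that the next quiet
    window can resume with them.
    """
    deadline = clock() + window_seconds
    remaining = list(hosts)
    while remaining and clock() < deadline:
        host = remaining.pop(0)
        # Blocks until nodetool returns for this host.
        run(["nodetool", "-h", host, "repair"])
    return remaining
```

Persisting the returned list between runs would let the orchestrator work its way around the ring across several quiet periods.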

Cheers,
Edward



--
Aaron Turner
http://synfin.net/ Twitter: @synfinatic
http://tcpreplay.synfin.net/ - Pcap editing and replay tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
 -- Benjamin Franklin
"carpe diem quam minimum credula postero"




Re: Node forgets about most of its column families

2012-08-29 Thread Edward Sargisson

Hi Aaron,
Thanks for the reply. I've recorded what we know at 
https://issues.apache.org/jira/browse/CASSANDRA-4583.
This includes log snippets from two of the nodes from around the time. I 
don't know what is relevant so they've got everything that was in the 
system log at the time of the failure and recovery.


Nodetool "crashed" by not returning: nothing appeared in the logs, and 
nodetool compactionstats and nodetool netstats indicated that 
nothing was happening.


Thanks for your time looking at this.

Cheers,
Edward


On 12-08-29 02:44 AM, aaron morton wrote:
But the following nodetool repair crashes. It has to be stopped and 
then re-started.

How did it crash ?

Are there any suggestions for logging or similar so that we can get a 
clue next time this happens?

Can you make the logs from #5 available?

If you feel you can describe the situation please create a ticket on 
https://issues.apache.org/jira/browse/CASSANDRA


Cheers


-
Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 29/08/2012, at 8:38 AM, Edward Sargisson 
<edward.sargis...@globalrelay.net> wrote:



For the record, we just had a recurrence of this.
This time, when the node (#5) came back it didn't properly rejoin the 
ring.
We stopped every node and brought them back one by one to get the 
ring to link up correctly.

Then, all the even nodes (#2, #4, #6) had out-of-date schemas.

nodetool resetlocalschema works.
But the following nodetool repair crashes. It has to be stopped and 
then re-started.


Are there any suggestions for logging or similar so that we can get a 
clue next time this happens?


Cheers,
Edward


On 12-08-24 11:18 AM, Edward Sargisson wrote:

Sadly, I don't think we can get much.

All I know about the repro is that it was around a node restart. 
I've just tried that and everything's fine. I see no ERROR level 
messages in the logs.


Clearly, some other conditions are required but we don't know them 
as yet.


Many thanks,
Edward


On 12-08-24 03:29 AM, aaron morton wrote:
If this is still a test environment can you try to reproduce the 
fault ? Or provide some more details on the sequence of events?


If you still have the logs around can you see if any ERROR level 
messages were logged?


Cheers


On 24/08/2012, at 8:33 AM, Edward Sargisson 
<edward.sargis...@globalrelay.net> wrote:



Ah, yes, I forgot that bit, thanks!

1.1.2 running on CentOS.

Running nodetool resetlocalschema then nodetool repair fixed the 
problem, but not understanding what happened is a concern.


Cheers,
Edward


On 12-08-23 12:40 PM, Rob Coli wrote:

On Thu, Aug 23, 2012 at 11:47 AM, Edward Sargisson wrote:

I was wondering if anybody had seen the following behaviour before and how
we might detect it and keep the application running.

I don't know the answer to your problem, but anyone who does will want
to know in what version of Cassandra you are encountering this issue.
:)

=Rob




Re: Schema Disagreement after migration from 1.0.6 to 1.1.4

2012-09-05 Thread Edward Sargisson

I would try nodetool resetlocalschema.


On 12-09-05 07:08 AM, Martin Koch wrote:

Hi list

We have a 5-node Cassandra cluster with a single 1.0.9 installation 
and four 1.0.6 installations.


We have tried installing 1.1.4 on one of the 1.0.6 nodes (following 
the instructions on http://www.datastax.com/docs/1.1/install/upgrading).


After bringing up 1.1.4 there are no errors in the log, but the 
cluster now suffers from schema disagreement:


[default@unknown] describe cluster;
Cluster Information:
   Snitch: org.apache.cassandra.locator.SimpleSnitch
   Partitioner: org.apache.cassandra.dht.RandomPartitioner
   Schema versions:
      59adb24e-f3cd-3e02-97f0-5b395827453f: [10.10.29.67] <- the new 1.1.4 node
      943fc0a0-f678-11e1--339cf8a6c1bf: [10.10.87.228, 10.10.153.45, 10.10.145.90, 10.38.127.80] <- nodes in the old cluster
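
For anyone wanting to detect this state automatically, here is a rough sketch that extracts the schema versions from text shaped like the describe cluster output above. The function names are hypothetical and the parsing is based only on the output shown in this message, so other CLI versions may need a different pattern.

```python
import re


def schema_versions(describe_output):
    """Map each schema version id to the list of hosts reporting it."""
    versions = {}
    # A version line looks like "<uuid-like id>: [host, host, ...]";
    # the host list may wrap onto the next line.
    pattern = r"([0-9a-fA-F-]{8,}):\s*\[([^\]]*)\]"
    for version, hosts in re.findall(pattern, describe_output):
        versions[version] = [h.strip() for h in hosts.split(",") if h.strip()]
    return versions


def in_agreement(describe_output):
    """True when exactly one schema version is reported."""
    return len(schema_versions(describe_output)) == 1
```

With the output above, `schema_versions` would return two entries, one for the upgraded node and one for the four older nodes, and `in_agreement` would return False.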


The recipe for recovering from schema disagreement 
(http://wiki.apache.org/cassandra/FAQ#schema_disagreement) doesn't 
cover the new directory layout. The system/Schema directory is empty 
save for a snapshots subdirectory. system/schema_columnfamilies and 
system/schema_keyspaces contain some files. As described in DataStax's 
instructions, we tried running nodetool upgradesstables. When this had 
finished, describe schema in the CLI showed a schema definition which 
seemed correct, but was in fact different from the schema on the other 
nodes in the cluster.


Any clues on how we should proceed?

Thanks,
/Martin Koch





How to replace a dead *seed* node while keeping quorum

2012-09-11 Thread Edward Sargisson

Hi all,
We just ran into an interesting and unexpected situation with restarting 
a downed node.


If the downed node is a seed node then neither of the replace-a-dead-node 
procedures works (-Dcassandra.replace_token and taking 
initial_token-1). The ring remains split.
The host is listed as a seed in the config for the other members of the 
ring. If we rename the host then it will rejoin the ring.
In other words, if the host name is on the seeds list then it appears 
that the rest of the ring refuses to bootstrap it.


This leads to a problem: If the node needs to be taken out of the seeds 
list on every working node then that requires a restart of each node - 
which means that, for short periods, the ring is missing 2 nodes and a 
quorum read or write (RF=3) will fail.
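
The arithmetic behind that concern, as a small sketch. Cassandra computes a quorum as floor(RF / 2) + 1, so with RF=3 two replicas must respond. The function names are illustrative only, and a request is affected only when both missing nodes are replicas for the key in question.

```python
def quorum(replication_factor):
    """Number of replicas that must respond for QUORUM."""
    return replication_factor // 2 + 1


def quorum_available(replication_factor, replicas_down):
    """True if enough replicas of a key are still up to satisfy QUORUM."""
    return replication_factor - replicas_down >= quorum(replication_factor)
```

For RF=3, `quorum_available(3, 1)` is True and `quorum_available(3, 2)` is False, which is why restarting a second node while the seed is still down is a problem.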


Are there any useful tricks for restarting the node with the same 
hostname or are we expected to rename the node?


Cheers,
Edward



Re: How to replace a dead *seed* node while keeping quorum

2012-09-12 Thread Edward Sargisson

I'm reposting my colleague's reply to Rob to the list (with James' 
permission) in case others are interested.


I'll add to James' post below to say I don't believe we saw the message 
that that slice of code would have printed.


"

Hey Rob,

Ed's AWOL right now and I'm not on u@c.a.o, but I can tell you that when
I removed the downed seed node from its own list of seed nodes in
cassandra.yaml, it didn't join the existing ring, nor did it get any
schemas or data from the existing ring; it felt like timeouts were
happening. (IANA Cassandra wizard, so excuse my terminology impedance.)

After changing the machine's hostname and giving it a new IP, it behaved
as expected: it joined the ring and synced both schema and associated data.

Downed node is 1.1.4, the rest of the ring is 1.1.2.

I'm in a situation where I can revert the IP/hostname change and retry
the scenario as needed if you've got any ideas.

HTH,

   James"


Cheers,
Edward

On 12-09-12 03:53 PM, Rob Coli wrote:

On Tue, Sep 11, 2012 at 4:21 PM, Edward Sargisson wrote:

If the downed node is a seed node then neither of the replace a dead node
procedures work (-Dcassandra.replace_token and taking initial_token-1). The
ring remains split.
[...]
In other words, if the host name is on the seeds list then it appears that
the rest of the ring refuses to bootstrap it.

Close, but not exactly...

"./src/java/org/apache/cassandra/service/StorageService.java" line 559 of 3090
"
if (DatabaseDescriptor.isAutoBootstrap()
        && DatabaseDescriptor.getSeeds().contains(FBUtilities.getBroadcastAddress())
        && !SystemTable.isBootstrapped())
    logger_.info("This node will not auto bootstrap because it is configured to be a seed node.");
"

getSeeds asks your seed provider for a list of seeds. If you are using
the SimpleSeedProvider, this basically turns the list from "seeds" in
cassandra.yaml on the local node into a list of hosts.

So it isn't that the other nodes have this node in their seed list;
it's that the node you are replacing has itself in its own seed list,
and shouldn't. I understand that it can be tricky in conf management
tools to make seed nodes' seed lists not contain themselves, but I
believe it is currently necessary in this case.
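
One way a configuration-management template could produce per-node seed lists that leave out the node itself is sketched below. The function name and arguments are hypothetical, and it compares entries as plain strings, whereas Cassandra resolves the seeds entries to addresses, so hostnames and IPs should not be mixed.

```python
def seeds_for_node(cluster_seeds, own_address):
    """Return the comma-separated seeds value for one node's cassandra.yaml,
    leaving out the node's own address."""
    others = [seed for seed in cluster_seeds if seed != own_address]
    return ",".join(others)
```

For example, `seeds_for_node(["10.0.0.1", "10.0.0.2", "10.0.0.3"], "10.0.0.2")` gives `"10.0.0.1,10.0.0.3"`. A node that is the only seed would end up with an empty value, so that case needs separate handling.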

FWIW, it's unclear to me (and to Aaron Morton, whose curiosity was
apparently equally piqued and who is looking into it further) why
exactly seed nodes shouldn't bootstrap. It's possible that they only
shouldn't bootstrap without being in "hibernate" mode, and that the
code just hasn't been re-written post replace_token/hibernate to say
that it's ok for seed nodes to bootstrap as long as they hibernate...

=Rob


