Re: Exception in Hadoop Word Count sample

2011-09-15 Thread Tharindu Mathew
Yes. That's the problem. Thanks Jonathan.

I'm actually using trunk against a 0.7 cluster. How can I generate the
distribution from trunk?

Forgive my ignorance; I'm more used to Maven.

On Thu, Sep 15, 2011 at 1:08 AM, Jonathan Ellis jbel...@gmail.com wrote:

 You're using a 0.8 wordcount against a 0.7 Cassandra?

 On Wed, Sep 14, 2011 at 2:19 PM, Tharindu Mathew mcclou...@gmail.com
 wrote:
  I see $subject. Can anyone help me to rectify this?
  Stacktrace:
  Exception in thread main org.apache.thrift.TApplicationException:
 Required
  field 'replication_factor' was not found in serialized data! Struct:
  KsDef(name:wordcount,
  strategy_class:org.apache.cassandra.locator.SimpleStrategy,
  strategy_options:{replication_factor=1}, replication_factor:0,
  cf_defs:[CfDef(keyspace:wordcount, name:input_words,
 column_type:Standard,
  comparator_type:AsciiType, default_validation_class:AsciiType),
  CfDef(keyspace:wordcount, name:output_words, column_type:Standard,
  comparator_type:AsciiType, default_validation_class:AsciiType),
  CfDef(keyspace:wordcount, name:input_words_count, column_type:Standard,
  comparator_type:UTF8Type, default_validation_class:CounterColumnType)])
  at
 
 org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
  at
 
 org.apache.cassandra.thrift.Cassandra$Client.recv_system_add_keyspace(Cassandra.java:1531)
  at
 
 org.apache.cassandra.thrift.Cassandra$Client.system_add_keyspace(Cassandra.java:1514)
  at WordCountSetup.setupKeyspace(Unknown Source)
  at WordCountSetup.main(Unknown Source)
  --
  Regards,
 
  Tharindu
  blog: http://mackiemathew.com/
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com




-- 
Regards,

Tharindu

blog: http://mackiemathew.com/
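
For reference, the exception above is the Thrift-level symptom of the version
mismatch: the 0.7 KsDef struct has a required top-level replication_factor
field, while the 0.8 word count sample only sets the factor inside
strategy_options (the dump shows strategy_options:{replication_factor=1} but
replication_factor:0), so a 0.7 node refuses to deserialize the struct. Below
is a rough sketch of a keyspace definition that sets the factor in both
places, assuming the deprecated setter is still present in the 0.8-generated
Thrift classes, as the struct dump suggests:

    import java.util.Arrays;
    import org.apache.cassandra.thrift.CfDef;
    import org.apache.cassandra.thrift.KsDef;

    public class KsDefCompat
    {
        public static KsDef wordCountKeyspace()
        {
            CfDef input = new CfDef("wordcount", "input_words");
            KsDef ks = new KsDef("wordcount",
                                 "org.apache.cassandra.locator.SimpleStrategy",
                                 Arrays.asList(input));
            // 0.8 style: the replication factor lives in strategy_options ...
            ks.putToStrategy_options("replication_factor", "1");
            // ... but a 0.7 server still requires the top-level field, so set it too.
            ks.setReplication_factor(1);
            return ks;
        }
    }

(The real fix, as noted in the thread, is simply to run a word count built
from the same version as the cluster.)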


Re: Exception in Hadoop Word Count sample

2011-09-15 Thread Tharindu Mathew
Found it. 'ant artifacts'

On Thu, Sep 15, 2011 at 12:02 PM, Tharindu Mathew mcclou...@gmail.com wrote:

 Yes. That's the problem. Thanks Jonathan.

 I'm actually using trunk against a 0.7. How can I generate the distro in
 trunk?

 Forgive my ignorance, I'm more used to maven.


 On Thu, Sep 15, 2011 at 1:08 AM, Jonathan Ellis jbel...@gmail.com wrote:

 You're using a 0.8 wordcount against a 0.7 Cassandra?

 On Wed, Sep 14, 2011 at 2:19 PM, Tharindu Mathew mcclou...@gmail.com
 wrote:
  I see $subject. Can anyone help me to rectify this?
  Stacktrace:
  Exception in thread main org.apache.thrift.TApplicationException:
 Required
  field 'replication_factor' was not found in serialized data! Struct:
  KsDef(name:wordcount,
  strategy_class:org.apache.cassandra.locator.SimpleStrategy,
  strategy_options:{replication_factor=1}, replication_factor:0,
  cf_defs:[CfDef(keyspace:wordcount, name:input_words,
 column_type:Standard,
  comparator_type:AsciiType, default_validation_class:AsciiType),
  CfDef(keyspace:wordcount, name:output_words, column_type:Standard,
  comparator_type:AsciiType, default_validation_class:AsciiType),
  CfDef(keyspace:wordcount, name:input_words_count, column_type:Standard,
  comparator_type:UTF8Type, default_validation_class:CounterColumnType)])
  at
 
 org.apache.thrift.TApplicationException.read(TApplicationException.java:108)
  at
 
 org.apache.cassandra.thrift.Cassandra$Client.recv_system_add_keyspace(Cassandra.java:1531)
  at
 
 org.apache.cassandra.thrift.Cassandra$Client.system_add_keyspace(Cassandra.java:1514)
  at WordCountSetup.setupKeyspace(Unknown Source)
  at WordCountSetup.main(Unknown Source)
  --
  Regards,
 
  Tharindu
  blog: http://mackiemathew.com/
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com




 --
 Regards,

 Tharindu

 blog: http://mackiemathew.com/




-- 
Regards,

Tharindu

blog: http://mackiemathew.com/


RE: Get CL ONE / NTS

2011-09-15 Thread Pierre Chalamet
I do not agree here. In my case I trade consistency (it's more a data miss
than an inconsistency here) for performance.

I'm okay with handling the Spanish Inquisition popping up in the current DC
by triggering a new read with a stronger CL somewhere else (for example in
the other DCs).

If the data is nowhere to be found or nothing is reachable, well, sad but
true: that's the end of the game. Fine.

 

What I'm missing is a clear description of CL.ONE's behavior. I'm unsure
which nodes a ONE read uses and how missing data and errors are filtered out.
I've landed in ReadCallback.java, but the error handling is beyond my reach
for the moment.

 

Thanks,

- Pierre
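
The fallback described above can be sketched directly against the raw Thrift
client (the same generated Cassandra.Client that the stack traces elsewhere in
this digest come from). This is only a sketch: connection setup,
UnavailableException/TimedOutException handling, and the choice of EACH_QUORUM
as the retry level (which simply follows Aaron's suggestion further down) are
all simplifications:

    import java.nio.ByteBuffer;
    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ColumnOrSuperColumn;
    import org.apache.cassandra.thrift.ColumnPath;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.NotFoundException;

    public class FallbackRead
    {
        public static ColumnOrSuperColumn read(Cassandra.Client client,
                                               ByteBuffer key,
                                               ColumnPath path) throws Exception
        {
            try
            {
                // Fast path: ask the closest replica only.
                return client.get(key, path, ConsistencyLevel.ONE);
            }
            catch (NotFoundException e)
            {
                // The data may simply not have replicated to this DC yet;
                // retry with a quorum in every DC before declaring a real miss.
                return client.get(key, path, ConsistencyLevel.EACH_QUORUM);
            }
        }
    }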

 

From: aaron morton [mailto:aa...@thelastpickle.com] 
Sent: Thursday, September 15, 2011 12:27 AM
To: user@cassandra.apache.org
Subject: Re: Get CL ONE / NTS

 

Are you advising that CL.ONE is not worth it when considering
read performance?

Consistency is not performance; it's a whole new thing to tune in your
application. If you have performance issues, deal with those as performance
issues: better code, data model, or hardware.

 

By the way, I do not have a consistency problem at all - data is only written
once

Nobody expects a consistency problem. Its chief weapon is surprise.
Surprise and fear. Its two weapons are fear and surprise. And so forth:
http://www.youtube.com/watch?v=Ixgc_FGam3s

 

If you write at LOCAL_QUORUM in DC 1 and DC 2 is down at the start of the
request, a hint will be stored in DC 1. Some time later, when DC 2 comes back,
that hint will be sent to DC 2. If in the meantime you read from DC 2 at CL
ONE, you will not get that change. With Read Repair enabled it will repair in
the background, and you may get a different response on the next read (I'm
guessing here; I cannot remember exactly how RR works cross-DC).

 

 Cheers

 

 

 

-

Aaron Morton

Freelance Cassandra Developer

@aaronmorton

http://www.thelastpickle.com

 

On 15/09/2011, at 10:07 AM, Pierre Chalamet wrote:





Thanks Aaron, I hadn't seen your answer before sending mine.

I do agree on 2/: I might get read errors. Good suggestion to use
EACH_QUORUM - it could be a good trade-off to read at this level if ONE
fails.

Maybe using LOCAL_QUORUM is a good answer and will avoid headaches
after all. Are you advising that CL.ONE is not worth it when considering
read performance?

By the way, I do not have a consistency problem at all - data is only written
once (and if written more than once, it is always the same data) and read
several times across DCs. I only have replication problems. That's why I'm
more inclined to use CL.ONE for reads if possible.

Thanks,
- Pierre


-Original Message-
From: aaron morton [mailto:aa...@thelastpickle.com] 
Sent: Wednesday, September 14, 2011 11:48 PM
To: user@cassandra.apache.org; pie...@chalamet.net
Subject: Re: Get CL ONE / NTS

Your current approach to Consistency opens the door to some inconsistent
behavior. 




1/ Will I have an error because DC2 does not have any copy of the data?

If you read from DC2 at CL ONE and the data is not replicated it will not be
returned. 




2/ Will Cassandra try to get the data from DC1 if nothing is found in DC2?

Not at CL ONE. If you used CL EACH_QUORUM then the read will go to all the
DCs. If DC2 is behind DC1 then you will get the data from DC1.




3/ In case of partial replication to DC2, will I sometimes see errors
about servers not holding the data in DC2?

Depending on the API call and the client, working at CL ONE, you will see
either errors or missing data.




4/ Does a get at CL ONE fail as soon as the fastest server to answer says it
does not have the data, or does it wait until all servers say they do not
have the data?

Yes.

Consider:

- Using LOCAL_QUORUM for both writes and reads will make things a bit more
consistent without adding inter-DC overhead to the request latency. It is
still possible to not get data in DC2 if it is totally disconnected from DC1.

- Writing at LOCAL_QUORUM and reading at EACH_QUORUM means you can always
read, but requests in DC2 will fail if DC1 is not reachable.

Hope that helps. 


-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 15/09/2011, at 1:33 AM, Pierre Chalamet wrote:




Hello,

 

I have 2 datacenters. Cassandra is configured as follow:

- RackInferringSnitch

- NetworkTopologyStrategy for CF

- strategy_options: DC1:3 DC2:3

 

Data are written using CL LOCAL_QUORUM, so data written from one datacenter
will eventually be replicated to the other datacenter. Data is always
written exactly once.



 

On the other side, I'd like to improve the read path. I'm actually using
CL ONE, since data is only written once (i.e. the timestamp is more or less
meaningless in my case).



 

This is where I have some doubts: if data is written on DC1 and

tentatively read from DC2 while the data is still not replicated or
partially replicated (for whatever good reason since replication is async),

how did hprof file generated?

2011-09-15 Thread Yan Chunlu
On one of my nodes, I found many hprof files in the Cassandra installation
directory; they are using as much as 200 GB of disk space. The other nodes
didn't have those files.

It turns out those files are used for memory analysis, but I'm not sure how
they were generated.


like these:

java_pid10626.hprof  java_pid13898.hprof  java_pid17061.hprof
 java_pid21002.hprof  java_pid23194.hprof  java_pid29241.hprof
 java_pid5013.hprof


Re: how did hprof file generated?

2011-09-15 Thread Peter Schuller
 in one of my node, I found many hprof files in the cassandra installation
 directory, they are using as much as 200GB disk space.  other nodes didn't
 have those files.
 turns out that those files are used for memory analyzing, not sure how they
 are generated?

You're probably getting OutOfMemory exceptions. Cassandra by default
runs with -XX:+HeapDumpOnOutOfMemoryError, which writes a heap dump (.hprof)
whenever the JVM runs out of memory. If this is the case, you probably need
to increase your heap size or adjust Cassandra settings.

-- 
/ Peter Schuller (@scode on twitter)
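
If it helps, whether the dump-on-OOM option is active (and where the dumps
land) can be checked with the standard HotSpot diagnostic MXBean. This is a
plain-JDK sketch with no Cassandra-specific assumptions; it reports on the JVM
it runs in, so it would need to run inside (or be adapted to connect over JMX
to) the Cassandra process itself:

    import java.lang.management.ManagementFactory;
    import com.sun.management.HotSpotDiagnosticMXBean;

    public class HeapDumpSettings
    {
        public static void main(String[] args) throws Exception
        {
            // Attach to the HotSpot diagnostic MXBean of this JVM.
            HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                    ManagementFactory.getPlatformMBeanServer(),
                    "com.sun.management:type=HotSpotDiagnostic",
                    HotSpotDiagnosticMXBean.class);

            // "true" means a .hprof file is written on OutOfMemoryError;
            // an empty HeapDumpPath means the JVM's working directory.
            System.out.println(diag.getVMOption("HeapDumpOnOutOfMemoryError").getValue());
            System.out.println(diag.getVMOption("HeapDumpPath").getValue());
        }
    }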


Re: Exception in Hadoop Word Count sample

2011-09-15 Thread Tharindu Mathew
Now I get this,

Any help would be greatly appreciated.

./bin/word_count
11/09/15 12:28:28 INFO WordCount: output reducer type: cassandra
11/09/15 12:28:29 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
11/09/15 12:28:30 INFO mapred.JobClient: Running job: job_local_0001
11/09/15 12:28:30 INFO mapred.MapTask: io.sort.mb = 100
11/09/15 12:28:30 INFO mapred.MapTask: data buffer = 79691776/99614720
11/09/15 12:28:30 INFO mapred.MapTask: record buffer = 262144/327680
11/09/15 12:28:30 WARN mapred.LocalJobRunner: job_local_0001
java.lang.RuntimeException: java.lang.UnsupportedOperationException: no
local connection available
at
org.apache.cassandra.hadoop.ColumnFamilyRecordReader.initialize(ColumnFamilyRecordReader.java:132)
at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:418)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:620)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.UnsupportedOperationException: no local connection
available
at
org.apache.cassandra.hadoop.ColumnFamilyRecordReader.getLocation(ColumnFamilyRecordReader.java:176)
at
org.apache.cassandra.hadoop.ColumnFamilyRecordReader.initialize(ColumnFamilyRecordReader.java:113)
... 4 more
11/09/15 12:28:31 INFO mapred.JobClient:  map 0% reduce 0%
11/09/15 12:28:31 INFO mapred.JobClient: Job complete: job_local_0001
11/09/15 12:28:31 INFO mapred.JobClient: Counters: 0
11/09/15 12:28:31 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with
processName=JobTracker, sessionId= - already initialized
11/09/15 12:28:32 INFO mapred.JobClient: Running job: job_local_0002
11/09/15 12:28:32 INFO mapred.MapTask: io.sort.mb = 100
11/09/15 12:28:32 INFO mapred.MapTask: data buffer = 79691776/99614720
11/09/15 12:28:32 INFO mapred.MapTask: record buffer = 262144/327680
11/09/15 12:28:32 WARN mapred.LocalJobRunner: job_local_0002
java.lang.RuntimeException: java.lang.UnsupportedOperationException: no
local connection available
at
org.apache.cassandra.hadoop.ColumnFamilyRecordReader.initialize(ColumnFamilyRecordReader.java:132)
at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:418)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:620)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.UnsupportedOperationException: no local connection
available
at
org.apache.cassandra.hadoop.ColumnFamilyRecordReader.getLocation(ColumnFamilyRecordReader.java:176)
at
org.apache.cassandra.hadoop.ColumnFamilyRecordReader.initialize(ColumnFamilyRecordReader.java:113)
... 4 more
11/09/15 12:28:33 INFO mapred.JobClient:  map 0% reduce 0%
11/09/15 12:28:33 INFO mapred.JobClient: Job complete: job_local_0002
11/09/15 12:28:33 INFO mapred.JobClient: Counters: 0
11/09/15 12:28:33 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with
processName=JobTracker, sessionId= - already initialized
11/09/15 12:28:34 INFO mapred.JobClient: Running job: job_local_0003
11/09/15 12:28:34 INFO mapred.MapTask: io.sort.mb = 100
11/09/15 12:28:34 INFO mapred.MapTask: data buffer = 79691776/99614720
11/09/15 12:28:34 INFO mapred.MapTask: record buffer = 262144/327680
11/09/15 12:28:34 WARN mapred.LocalJobRunner: job_local_0003
java.lang.RuntimeException: java.lang.UnsupportedOperationException: no
local connection available
at
org.apache.cassandra.hadoop.ColumnFamilyRecordReader.initialize(ColumnFamilyRecordReader.java:132)
at
org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.initialize(MapTask.java:418)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:620)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: java.lang.UnsupportedOperationException: no local connection
available
at
org.apache.cassandra.hadoop.ColumnFamilyRecordReader.getLocation(ColumnFamilyRecordReader.java:176)
at
org.apache.cassandra.hadoop.ColumnFamilyRecordReader.initialize(ColumnFamilyRecordReader.java:113)
... 4 more
11/09/15 12:28:35 INFO mapred.JobClient:  map 0% reduce 0%
11/09/15 12:28:35 INFO mapred.JobClient: Job complete: job_local_0003
11/09/15 12:28:35 INFO mapred.JobClient: Counters: 0
11/09/15 12:28:35 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with
processName=JobTracker, sessionId= - already initialized
11/09/15 12:28:36 INFO mapred.JobClient: Running job: job_local_0004
11/09/15 12:28:36 INFO mapred.MapTask: io.sort.mb = 100
11/09/15 12:28:37 INFO mapred.MapTask: data buffer = 79691776/99614720
11/09/15 12:28:37 INFO mapred.MapTask: record buffer = 262144/327680
11/09/15 12:28:37 WARN mapred.LocalJobRunner: job_local_0004
java.lang.RuntimeException: java.lang.UnsupportedOperationException: no
local connection available
at
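
The log above repeats the same failure for each attempted task:
ColumnFamilyRecordReader.getLocation() throws "no local connection available",
which, judging from the stack trace, appears to happen when none of the input
split's replica locations matches an address on the machine running the map
task. For reference, the connection-related settings the word count job passes
through ConfigHelper look roughly like the sketch below; the method names are
quoted from memory for the 0.8-era API and the column name is a placeholder,
so both are best checked against the bundled WordCount source:

    import java.util.Arrays;
    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.cassandra.thrift.SlicePredicate;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class WordCountWiring
    {
        public static Job configure(Configuration conf) throws Exception
        {
            Job job = new Job(conf, "wordcount");
            job.setInputFormatClass(ColumnFamilyInputFormat.class);

            // Where the job connects; "localhost" only helps if a reachable
            // Cassandra thrift endpoint runs on the Hadoop task's host.
            ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
            ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
            ConfigHelper.setPartitioner(job.getConfiguration(),
                                        "org.apache.cassandra.dht.RandomPartitioner");

            // Input keyspace/column family and the columns to read
            // ("text" is a hypothetical column name).
            ConfigHelper.setInputColumnFamily(job.getConfiguration(),
                                              "wordcount", "input_words");
            SlicePredicate predicate = new SlicePredicate()
                    .setColumn_names(Arrays.asList(ByteBufferUtil.bytes("text")));
            ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
            return job;
        }
    }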

Re: how did hprof file generated?

2011-09-15 Thread Yan Chunlu
got it! thanks!

On Thu, Sep 15, 2011 at 4:10 PM, Peter Schuller peter.schul...@infidyne.com
 wrote:

  in one of my node, I found many hprof files in the cassandra installation
  directory, they are using as much as 200GB disk space.  other nodes
 didn't
  have those files.
  turns out that those files are used for memory analyzing, not sure how
 they
  are generated?

 You're probably getting OutOfMemory exceptions. Cassandra by default
 runs with -XX:+HeapDumpOnOutOfMemoryError, which writes a heap dump (.hprof)
 whenever the JVM runs out of memory. If this is the case, you probably need
 to increase your heap size or adjust Cassandra settings.

 --
 / Peter Schuller (@scode on twitter)



New node unable to stream (0.8.5)

2011-09-15 Thread Ethan Rowe
Hi.

We've been running a 7-node cluster with RF 3, QUORUM reads/writes in our
production environment for a few months.  It's been consistently stable
during this period, particularly once we got out maintenance strategy fully
worked out (per node, one repair a week, one major compaction a week, the
latter due to the nature of our data model and usage).  While this cluster
started, back in June or so, on the 0.7 series, it's been running 0.8.3 for
a while now with no issues.  We upgraded to 0.8.5 two days ago, having
tested the upgrade in our staging cluster (with an otherwise identical
configuration) previously and verified that our application's various use
cases appeared successful.

One of our nodes suffered a disk failure yesterday.  We attempted to replace
the dead node by placing a new node at OldNode.initial_token - 1 with
auto_bootstrap on.  A few things went awry from there:

1. We never saw the new node in bootstrap mode; it became available pretty
much immediately upon joining the ring, and never reported a joining
state.  I did verify that auto_bootstrap was on.

2. I mistakenly ran repair on the new node rather than removetoken on the
old node, due to a delightful mental error.  The repair got nowhere fast, as
it attempts to repair against the down node which throws an exception.  So I
interrupted the repair, restarted the node to clear any pending validation
compactions, and...

3. Ran removetoken for the old node.

4. We let this run for some time and saw eventually that all the nodes
appeared to be done with various compactions and were stuck at streaming.  Many
streams listed as open, none making any progress.

5.  I observed an RPC-related exception on the new node (where the
removetoken was launched) and concluded that the streams were broken so the
process wouldn't ever finish.

6. Ran a removetoken force to get the dead node out of the mix.  No
problems.

7. Ran a repair on the new node.

8. Validations ran, streams opened up, and again things got stuck in
streaming, hanging for over an hour with no progress.

9. Musing that lingering tasks from the removetoken could be a factor, I
performed a rolling restart and attempted a repair again.

10. Same problem.  Did another rolling restart and attempted a fresh repair
on the most important column family alone.

11. Same problem.  Streams included CFs not specified, so I guess they must
be for hinted handoff.

In concluding that streaming is stuck, I've observed:
- streams will be open to the new node from other nodes, but the new node
doesn't list them
- streams will be open to the other nodes from the new node, but the other
nodes don't list them
- the streams reported may make some initial progress, but then they hang at
a particular point and do not move on for an hour or more.
- The logs report repair-related activity, until NPEs on incoming TCP
connections show up, which appear likely to be the culprit.

I can provide more exact details when I'm done commuting.

With streaming broken on this node, I'm unable to run repairs, which is
obviously problematic.  The application didn't suffer any operational issues
as a consequence of this, but I need to review the overnight results to
verify we're not suffering data loss (I doubt we are).

At this point, I'm considering a couple options:
1. Remove the new node and let the adjacent node take over its range
2. Bring the new node down, add a new one in front of it, and properly
removetoken the problematic one.
3. Bring the new node down, remove all its data except for the system
keyspace, then bring it back up and repair it.
4. Revert to 0.8.3 and see if that helps.

Recommendations?

Thanks.
- Ethan


Re: New node unable to stream (0.8.5)

2011-09-15 Thread Sylvain Lebresne
On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe et...@the-rowes.com wrote:
 Hi.

 We've been running a 7-node cluster with RF 3, QUORUM reads/writes in our
 production environment for a few months.  It's been consistently stable
 during this period, particularly once we got out maintenance strategy fully
 worked out (per node, one repair a week, one major compaction a week, the
 latter due to the nature of our data model and usage).  While this cluster
 started, back in June or so, on the 0.7 series, it's been running 0.8.3 for
 a while now with no issues.  We upgraded to 0.8.5 two days ago, having
 tested the upgrade in our staging cluster (with an otherwise identical
 configuration) previously and verified that our application's various use
 cases appeared successful.

 One of our nodes suffered a disk failure yesterday.  We attempted to replace
 the dead node by placing a new node at OldNode.initial_token - 1 with
 auto_bootstrap on.  A few things went awry from there:

 1. We never saw the new node in bootstrap mode; it became available pretty
 much immediately upon joining the ring, and never reported a joining
 state.  I did verify that auto_bootstrap was on.

 2. I mistakenly ran repair on the new node rather than removetoken on the
 old node, due to a delightful mental error.  The repair got nowhere fast, as
 it attempts to repair against the down node which throws an exception.  So I
 interrupted the repair, restarted the node to clear any pending validation
 compactions, and...

 3. Ran removetoken for the old node.

 4. We let this run for some time and saw eventually that all the nodes
 appeared to be done various compactions and were stuck at streaming.  Many
 streams listed as open, none making any progress.

 5.  I observed an Rpc-related exception on the new node (where the
 removetoken was launched) and concluded that the streams were broken so the
 process wouldn't ever finish.

 6. Ran a removetoken force to get the dead node out of the mix.  No
 problems.

 7. Ran a repair on the new node.

 8. Validations ran, streams opened up, and again things got stuck in
 streaming, hanging for over an hour with no progress.

 9. Musing that lingering tasks from the removetoken could be a factor, I
 performed a rolling restart and attempted a repair again.

 10. Same problem.  Did another rolling restart and attempted a fresh repair
 on the most important column family alone.

 11. Same problem.  Streams included CFs not specified, so I guess they must
 be for hinted handoff.

 In concluding that streaming is stuck, I've observed:
 - streams will be open to the new node from other nodes, but the new node
 doesn't list them
 - streams will be open to the other nodes from the new node, but the other
 nodes don't list them
 - the streams reported may make some initial progress, but then they hang at
 a particular point and do not move on for an hour or more.
 - The logs report repair-related activity, until NPEs on incoming TCP
 connections show up, which appear likely to be the culprit.

Can you send the stack trace from those NPEs?


 I can provide more exact details when I'm done commuting.

 With streaming broken on this node, I'm unable to run repairs, which is
 obviously problematic.  The application didn't suffer any operational issues
 as a consequence of this, but I need to review the overnight results to
 verify we're not suffering data loss (I doubt we are).

 At this point, I'm considering a couple options:
 1. Remove the new node and let the adjacent node take over its range
 2. Bring the new node down, add a new one in front of it, and properly
 removetoken the problematic one.
 3. Bring the new node down, remove all its data except for the system
 keyspace, then bring it back up and repair it.
 4. Revert to 0.8.3 and see if that helps.

 Recommendations?

 Thanks.
 - Ethan



Re: New node unable to stream (0.8.5)

2011-09-15 Thread Ethan Rowe
Here's a typical log slice (not terribly informative, I fear):

  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106 AntiEntropyService.java (line 884) Performing streaming repair of 1003 ranges with /10.34.90.8 for (29990798416657667504332586989223299634,54296681768153272037430773234349600451]
  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java (line 181) Stream context metadata [/mnt/cassandra/data/events_production/FitsByShip-g-10-Data.db sections=88 progress=0/11707163 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-6-Data.db sections=1 progress=0/6918814 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db sections=260 progress=0/9091780 - 0%], 4 sstables.
  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428 StreamOutSession.java (line 174) Streaming to /10.34.90.8
 ERROR [Thread-56] 2011-09-15 05:41:38,515 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-56,5,main]
 java.lang.NullPointerException
         at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:174)
         at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:114)



Not sure if the exception is related to the outbound streaming above; other
nodes are actively trying to stream to this node, so perhaps it comes from
those, and the temporal adjacency to the outbound stream is just coincidental.
I have other snippets that look basically identical to the above, except that
if I look at the logs of the node this node is trying to stream to, I see that
it has concurrently opened a stream in the other direction, which could be the
one the exception pertains to.


On Thu, Sep 15, 2011 at 7:41 AM, Sylvain Lebresne sylv...@datastax.com wrote:

 On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe et...@the-rowes.com wrote:
  Hi.
 
  We've been running a 7-node cluster with RF 3, QUORUM reads/writes in our
  production environment for a few months.  It's been consistently stable
  during this period, particularly once we got out maintenance strategy
 fully
  worked out (per node, one repair a week, one major compaction a week, the
  latter due to the nature of our data model and usage).  While this
 cluster
  started, back in June or so, on the 0.7 series, it's been running 0.8.3
 for
  a while now with no issues.  We upgraded to 0.8.5 two days ago, having
  tested the upgrade in our staging cluster (with an otherwise identical
  configuration) previously and verified that our application's various use
  cases appeared successful.
 
  One of our nodes suffered a disk failure yesterday.  We attempted to
 replace
  the dead node by placing a new node at OldNode.initial_token - 1 with
  auto_bootstrap on.  A few things went awry from there:
 
  1. We never saw the new node in bootstrap mode; it became available
 pretty
  much immediately upon joining the ring, and never reported a joining
  state.  I did verify that auto_bootstrap was on.
 
  2. I mistakenly ran repair on the new node rather than removetoken on the
  old node, due to a delightful mental error.  The repair got nowhere fast,
 as
  it attempts to repair against the down node which throws an exception.
  So I
  interrupted the repair, restarted the node to clear any pending
 validation
  compactions, and...
 
  3. Ran removetoken for the old node.
 
  4. We let this run for some time and saw eventually that all the nodes
  appeared to be done various compactions and were stuck at streaming.
 Many
  streams listed as open, none making any progress.
 
  5.  I observed an Rpc-related exception on the new node (where the
  removetoken was launched) and concluded that the streams were broken so
 the
  process wouldn't ever finish.
 
  6. Ran a removetoken force to get the dead node out of the mix.  No
  problems.
 
  7. Ran a repair on the new node.
 
  8. Validations ran, streams opened up, and again things got stuck in
  streaming, hanging for over an hour with no progress.
 
  9. Musing that lingering tasks from the removetoken could be a factor, I
  performed a rolling restart and attempted a repair again.
 
  10. Same problem.  Did another rolling restart and attempted a fresh
 repair
  on the most important column family alone.
 
  11. Same problem.  Streams included CFs not specified, so I guess they
 must
  be for hinted handoff.
 
  In concluding that streaming is stuck, I've observed:
  - streams will be open to the new node from other nodes, but the new node
  doesn't list them
  - streams will be open to the other nodes from the new node, but the
 other
  nodes don't list them
  - the streams reported may make some initial progress, but then they hang
 at
  a particular point and do not move on for an hour or more.
  - The logs report repair-related activity, until NPEs on incoming TCP
  connections show up, which appear likely to be the 

Re: New node unable to stream (0.8.5)

2011-09-15 Thread Ethan Rowe
I just noticed the following from one of Jonathan Ellis' messages yesterday:

 Added to NEWS:

- After upgrading, run nodetool scrub against each node before running
  repair, moving nodes, or adding new ones.


We did not do this, as it was not indicated as necessary in NEWS when we
were dealing with the upgrade.

So perhaps I need to scrub everything before going any further, though the
question is what to do with the problematic node.  Additionally, it would be
helpful to know if scrub will affect the hinted handoffs that have
accumulated, as these seem likely to be part of the set of failing streams.

On Thu, Sep 15, 2011 at 8:13 AM, Ethan Rowe et...@the-rowes.com wrote:

 Here's a typical log slice (not terribly informative, I fear):

  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106 AntiEntropyService.java
 (l
 ine 884) Performing streaming repair of 1003 ranges with /10.34.90.8 for
 (299

 90798416657667504332586989223299634,54296681768153272037430773234349600451]
  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java (line
 181)
 Stream context metadata
 [/mnt/cassandra/data/events_production/FitsByShip-g-1
 0-Data.db sections=88 progress=0/11707163 - 0%,
 /mnt/cassandra/data/events_pr
 oduction/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 - 0%,
 /mnt/c
 assandra/data/events_production/FitsByShip-g-6-Data.db sections=1
 progress=0/
 6918814 - 0%,
 /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db s
 ections=260 progress=0/9091780 - 0%], 4 sstables.
  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428 StreamOutSession.java
 (lin
 e 174) Streaming to /10.34.90.8
 ERROR [Thread-56] 2011-09-15 05:41:38,515 AbstractCassandraDaemon.java
 (line
 139) Fatal exception in thread Thread[Thread-56,5,main]
 java.lang.NullPointerException
 at
 org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpC
 onnection.java:174)
 at
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConn
 ection.java:114)



 Not sure if the exception is related to the outbound streaming above; other
 nodes are actively trying to stream to this node, so perhaps it comes from
 those and temporal adjacency to the outbound stream is just coincidental.  I
 have other snippets that look basically identical to the above, except if I
 look at the logs to which this node is trying to stream, I see that it has
 concurrently opened a stream in the other direction, which could be the one
 that the exception pertains to.


 On Thu, Sep 15, 2011 at 7:41 AM, Sylvain Lebresne sylv...@datastax.com wrote:

 On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe et...@the-rowes.com wrote:
  Hi.
 
  We've been running a 7-node cluster with RF 3, QUORUM reads/writes in
 our
  production environment for a few months.  It's been consistently stable
  during this period, particularly once we got out maintenance strategy
 fully
  worked out (per node, one repair a week, one major compaction a week,
 the
  latter due to the nature of our data model and usage).  While this
 cluster
  started, back in June or so, on the 0.7 series, it's been running 0.8.3
 for
  a while now with no issues.  We upgraded to 0.8.5 two days ago, having
  tested the upgrade in our staging cluster (with an otherwise identical
  configuration) previously and verified that our application's various
 use
  cases appeared successful.
 
  One of our nodes suffered a disk failure yesterday.  We attempted to
 replace
  the dead node by placing a new node at OldNode.initial_token - 1 with
  auto_bootstrap on.  A few things went awry from there:
 
  1. We never saw the new node in bootstrap mode; it became available
 pretty
  much immediately upon joining the ring, and never reported a joining
  state.  I did verify that auto_bootstrap was on.
 
  2. I mistakenly ran repair on the new node rather than removetoken on
 the
  old node, due to a delightful mental error.  The repair got nowhere
 fast, as
  it attempts to repair against the down node which throws an exception.
  So I
  interrupted the repair, restarted the node to clear any pending
 validation
  compactions, and...
 
  3. Ran removetoken for the old node.
 
  4. We let this run for some time and saw eventually that all the nodes
  appeared to be done various compactions and were stuck at streaming.
 Many
  streams listed as open, none making any progress.
 
  5.  I observed an Rpc-related exception on the new node (where the
  removetoken was launched) and concluded that the streams were broken so
 the
  process wouldn't ever finish.
 
  6. Ran a removetoken force to get the dead node out of the mix.  No
  problems.
 
  7. Ran a repair on the new node.
 
  8. Validations ran, streams opened up, and again things got stuck in
  streaming, hanging for over an hour with no progress.
 
  9. Musing that lingering tasks from the removetoken could be a factor, I
  performed a rolling restart and attempted a repair again.
 
  10. Same problem.  Did another rolling 

Re: New node unable to stream (0.8.5)

2011-09-15 Thread Ethan Rowe
After further review, I'm definitely going to scrub all the original nodes
in the cluster.

We've lost some data as a result of this situation.  It can be restored, but
the question is what to do with the problematic new node first.  I don't
particularly care about the data that's on it, since I'm going to re-import
the critical data from files anyway, and then I can recreate derivative data
afterwards.  So it's purely a matter of getting the cluster healthy again as
quickly as possible so I can begin that import process.

Any issue with running scrubs on multiple nodes at a time, provided they
aren't replication neighbors?

On Thu, Sep 15, 2011 at 8:18 AM, Ethan Rowe et...@the-rowes.com wrote:

 I just noticed the following from one of Jonathan Ellis' messages
 yesterday:

 Added to NEWS:

- After upgrading, run nodetool scrub against each node before running
  repair, moving nodes, or adding new ones.


 We did not do this, as it was not indicated as necessary in the news when
 we were dealing with the upgrade.

 So perhaps I need to scrub everything before going any further, though the
 question is what to do with the problematic node.  Additionally, it would be
 helpful to know if scrub will affect the hinted handoffs that have
 accumulated, as these seem likely to be part of the set of failing streams.

 On Thu, Sep 15, 2011 at 8:13 AM, Ethan Rowe et...@the-rowes.com wrote:

 Here's a typical log slice (not terribly informative, I fear):

  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106
 AntiEntropyService.java (l
 ine 884) Performing streaming repair of 1003 ranges with /10.34.90.8 for
 (299

 90798416657667504332586989223299634,54296681768153272037430773234349600451]
  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java (line
 181)
 Stream context metadata
 [/mnt/cassandra/data/events_production/FitsByShip-g-1
 0-Data.db sections=88 progress=0/11707163 - 0%,
 /mnt/cassandra/data/events_pr
 oduction/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 - 0%,
 /mnt/c
 assandra/data/events_production/FitsByShip-g-6-Data.db sections=1
 progress=0/
 6918814 - 0%,
 /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db s
 ections=260 progress=0/9091780 - 0%], 4 sstables.
  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428 StreamOutSession.java
 (lin
 e 174) Streaming to /10.34.90.8
 ERROR [Thread-56] 2011-09-15 05:41:38,515 AbstractCassandraDaemon.java
 (line
 139) Fatal exception in thread Thread[Thread-56,5,main]
 java.lang.NullPointerException
 at
 org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpC
 onnection.java:174)
 at
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConn
 ection.java:114)



 Not sure if the exception is related to the outbound streaming above;
 other nodes are actively trying to stream to this node, so perhaps it comes
 from those and temporal adjacency to the outbound stream is just
 coincidental.  I have other snippets that look basically identical to the
 above, except if I look at the logs to which this node is trying to stream,
 I see that it has concurrently opened a stream in the other direction, which
 could be the one that the exception pertains to.


 On Thu, Sep 15, 2011 at 7:41 AM, Sylvain Lebresne 
 sylv...@datastax.com wrote:

 On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe et...@the-rowes.com wrote:
  Hi.
 
  We've been running a 7-node cluster with RF 3, QUORUM reads/writes in
 our
  production environment for a few months.  It's been consistently stable
  during this period, particularly once we got out maintenance strategy
 fully
  worked out (per node, one repair a week, one major compaction a week,
 the
  latter due to the nature of our data model and usage).  While this
 cluster
  started, back in June or so, on the 0.7 series, it's been running 0.8.3
 for
  a while now with no issues.  We upgraded to 0.8.5 two days ago, having
  tested the upgrade in our staging cluster (with an otherwise identical
  configuration) previously and verified that our application's various
 use
  cases appeared successful.
 
  One of our nodes suffered a disk failure yesterday.  We attempted to
 replace
  the dead node by placing a new node at OldNode.initial_token - 1 with
  auto_bootstrap on.  A few things went awry from there:
 
  1. We never saw the new node in bootstrap mode; it became available
 pretty
  much immediately upon joining the ring, and never reported a joining
  state.  I did verify that auto_bootstrap was on.
 
  2. I mistakenly ran repair on the new node rather than removetoken on
 the
  old node, due to a delightful mental error.  The repair got nowhere
 fast, as
  it attempts to repair against the down node which throws an exception.
  So I
  interrupted the repair, restarted the node to clear any pending
 validation
  compactions, and...
 
  3. Ran removetoken for the old node.
 
  4. We let this run for some time and saw eventually that all the nodes
  appeared to be done various 

Re: New node unable to stream (0.8.5)

2011-09-15 Thread Jonathan Ellis
That means we missed a place we needed to special-case for backwards
compatibility -- the workaround is, add an empty encryption_options section
to cassandra.yaml:

encryption_options:
internode_encryption: none
keystore: conf/.keystore
keystore_password: cassandra
truststore: conf/.truststore
truststore_password: cassandra

Created https://issues.apache.org/jira/browse/CASSANDRA-3212 to fix this.

On Thu, Sep 15, 2011 at 7:13 AM, Ethan Rowe et...@the-rowes.com wrote:
 Here's a typical log slice (not terribly informative, I fear):

  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106 AntiEntropyService.java
 (l
 ine 884) Performing streaming repair of 1003 ranges with /10.34.90.8 for
 (299

 90798416657667504332586989223299634,54296681768153272037430773234349600451]
  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java (line
 181)
 Stream context metadata
 [/mnt/cassandra/data/events_production/FitsByShip-g-1
 0-Data.db sections=88 progress=0/11707163 - 0%,
 /mnt/cassandra/data/events_pr
 oduction/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 - 0%,
 /mnt/c
 assandra/data/events_production/FitsByShip-g-6-Data.db sections=1
 progress=0/
 6918814 - 0%,
 /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db s
 ections=260 progress=0/9091780 - 0%], 4 sstables.
  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428 StreamOutSession.java
 (lin
 e 174) Streaming to /10.34.90.8
 ERROR [Thread-56] 2011-09-15 05:41:38,515 AbstractCassandraDaemon.java
 (line
 139) Fatal exception in thread Thread[Thread-56,5,main]
 java.lang.NullPointerException
         at
 org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpC
 onnection.java:174)
         at
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConn
 ection.java:114)

 Not sure if the exception is related to the outbound streaming above; other
 nodes are actively trying to stream to this node, so perhaps it comes from
 those and temporal adjacency to the outbound stream is just coincidental.  I
 have other snippets that look basically identical to the above, except if I
 look at the logs to which this node is trying to stream, I see that it has
 concurrently opened a stream in the other direction, which could be the one
 that the exception pertains to.

 On Thu, Sep 15, 2011 at 7:41 AM, Sylvain Lebresne sylv...@datastax.com
 wrote:

 On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe et...@the-rowes.com wrote:
  Hi.
 
  We've been running a 7-node cluster with RF 3, QUORUM reads/writes in
  our
  production environment for a few months.  It's been consistently stable
  during this period, particularly once we got out maintenance strategy
  fully
  worked out (per node, one repair a week, one major compaction a week,
  the
  latter due to the nature of our data model and usage).  While this
  cluster
  started, back in June or so, on the 0.7 series, it's been running 0.8.3
  for
  a while now with no issues.  We upgraded to 0.8.5 two days ago, having
  tested the upgrade in our staging cluster (with an otherwise identical
  configuration) previously and verified that our application's various
  use
  cases appeared successful.
 
  One of our nodes suffered a disk failure yesterday.  We attempted to
  replace
  the dead node by placing a new node at OldNode.initial_token - 1 with
  auto_bootstrap on.  A few things went awry from there:
 
  1. We never saw the new node in bootstrap mode; it became available
  pretty
  much immediately upon joining the ring, and never reported a joining
  state.  I did verify that auto_bootstrap was on.
 
  2. I mistakenly ran repair on the new node rather than removetoken on
  the
  old node, due to a delightful mental error.  The repair got nowhere
  fast, as
  it attempts to repair against the down node which throws an exception.
   So I
  interrupted the repair, restarted the node to clear any pending
  validation
  compactions, and...
 
  3. Ran removetoken for the old node.
 
  4. We let this run for some time and saw eventually that all the nodes
  appeared to be done various compactions and were stuck at streaming.
  Many
  streams listed as open, none making any progress.
 
  5.  I observed an Rpc-related exception on the new node (where the
  removetoken was launched) and concluded that the streams were broken so
  the
  process wouldn't ever finish.
 
  6. Ran a removetoken force to get the dead node out of the mix.  No
  problems.
 
  7. Ran a repair on the new node.
 
  8. Validations ran, streams opened up, and again things got stuck in
  streaming, hanging for over an hour with no progress.
 
  9. Musing that lingering tasks from the removetoken could be a factor, I
  performed a rolling restart and attempted a repair again.
 
  10. Same problem.  Did another rolling restart and attempted a fresh
  repair
  on the most important column family alone.
 
  11. Same problem.  Streams included CFs not specified, so I guess they
  must
  be for hinted handoff.
 
  

Re: New node unable to stream (0.8.5)

2011-09-15 Thread Jonathan Ellis
Where did the data loss come in?

Scrub is safe to run in parallel.

On Thu, Sep 15, 2011 at 8:08 AM, Ethan Rowe et...@the-rowes.com wrote:
 After further review, I'm definitely going to scrub all the original nodes
 in the cluster.
 We've lost some data as a result of this situation.  It can be restored, but
 the question is what to do with the problematic new node first.  I don't
 particularly care about the data that's on it, since I'm going to re-import
 the critical data from files anyway, and then I can recreate derivative data
 afterwards.  So it's purely a matter of getting the cluster healthy again as
 quickly as possible so I can begin that import process.
 Any issue with running scrubs on multiple nodes at a time, provided they
 aren't replication neighbors?
 On Thu, Sep 15, 2011 at 8:18 AM, Ethan Rowe et...@the-rowes.com wrote:

 I just noticed the following from one of Jonathan Ellis' messages
 yesterday:

 Added to NEWS:

    - After upgrading, run nodetool scrub against each node before running
      repair, moving nodes, or adding new ones.


 We did not do this, as it was not indicated as necessary in the news when
 we were dealing with the upgrade.
 So perhaps I need to scrub everything before going any further, though the
 question is what to do with the problematic node.  Additionally, it would be
 helpful to know if scrub will affect the hinted handoffs that have
 accumulated, as these seem likely to be part of the set of failing streams.
 On Thu, Sep 15, 2011 at 8:13 AM, Ethan Rowe et...@the-rowes.com wrote:

 Here's a typical log slice (not terribly informative, I fear):

  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106
 AntiEntropyService.java (l
 ine 884) Performing streaming repair of 1003 ranges with /10.34.90.8 for
 (299

 90798416657667504332586989223299634,54296681768153272037430773234349600451]
  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java (line
 181)
 Stream context metadata
 [/mnt/cassandra/data/events_production/FitsByShip-g-1
 0-Data.db sections=88 progress=0/11707163 - 0%,
 /mnt/cassandra/data/events_pr
 oduction/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 - 0%,
 /mnt/c
 assandra/data/events_production/FitsByShip-g-6-Data.db sections=1
 progress=0/
 6918814 - 0%,
 /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db s
 ections=260 progress=0/9091780 - 0%], 4 sstables.
  INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428 StreamOutSession.java
 (lin
 e 174) Streaming to /10.34.90.8
 ERROR [Thread-56] 2011-09-15 05:41:38,515 AbstractCassandraDaemon.java
 (line
 139) Fatal exception in thread Thread[Thread-56,5,main]
 java.lang.NullPointerException
         at
 org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpC
 onnection.java:174)
         at
 org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConn
 ection.java:114)

 Not sure if the exception is related to the outbound streaming above;
 other nodes are actively trying to stream to this node, so perhaps it comes
 from those and temporal adjacency to the outbound stream is just
 coincidental.  I have other snippets that look basically identical to the
 above, except if I look at the logs to which this node is trying to stream,
 I see that it has concurrently opened a stream in the other direction, which
 could be the one that the exception pertains to.

 On Thu, Sep 15, 2011 at 7:41 AM, Sylvain Lebresne sylv...@datastax.com
 wrote:

 On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe et...@the-rowes.com wrote:
  Hi.
 
  We've been running a 7-node cluster with RF 3, QUORUM reads/writes in
  our
  production environment for a few months.  It's been consistently
  stable
  during this period, particularly once we got out maintenance strategy
  fully
  worked out (per node, one repair a week, one major compaction a week,
  the
  latter due to the nature of our data model and usage).  While this
  cluster
  started, back in June or so, on the 0.7 series, it's been running
  0.8.3 for
  a while now with no issues.  We upgraded to 0.8.5 two days ago, having
  tested the upgrade in our staging cluster (with an otherwise identical
  configuration) previously and verified that our application's various
  use
  cases appeared successful.
 
  One of our nodes suffered a disk failure yesterday.  We attempted to
  replace
  the dead node by placing a new node at OldNode.initial_token - 1 with
  auto_bootstrap on.  A few things went awry from there:
 
  1. We never saw the new node in bootstrap mode; it became available
  pretty
  much immediately upon joining the ring, and never reported a joining
  state.  I did verify that auto_bootstrap was on.
 
  2. I mistakenly ran repair on the new node rather than removetoken on
  the
  old node, due to a delightful mental error.  The repair got nowhere
  fast, as
  it attempts to repair against the down node which throws an exception.
   So I
  interrupted the repair, restarted the node to clear any pending
  validation
  

Re: New node unable to stream (0.8.5)

2011-09-15 Thread Ethan Rowe
Thanks, Jonathan.  I'll try the workaround and see if that gets the streams
flowing properly.

As I mentioned before, we did not run scrub yet.  What is the consequence of
letting the streams from the hinted handoffs complete if scrub hasn't been
run on these nodes?

I'm currently running scrub on one node to get a sense of the time frame.

Thanks again.
- Ethan

On Thu, Sep 15, 2011 at 9:09 AM, Jonathan Ellis jbel...@gmail.com wrote:

 That means we missed a place we needed to special-case for backwards
 compatibility -- the workaround is, add an empty encryption_options section
 to cassandra.yaml:

 encryption_options:
internode_encryption: none
keystore: conf/.keystore
keystore_password: cassandra
truststore: conf/.truststore
truststore_password: cassandra

 Created https://issues.apache.org/jira/browse/CASSANDRA-3212 to fix this.

 On Thu, Sep 15, 2011 at 7:13 AM, Ethan Rowe et...@the-rowes.com wrote:
  Here's a typical log slice (not terribly informative, I fear):
 
   INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106
 AntiEntropyService.java
  (l
  ine 884) Performing streaming repair of 1003 ranges with /10.34.90.8for
  (299
 
 
 90798416657667504332586989223299634,54296681768153272037430773234349600451]
   INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java (line
  181)
  Stream context metadata
  [/mnt/cassandra/data/events_production/FitsByShip-g-1
  0-Data.db sections=88 progress=0/11707163 - 0%,
  /mnt/cassandra/data/events_pr
  oduction/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 - 0%,
  /mnt/c
  assandra/data/events_production/FitsByShip-g-6-Data.db sections=1
  progress=0/
  6918814 - 0%,
  /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db s
  ections=260 progress=0/9091780 - 0%], 4 sstables.
   INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428 StreamOutSession.java
  (lin
  e 174) Streaming to /10.34.90.8
  ERROR [Thread-56] 2011-09-15 05:41:38,515 AbstractCassandraDaemon.java
  (line
  139) Fatal exception in thread Thread[Thread-56,5,main]
  java.lang.NullPointerException
  at
  org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpC
  onnection.java:174)
  at
  org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConn
  ection.java:114)
 
  Not sure if the exception is related to the outbound streaming above;
 other
  nodes are actively trying to stream to this node, so perhaps it comes
 from
  those and temporal adjacency to the outbound stream is just coincidental.
  I
  have other snippets that look basically identical to the above, except if
 I
  look at the logs to which this node is trying to stream, I see that it
 has
  concurrently opened a stream in the other direction, which could be the
 one
  that the exception pertains to.
 
  On Thu, Sep 15, 2011 at 7:41 AM, Sylvain Lebresne sylv...@datastax.com
  wrote:
 
  On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe et...@the-rowes.com
 wrote:
   Hi.
  
   We've been running a 7-node cluster with RF 3, QUORUM reads/writes in
   our
   production environment for a few months.  It's been consistently
 stable
   during this period, particularly once we got out maintenance strategy
   fully
   worked out (per node, one repair a week, one major compaction a week,
   the
   latter due to the nature of our data model and usage).  While this
   cluster
   started, back in June or so, on the 0.7 series, it's been running
 0.8.3
   for
   a while now with no issues.  We upgraded to 0.8.5 two days ago, having
   tested the upgrade in our staging cluster (with an otherwise identical
   configuration) previously and verified that our application's various
   use
   cases appeared successful.
  
   One of our nodes suffered a disk failure yesterday.  We attempted to
   replace
   the dead node by placing a new node at OldNode.initial_token - 1 with
   auto_bootstrap on.  A few things went awry from there:
  
   1. We never saw the new node in bootstrap mode; it became available
   pretty
   much immediately upon joining the ring, and never reported a joining
   state.  I did verify that auto_bootstrap was on.
  
   2. I mistakenly ran repair on the new node rather than removetoken on
   the
   old node, due to a delightful mental error.  The repair got nowhere
   fast, as
   it attempts to repair against the down node which throws an exception.
So I
   interrupted the repair, restarted the node to clear any pending
   validation
   compactions, and...
  
   3. Ran removetoken for the old node.
  
   4. We let this run for some time and saw eventually that all the nodes
   appeared to be done various compactions and were stuck at streaming.
   Many
   streams listed as open, none making any progress.
  
   5.  I observed an Rpc-related exception on the new node (where the
   removetoken was launched) and concluded that the streams were broken
 so
   the
   process wouldn't ever finish.
  
   6. Ran a removetoken force to get the dead node out of the mix.  

Re: New node unable to stream (0.8.5)

2011-09-15 Thread Ethan Rowe
On Thu, Sep 15, 2011 at 9:21 AM, Jonathan Ellis jbel...@gmail.com wrote:

 Where did the data loss come in?


The outcome of the analytical jobs run overnight while some of these repairs
were (not) running is consistent with what I would expect if perhaps 20-30%
of the source data was missing.  Given the strong consistency model we're
using, this is surprising to me, since the jobs did not report any read or
write failures.  I wonder if this is a consequence of the dead node being gone
and the new node being operational but having received basically none of its
hinted handoff streams.  Perhaps with streaming fixed the data will
reappear, which would be a happy outcome, but if not, I can reimport the
critical stuff from files.

Scrub is safe to run in parallel.


Is it somewhat analogous to a major compaction in terms of I/O impact, with
perhaps less greedy use of disk space?


 On Thu, Sep 15, 2011 at 8:08 AM, Ethan Rowe et...@the-rowes.com wrote:
  After further review, I'm definitely going to scrub all the original
 nodes
  in the cluster.
  We've lost some data as a result of this situation.  It can be restored,
 but
  the question is what to do with the problematic new node first.  I don't
  particularly care about the data that's on it, since I'm going to
 re-import
  the critical data from files anyway, and then I can recreate derivative
 data
  afterwards.  So it's purely a matter of getting the cluster healthy again
 as
  quickly as possible so I can begin that import process.
  Any issue with running scrubs on multiple nodes at a time, provided they
  aren't replication neighbors?
  On Thu, Sep 15, 2011 at 8:18 AM, Ethan Rowe et...@the-rowes.com wrote:
 
  I just noticed the following from one of Jonathan Ellis' messages
  yesterday:
 
  Added to NEWS:
 
 - After upgrading, run nodetool scrub against each node before
 running
   repair, moving nodes, or adding new ones.
 
 
  We did not do this, as it was not indicated as necessary in the news
 when
  we were dealing with the upgrade.
  So perhaps I need to scrub everything before going any further, though
 the
  question is what to do with the problematic node.  Additionally, it
 would be
  helpful to know if scrub will affect the hinted handoffs that have
  accumulated, as these seem likely to be part of the set of failing
 streams.
  On Thu, Sep 15, 2011 at 8:13 AM, Ethan Rowe et...@the-rowes.com
 wrote:
 
  Here's a typical log slice (not terribly informative, I fear):
 
    INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106 AntiEntropyService.java (line 884) Performing streaming repair of 1003 ranges with /10.34.90.8 for (29990798416657667504332586989223299634,54296681768153272037430773234349600451]
    INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java (line 181) Stream context metadata [/mnt/cassandra/data/events_production/FitsByShip-g-10-Data.db sections=88 progress=0/11707163 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-6-Data.db sections=1 progress=0/6918814 - 0%, /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db sections=260 progress=0/9091780 - 0%], 4 sstables.
    INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428 StreamOutSession.java (line 174) Streaming to /10.34.90.8
   ERROR [Thread-56] 2011-09-15 05:41:38,515 AbstractCassandraDaemon.java (line 139) Fatal exception in thread Thread[Thread-56,5,main]
   java.lang.NullPointerException
           at org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpConnection.java:174)
           at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:114)
 
  Not sure if the exception is related to the outbound streaming above;
  other nodes are actively trying to stream to this node, so perhaps it
 comes
  from those and temporal adjacency to the outbound stream is just
  coincidental.  I have other snippets that look basically identical to
 the
  above, except if I look at the logs to which this node is trying to
 stream,
  I see that it has concurrently opened a stream in the other direction,
 which
  could be the one that the exception pertains to.
 
  On Thu, Sep 15, 2011 at 7:41 AM, Sylvain Lebresne 
 sylv...@datastax.com
  wrote:
 
  On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe et...@the-rowes.com
 wrote:
   Hi.
  
   We've been running a 7-node cluster with RF 3, QUORUM reads/writes
 in
   our
   production environment for a few months.  It's been consistently
   stable
   during this period, particularly once we got out maintenance
 strategy
   fully
   worked out (per node, one repair a week, one major compaction a
 week,
   the
   latter due to the nature of our data model and usage).  While this
   cluster
   started, back in June or so, on the 0.7 series, it's been running
   0.8.3 for
   a while now with no issues.  We upgraded to 0.8.5 two days ago,
 

Re: New node unable to stream (0.8.5)

2011-09-15 Thread Jonathan Ellis
Hinted handoff doesn't use streaming mode, so it doesn't care.

(Streaming to Cassandra means sending raw sstable file ranges to
another node.  HH just uses the normal column-based write path.)

On Thu, Sep 15, 2011 at 8:24 AM, Ethan Rowe et...@the-rowes.com wrote:
 Thanks, Jonathan.  I'll try the workaround and see if that gets the streams
 flowing properly.
 As I mentioned before, we did not run scrub yet.  What is the consequence of
 letting the streams from the hinted handoffs complete if scrub hasn't been
 run on these nodes?
 I'm currently running scrub on one node to get a sense of the time frame.
 Thanks again.
 - Ethan

 On Thu, Sep 15, 2011 at 9:09 AM, Jonathan Ellis jbel...@gmail.com wrote:

 That means we missed a place we needed to special-case for backwards
 compatibility -- the workaround is, add an empty encryption_options
 section
 to cassandra.yaml:

 encryption_options:
    internode_encryption: none
    keystore: conf/.keystore
    keystore_password: cassandra
    truststore: conf/.truststore
    truststore_password: cassandra

 Created https://issues.apache.org/jira/browse/CASSANDRA-3212 to fix this.

 On Thu, Sep 15, 2011 at 7:13 AM, Ethan Rowe et...@the-rowes.com wrote:
  Here's a typical log slice (not terribly informative, I fear):
 
   INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106
  AntiEntropyService.java
  (l
  ine 884) Performing streaming repair of 1003 ranges with /10.34.90.8
  for
  (299
 
 
  90798416657667504332586989223299634,54296681768153272037430773234349600451]
   INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java (line
  181)
  Stream context metadata
  [/mnt/cassandra/data/events_production/FitsByShip-g-1
  0-Data.db sections=88 progress=0/11707163 - 0%,
  /mnt/cassandra/data/events_pr
  oduction/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 - 0%,
  /mnt/c
  assandra/data/events_production/FitsByShip-g-6-Data.db sections=1
  progress=0/
  6918814 - 0%,
  /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db s
  ections=260 progress=0/9091780 - 0%], 4 sstables.
   INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428
  StreamOutSession.java
  (lin
  e 174) Streaming to /10.34.90.8
  ERROR [Thread-56] 2011-09-15 05:41:38,515 AbstractCassandraDaemon.java
  (line
  139) Fatal exception in thread Thread[Thread-56,5,main]
  java.lang.NullPointerException
          at
  org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpC
  onnection.java:174)
          at
  org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConn
  ection.java:114)
 
  Not sure if the exception is related to the outbound streaming above;
  other
  nodes are actively trying to stream to this node, so perhaps it comes
  from
  those and temporal adjacency to the outbound stream is just
  coincidental.  I
  have other snippets that look basically identical to the above, except
  if I
  look at the logs to which this node is trying to stream, I see that it
  has
  concurrently opened a stream in the other direction, which could be the
  one
  that the exception pertains to.
 
  On Thu, Sep 15, 2011 at 7:41 AM, Sylvain Lebresne sylv...@datastax.com
  wrote:
 
  On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe et...@the-rowes.com
  wrote:
   Hi.
  
   We've been running a 7-node cluster with RF 3, QUORUM reads/writes in
   our
   production environment for a few months.  It's been consistently
   stable
   during this period, particularly once we got out maintenance strategy
   fully
   worked out (per node, one repair a week, one major compaction a week,
   the
   latter due to the nature of our data model and usage).  While this
   cluster
   started, back in June or so, on the 0.7 series, it's been running
   0.8.3
   for
   a while now with no issues.  We upgraded to 0.8.5 two days ago,
   having
   tested the upgrade in our staging cluster (with an otherwise
   identical
   configuration) previously and verified that our application's various
   use
   cases appeared successful.
  
   One of our nodes suffered a disk failure yesterday.  We attempted to
   replace
   the dead node by placing a new node at OldNode.initial_token - 1 with
   auto_bootstrap on.  A few things went awry from there:
  
   1. We never saw the new node in bootstrap mode; it became available
   pretty
   much immediately upon joining the ring, and never reported a
   joining
   state.  I did verify that auto_bootstrap was on.
  
   2. I mistakenly ran repair on the new node rather than removetoken on
   the
   old node, due to a delightful mental error.  The repair got nowhere
   fast, as
   it attempts to repair against the down node which throws an
   exception.
    So I
   interrupted the repair, restarted the node to clear any pending
   validation
   compactions, and...
  
   3. Ran removetoken for the old node.
  
   4. We let this run for some time and saw eventually that all the
   nodes
   appeared to be done various compactions and were stuck at streaming.
   Many
   

Re: New node unable to stream (0.8.5)

2011-09-15 Thread Jonathan Ellis
If you added the new node as a seed, it would ignore bootstrap mode.
And bootstrap / repair *do* use streaming so you'll want to re-run
repair post-scrub.  (No need to re-bootstrap since you're repairing.)

Scrub is a little less heavyweight than major compaction but same
ballpark.  It runs sstable-at-a-time so (as long as you haven't been
in the habit of forcing majors) space should not be a concern.
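
For reference, a rough sketch of the two pieces involved (addresses are
placeholders, adjust to your ring): the seed list lives under seed_provider in
cassandra.yaml, and the new node should not list itself there; scrub and
repair are ordinary nodetool operations.

    # cassandra.yaml on the new node -- seeds should point at established
    # nodes, not at the node itself, otherwise bootstrap is silently skipped
    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.34.90.8"

    # per upgraded node: scrub first, then re-run repair on the new node
    nodetool -h <host> scrub
    nodetool -h <new-node> repair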

On Thu, Sep 15, 2011 at 8:40 AM, Ethan Rowe et...@the-rowes.com wrote:
 On Thu, Sep 15, 2011 at 9:21 AM, Jonathan Ellis jbel...@gmail.com wrote:

 Where did the data loss come in?

 The outcome of the analytical jobs run overnight while some of these repairs
 were (not) running is consistent with what I would expect if perhaps 20-30%
 of the source data was missing.  Given the strong consistency model we're
 using, this is surprising to me, since the jobs did not report any read or
 write failures.  I wonder if this is a consequence of the dead node missing
 and the new node being operational but having received basically none of its
 hinted handoff streams.  Perhaps with streaming fixed the data will
 reappear, which would be a happy outcome, but if not, I can reimport the
 critical stuff from files.

 Scrub is safe to run in parallel.

 Is it somewhat analogous to a major compaction in terms of I/O impact, with
 perhaps less greedy use of disk space?


 On Thu, Sep 15, 2011 at 8:08 AM, Ethan Rowe et...@the-rowes.com wrote:
  After further review, I'm definitely going to scrub all the original
  nodes
  in the cluster.
  We've lost some data as a result of this situation.  It can be restored,
  but
  the question is what to do with the problematic new node first.  I don't
  particularly care about the data that's on it, since I'm going to
  re-import
  the critical data from files anyway, and then I can recreate derivative
  data
  afterwards.  So it's purely a matter of getting the cluster healthy
  again as
  quickly as possible so I can begin that import process.
  Any issue with running scrubs on multiple nodes at a time, provided they
  aren't replication neighbors?
  On Thu, Sep 15, 2011 at 8:18 AM, Ethan Rowe et...@the-rowes.com wrote:
 
  I just noticed the following from one of Jonathan Ellis' messages
  yesterday:
 
  Added to NEWS:
 
     - After upgrading, run nodetool scrub against each node before
  running
       repair, moving nodes, or adding new ones.
 
 
  We did not do this, as it was not indicated as necessary in the news
  when
  we were dealing with the upgrade.
  So perhaps I need to scrub everything before going any further, though
  the
  question is what to do with the problematic node.  Additionally, it
  would be
  helpful to know if scrub will affect the hinted handoffs that have
  accumulated, as these seem likely to be part of the set of failing
  streams.
  On Thu, Sep 15, 2011 at 8:13 AM, Ethan Rowe et...@the-rowes.com
  wrote:
 
  Here's a typical log slice (not terribly informative, I fear):
 
   INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106
  AntiEntropyService.java (l
  ine 884) Performing streaming repair of 1003 ranges with /10.34.90.8
  for
  (299
 
 
  90798416657667504332586989223299634,54296681768153272037430773234349600451]
   INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java
  (line
  181)
  Stream context metadata
  [/mnt/cassandra/data/events_production/FitsByShip-g-1
  0-Data.db sections=88 progress=0/11707163 - 0%,
  /mnt/cassandra/data/events_pr
  oduction/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 -
  0%,
  /mnt/c
  assandra/data/events_production/FitsByShip-g-6-Data.db sections=1
  progress=0/
  6918814 - 0%,
  /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db s
  ections=260 progress=0/9091780 - 0%], 4 sstables.
   INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428
  StreamOutSession.java
  (lin
  e 174) Streaming to /10.34.90.8
  ERROR [Thread-56] 2011-09-15 05:41:38,515
  AbstractCassandraDaemon.java
  (line
  139) Fatal exception in thread Thread[Thread-56,5,main]
  java.lang.NullPointerException
          at
  org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpC
  onnection.java:174)
          at
  org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConn
  ection.java:114)
 
  Not sure if the exception is related to the outbound streaming above;
  other nodes are actively trying to stream to this node, so perhaps it
  comes
  from those and temporal adjacency to the outbound stream is just
  coincidental.  I have other snippets that look basically identical to
  the
  above, except if I look at the logs to which this node is trying to
  stream,
  I see that it has concurrently opened a stream in the other direction,
  which
  could be the one that the exception pertains to.
 
  On Thu, Sep 15, 2011 at 7:41 AM, Sylvain Lebresne
  sylv...@datastax.com
  wrote:
 
  On Thu, Sep 15, 2011 at 1:16 PM, Ethan Rowe et...@the-rowes.com
  wrote:
   Hi.
  
   We've been running a 7-node cluster 

Re: New node unable to stream (0.8.5)

2011-09-15 Thread Ethan Rowe
On Thu, Sep 15, 2011 at 10:03 AM, Jonathan Ellis jbel...@gmail.com wrote:

 If you added the new node as a seed, it would ignore bootstrap mode.
 And bootstrap / repair *do* use streaming so you'll want to re-run
 repair post-scrub.  (No need to re-bootstrap since you're repairing.)


Ah, of course.  That's what happened; the chef recipe added the node to its
own seed list, which is a problem I thought we'd fixed but apparently not.
 That definitely explains the bootstrap issue.  But no matter, so long as
the repairs can eventually run.


 Scrub is a little less heavyweight than major compaction but same
 ballpark.  It runs sstable-at-a-time so (as long as you haven't been
 in the habit of forcing majors) space should not be a concern.


Cool.  We've deactivated all tasks against these nodes and will scrub them
all in parallel, apply the encryption options you specified, and see where
that gets us.  Thanks for the assistance.
- Ethan
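
A minimal sketch of the parallel scrub described above (host names are
placeholders; nodetool scrub with no further arguments should cover all
keyspaces on the node):

    # kick off scrub on several nodes at once and wait for all of them
    for h in node1.example.com node2.example.com node3.example.com; do
        nodetool -h "$h" scrub &
    done
    wait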


 On Thu, Sep 15, 2011 at 8:40 AM, Ethan Rowe et...@the-rowes.com wrote:
  On Thu, Sep 15, 2011 at 9:21 AM, Jonathan Ellis jbel...@gmail.com
 wrote:
 
  Where did the data loss come in?
 
  The outcome of the analytical jobs run overnight while some of these
 repairs
  were (not) running is consistent with what I would expect if perhaps
 20-30%
  of the source data was missing.  Given the strong consistency model we're
  using, this is surprising to me, since the jobs did not report any read
 or
  write failures.  I wonder if this is a consequence of the dead node
 missing
  and the new node being operational but having received basically none of
 its
  hinted handoff streams.  Perhaps with streaming fixed the data will
  reappear, which would be a happy outcome, but if not, I can reimport the
  critical stuff from files.
 
  Scrub is safe to run in parallel.
 
  Is it somewhat analogous to a major compaction in terms of I/O impact,
 with
  perhaps less greedy use of disk space?
 
 
  On Thu, Sep 15, 2011 at 8:08 AM, Ethan Rowe et...@the-rowes.com
 wrote:
   After further review, I'm definitely going to scrub all the original
   nodes
   in the cluster.
   We've lost some data as a result of this situation.  It can be
 restored,
   but
   the question is what to do with the problematic new node first.  I
 don't
   particularly care about the data that's on it, since I'm going to
   re-import
   the critical data from files anyway, and then I can recreate
 derivative
   data
   afterwards.  So it's purely a matter of getting the cluster healthy
   again as
   quickly as possible so I can begin that import process.
   Any issue with running scrubs on multiple nodes at a time, provided
 they
   aren't replication neighbors?
   On Thu, Sep 15, 2011 at 8:18 AM, Ethan Rowe et...@the-rowes.com
 wrote:
  
   I just noticed the following from one of Jonathan Ellis' messages
   yesterday:
  
   Added to NEWS:
  
  - After upgrading, run nodetool scrub against each node before
   running
repair, moving nodes, or adding new ones.
  
  
   We did not do this, as it was not indicated as necessary in the news
   when
   we were dealing with the upgrade.
   So perhaps I need to scrub everything before going any further,
 though
   the
   question is what to do with the problematic node.  Additionally, it
   would be
   helpful to know if scrub will affect the hinted handoffs that have
   accumulated, as these seem likely to be part of the set of failing
   streams.
   On Thu, Sep 15, 2011 at 8:13 AM, Ethan Rowe et...@the-rowes.com
   wrote:
  
   Here's a typical log slice (not terribly informative, I fear):
  
INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106
   AntiEntropyService.java (l
   ine 884) Performing streaming repair of 1003 ranges with /
 10.34.90.8
   for
   (299
  
  
  
 90798416657667504332586989223299634,54296681768153272037430773234349600451]
INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java
   (line
   181)
   Stream context metadata
   [/mnt/cassandra/data/events_production/FitsByShip-g-1
   0-Data.db sections=88 progress=0/11707163 - 0%,
   /mnt/cassandra/data/events_pr
   oduction/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 -
   0%,
   /mnt/c
   assandra/data/events_production/FitsByShip-g-6-Data.db sections=1
   progress=0/
   6918814 - 0%,
   /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db s
   ections=260 progress=0/9091780 - 0%], 4 sstables.
INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428
   StreamOutSession.java
   (lin
   e 174) Streaming to /10.34.90.8
   ERROR [Thread-56] 2011-09-15 05:41:38,515
   AbstractCassandraDaemon.java
   (line
   139) Fatal exception in thread Thread[Thread-56,5,main]
   java.lang.NullPointerException
   at
   org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpC
   onnection.java:174)
   at
   org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConn
   ection.java:114)
  
   Not sure if the exception is related to the outbound 

Re: Configuring multi DC cluster

2011-09-15 Thread Konstantin Naryshkin
Wait, his nodes are going SC, SC, AT, AT. Shouldn't they go SC, AT, SC, AT? By 
which I mean that if he adds another node to the ring (or lowers the 
replication factor), he will have a node that is under-utilized. The rings in 
his data centers have the tokens:
SC: 0, 1
AT: 85070591730234615865843651857942052864, 
85070591730234615865843651857942052865

They should be:
SC: 0, 85070591730234615865843651857942052864
AT: 1, 85070591730234615865843651857942052865

Or did I forget/misread something?
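
A small sketch of the interleaving described above (plain Python, assuming the
full RandomPartitioner token space of 2**127):

    RING = 2 ** 127  # RandomPartitioner token space

    def dc_tokens(nodes_per_dc, dc_index):
        # one evenly spaced ring per DC, offset by a small per-DC constant
        return [i * (RING // nodes_per_dc) + dc_index for i in range(nodes_per_dc)]

    print(dc_tokens(2, 0))  # SC: [0, 85070591730234615865843651857942052864]
    print(dc_tokens(2, 1))  # AT: [1, 85070591730234615865843651857942052865]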

- Original Message -
From: aaron morton aa...@thelastpickle.com
To: user@cassandra.apache.org
Sent: Tuesday, September 13, 2011 6:19:16 PM
Subject: Re: Configuring multi DC cluster

Looks good to me. Last time I checked the Partitioner did not take the DC into 
consideration https://issues.apache.org/jira/browse/CASSANDRA-3047


Good luck.







-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com


On 14/09/2011, at 8:41 AM, Anand Somani wrote:


Hi,

Just trying to setup a cluster of 4 nodes for multiDC scenario - with 2 nodes 
in each DC. This is all on the same box just for testing the configuration 
aspect. I have configured things as

• PropertyFile:

  127.0.0.4=SC:rack1
  127.0.0.5=SC:rack2
  127.0.0.6=AT:rack1
  127.0.0.7=AT:rack2
  # default for unknown nodes
  default=SC:rack1

• Setup initial tokens as advised
• Configured keyspace with SC:2, AT:2
• ring looks like:

  Address    Status State   Load       Owns    Token
                                               85070591730234615865843651857942052865
  127.0.0.4  Up     Normal  464.98 KB  50.00%  0
  127.0.0.5  Up     Normal  464.98 KB  0.00%   1
  127.0.0.6  Up     Normal  464.99 KB  50.00%  85070591730234615865843651857942052864
  127.0.0.7  Up     Normal  464.99 KB  0.00%   85070591730234615865843651857942052865

Is that what I should expect the ring to look like? Is there anything else I 
should be testing/validating to make sure that things are configured correctly 
for NTS?

Thanks
Anand



[BETA RELEASE] Apache Cassandra 1.0.0-beta1 released

2011-09-15 Thread Sylvain Lebresne
The Cassandra team is pleased to announce the release of the first beta for
the future Apache Cassandra 1.0.

Let me first stress that this is beta software and as such is *not* ready for
production use.

The goal of this release is to give a preview of what will be Cassandra 1.0
and more importantly to get wider testing before the final release. So please
help us make Cassandra 1.0 the best it can possibly be by testing this beta
release and reporting any problems you may encounter[3,4]. You can have a look
at the change log[1] and the release notes[2] to see where Cassandra 1.0
differs from the 0.8 series.

Apache Cassandra 1.0.0-beta1[5] is available as usual from the cassandra
website:

 http://cassandra.apache.org/download/

Thank you for your help in testing and have fun with it.

[1]: http://goo.gl/evCW0 (CHANGES.txt)
[2]: http://goo.gl/HbNsV (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA
[4]: user@cassandra.apache.org
[5]: https://svn.apache.org/repos/asf/cassandra/tags/cassandra-1.0.0-beta1


Re: Configuring multi DC cluster

2011-09-15 Thread Anand Somani
You are right, good catch, thanks!

On Thu, Sep 15, 2011 at 8:28 AM, Konstantin Naryshkin
konstant...@a-bb.netwrote:

 Wait, his nodes are going SC, SC, AT, AT. Shouldn't they go SC, AT, SC, AT?
 By which I mean that if he adds another node to the ring (or lowers the
 replication factor), he will have a node that is under-utilized. The rings
 in his data centers have the tokens:
 SC: 0, 1
 AT: 85070591730234615865843651857942052864,
 85070591730234615865843651857942052865

 They should be:
 SC: 0, 85070591730234615865843651857942052864
 AT: 1, 85070591730234615865843651857942052865

 Or did I forget/misread something?

 - Original Message -
 From: aaron morton aa...@thelastpickle.com
 To: user@cassandra.apache.org
 Sent: Tuesday, September 13, 2011 6:19:16 PM
 Subject: Re: Configuring multi DC cluster

 Looks good to me. Last time I checked the Partitioner did not take the DC
 into consideration https://issues.apache.org/jira/browse/CASSANDRA-3047


 Good luck.







 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com


 On 14/09/2011, at 8:41 AM, Anand Somani wrote:


 Hi,

 Just trying to setup a cluster of 4 nodes for multiDC scenario - with 2
 nodes in each DC. This is all on the same box just for testing the
 configuration aspect. I have configured things as

• PropertyFile


• 127.0.0.4=SC:rack1 127.0.0.5=SC:rack2 127.0.0.6=AT:rack1
 127.0.0.7=AT:rack2 # default for unknown nodes default=SC:rack1
•
 Setup initial tokens as - advised • configured keyspace with SC:2, AT:2
• ring looks like


• Address Status State Load Owns Token
 85070591730234615865843651857942052865 127.0.0.4 Up Normal 464.98 KB 50.00%
 0 127.0.0.5 Up Normal 464.98 KB 0.00% 1 127.0.0.6 Up Normal 464.99 KB 50.00%
 85070591730234615865843651857942052864 127.0.0.7 Up Normal 464.99 KB 0.00%
 85070591730234615865843651857942052865

 Is that what I should expect the ring to look like? Is there anything else
 I should be testing/validating to make sure that things are configured
 correctly for NTS?

 Thanks
 Anand




Re: New node unable to stream (0.8.5)

2011-09-15 Thread Ethan Rowe


 Cool.  We've deactivated all tasks against these nodes and will scrub them
 all in parallel, apply the encryption options you specified, and see where
 that gets us.  Thanks for the assistance.


To follow up:
* We scrubbed all the nodes
* We applied the encryption options specified
* A repair is continuing (for about an hour so far, perhaps more) on the
new, problematic node; it's successfully streaming data from its neighbors
and has built up a roughly equivalent data volume on disk

We'll see if the data is fully restored once this process completes.  Even
if it isn't, it seems likely that the cluster will be in a healthy state
soon, so we can reimport as necessary and we'll be out of the woods.

Now that I've said all that, something will inevitably go wrong, but until
that happens, thanks again for the feedback.
- Ethan



 On Thu, Sep 15, 2011 at 8:40 AM, Ethan Rowe et...@the-rowes.com wrote:
  On Thu, Sep 15, 2011 at 9:21 AM, Jonathan Ellis jbel...@gmail.com
 wrote:
 
  Where did the data loss come in?
 
  The outcome of the analytical jobs run overnight while some of these
 repairs
  were (not) running is consistent with what I would expect if perhaps
 20-30%
  of the source data was missing.  Given the strong consistency model
 we're
  using, this is surprising to me, since the jobs did not report any read
 or
  write failures.  I wonder if this is a consequence of the dead node
 missing
  and the new node being operational but having received basically none of
 its
  hinted handoff streams.  Perhaps with streaming fixed the data will
  reappear, which would be a happy outcome, but if not, I can reimport the
  critical stuff from files.
 
  Scrub is safe to run in parallel.
 
  Is it somewhat analogous to a major compaction in terms of I/O impact,
 with
  perhaps less greedy use of disk space?
 
 
  On Thu, Sep 15, 2011 at 8:08 AM, Ethan Rowe et...@the-rowes.com
 wrote:
   After further review, I'm definitely going to scrub all the original
   nodes
   in the cluster.
   We've lost some data as a result of this situation.  It can be
 restored,
   but
   the question is what to do with the problematic new node first.  I
 don't
   particularly care about the data that's on it, since I'm going to
   re-import
   the critical data from files anyway, and then I can recreate
 derivative
   data
   afterwards.  So it's purely a matter of getting the cluster healthy
   again as
   quickly as possible so I can begin that import process.
   Any issue with running scrubs on multiple nodes at a time, provided
 they
   aren't replication neighbors?
   On Thu, Sep 15, 2011 at 8:18 AM, Ethan Rowe et...@the-rowes.com
 wrote:
  
   I just noticed the following from one of Jonathan Ellis' messages
   yesterday:
  
   Added to NEWS:
  
  - After upgrading, run nodetool scrub against each node before
   running
repair, moving nodes, or adding new ones.
  
  
   We did not do this, as it was not indicated as necessary in the news
   when
   we were dealing with the upgrade.
   So perhaps I need to scrub everything before going any further,
 though
   the
   question is what to do with the problematic node.  Additionally, it
   would be
   helpful to know if scrub will affect the hinted handoffs that have
   accumulated, as these seem likely to be part of the set of failing
   streams.
   On Thu, Sep 15, 2011 at 8:13 AM, Ethan Rowe et...@the-rowes.com
   wrote:
  
   Here's a typical log slice (not terribly informative, I fear):
  
INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,106
   AntiEntropyService.java (l
   ine 884) Performing streaming repair of 1003 ranges with /
 10.34.90.8
   for
   (299
  
  
  
 90798416657667504332586989223299634,54296681768153272037430773234349600451]
INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,427 StreamOut.java
   (line
   181)
   Stream context metadata
   [/mnt/cassandra/data/events_production/FitsByShip-g-1
   0-Data.db sections=88 progress=0/11707163 - 0%,
   /mnt/cassandra/data/events_pr
   oduction/FitsByShip-g-11-Data.db sections=169 progress=0/6133240 -
   0%,
   /mnt/c
   assandra/data/events_production/FitsByShip-g-6-Data.db sections=1
   progress=0/
   6918814 - 0%,
   /mnt/cassandra/data/events_production/FitsByShip-g-12-Data.db s
   ections=260 progress=0/9091780 - 0%], 4 sstables.
INFO [AntiEntropyStage:2] 2011-09-15 05:41:36,428
   StreamOutSession.java
   (lin
   e 174) Streaming to /10.34.90.8
   ERROR [Thread-56] 2011-09-15 05:41:38,515
   AbstractCassandraDaemon.java
   (line
   139) Fatal exception in thread Thread[Thread-56,5,main]
   java.lang.NullPointerException
   at
   org.apache.cassandra.net.IncomingTcpConnection.stream(IncomingTcpC
   onnection.java:174)
   at
   org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConn
   ection.java:114)
  
   Not sure if the exception is related to the outbound streaming
 above;
   other nodes are actively trying to stream to this node, so perhaps
 it
   comes
   from 

Re: [BETA RELEASE] Apache Cassandra 1.0.0-beta1 released

2011-09-15 Thread mcasandra
This is great news! Is it possible to do a write-up of the main changes, like
leveled compaction, and explain them a little? I get lost reading JIRA and
sometimes it is difficult to follow the thread. It looks like there are some
major changes in this release.


--
View this message in context: 
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/BETA-RELEASE-Apache-Cassandra-1-0-0-beta1-released-tp6797930p6798330.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: [BETA RELEASE] Apache Cassandra 1.0.0-beta1 released

2011-09-15 Thread amulya rattan
Isn't this LevelDB-style implementation based on Google's LevelDB?
http://code.google.com/p/leveldb/
From what I know, it's quite fast.

On Thu, Sep 15, 2011 at 4:04 PM, mcasandra mohitanch...@gmail.com wrote:

 This is a great new! Is it possible to do a write-up of main changes like
 Leveldb and explain it a little bit. I get lost reading JIRA and
 sometimes
 is difficult to follow the thread. It looks like there are some major
 changes in this release.


 --
 View this message in context:
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/BETA-RELEASE-Apache-Cassandra-1-0-0-beta1-released-tp6797930p6798330.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at
 Nabble.com.



Re: [BETA RELEASE] Apache Cassandra 1.0.0-beta1 released

2011-09-15 Thread Jeremiah Jordan
Is it possible to update an existing column family with 
{sstable_compression: SnappyCompressor, 
compaction_strategy: LeveledCompactionStrategy}?  Or will I have to make a 
new column family and migrate my data to it?


-Jeremiah

On 09/15/2011 01:01 PM, Sylvain Lebresne wrote:

The Cassandra team is pleased to announce the release of the first beta for
the future Apache Cassandra 1.0.

Let me first stress that this is beta software and as such is *not* ready for
production use.

The goal of this release is to give a preview of what will be Cassandra 1.0
and more importantly to get wider testing before the final release. So please
help us make Cassandra 1.0 be the best it possibly could by testing this beta
release and reporting any problem you may encounter[3,4]. You can have a look
at the change log[1] and the release notes[2] to see where Cassandra 1.0
differs from the 0.8 series.

Apache Cassandra 1.0.0-beta1[5] is available as usual from the cassandra
website:

  http://cassandra.apache.org/download/

Thank you for your help in testing and have fun with it.

[1]: http://goo.gl/evCW0 (CHANGES.txt)
[2]: http://goo.gl/HbNsV (NEWS.txt)
[3]: https://issues.apache.org/jira/browse/CASSANDRA
[4]: user@cassandra.apache.org
[5]: https://svn.apache.org/repos/asf/cassandra/tags/cassandra-1.0.0-beta1


Re: [BETA RELEASE] Apache Cassandra 1.0.0-beta1 released

2011-09-15 Thread Anand Somani
So I should be able to do a rolling upgrade from 0.7 to 1.0 (not there in the
release notes, but I assume that is work in progress).

Thanks

On Thu, Sep 15, 2011 at 1:36 PM, amulya rattan talk2amu...@gmail.comwrote:

 Isn't this levelDB implementation for Google's LevelDB?
 http://code.google.com/p/leveldb/
 From what I know, its quite fast..


 On Thu, Sep 15, 2011 at 4:04 PM, mcasandra mohitanch...@gmail.com wrote:

 This is a great new! Is it possible to do a write-up of main changes like
 Leveldb and explain it a little bit. I get lost reading JIRA and
 sometimes
 is difficult to follow the thread. It looks like there are some major
 changes in this release.


 --
 View this message in context:
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/BETA-RELEASE-Apache-Cassandra-1-0-0-beta1-released-tp6797930p6798330.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at
 Nabble.com.





Calling All Boston Cassandra Users

2011-09-15 Thread Chris Herron
Hi All,

Just a quick note to encourage those of you who are in the Greater Boston area 
to join:

1) The Boston subgroup of the Cassandra LinkedIn Group:
http://www.linkedin.com/groups?home=gid=3973913

2) The Boston Cassandra Meetup Group:
http://www.meetup.com/Boston-Cassandra-Users

The first meetup was held this week and it would be great to see more faces at 
the next one!

Cheers,

Chris

Re: Get CL ONE / NTS

2011-09-15 Thread aaron morton
 What I’m missing is a clear behavior for CL.ONE. I’m unsure about what nodes 
 are used by ONE and how the filtering of missing data/error is done. I’ve 
 landed in ReadCallback.java but error handling is out of my reach for the 
 moment.

Start with StorageProxy.fetch() to see which nodes are considered to be part of 
the request. ReadCallback.ctor() will decide which are actually involved, based 
on the CL and on whether RR is enabled.

At CL ONE there is no checking of the replica responses for consistency, as 
there is only one response. If RR is enabled it will start from 
ReadCallback.maybeResolveForRepair(). 
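
As an aside, here is a rough sketch of the "read at ONE in the local DC, retry
at a stronger CL on a miss" approach discussed below. It assumes pycassa as the
client and made-up keyspace / column family names, so treat it as an
illustration rather than a recommendation:

    import pycassa
    from pycassa import ConsistencyLevel

    pool = pycassa.ConnectionPool('MyKeyspace', server_list=['dc2-node1:9160'])
    cf = pycassa.ColumnFamily(pool, 'MyData')

    def read_with_fallback(key):
        try:
            # fast local read; may miss if the row has not replicated to this DC yet
            return cf.get(key, read_consistency_level=ConsistencyLevel.ONE)
        except pycassa.NotFoundException:
            # fall back to a stronger, cross-DC consistency level
            return cf.get(key, read_consistency_level=ConsistencyLevel.EACH_QUORUM)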

Cheers



-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 15/09/2011, at 7:21 PM, Pierre Chalamet wrote:

 I do not agree here. I trade “consistency” (it's more a data miss than a 
 consistency issue here) for performance in my case.
 I’m okay to handle the popping of the Spanish inquisition in the current DC 
 by triggering a new read with a stronger CL somewhere else (for example in 
 other DCs).
 If the data is nowhere to be found or nothing is reachable, well, it’s sad 
 but true but it will be the end of the game. Fine.
  
 What I’m missing is a clear behavior for CL.ONE. I’m unsure about what nodes 
 are used by ONE and how the filtering of missing data/error is done. I’ve 
 landed in ReadCallback.java but error handling is out of my reach for the 
 moment.
  
 Thanks,
 - Pierre
  
 From: aaron morton [mailto:aa...@thelastpickle.com] 
 Sent: Thursday, September 15, 2011 12:27 AM
 To: user@cassandra.apache.org
 Subject: Re: Get CL ONE / NTS
  
 Are you advising CL.ONE does not worth the game when considering
 read performance ?
 Consistency is not performance, it's a whole new thing to tune in your 
 application. If you have performance issues deal with those as performance 
 issues, better code / data model / hard ware. 
  
 By the way, I do not have consistency problem at all - data is only written
 once
 Nobody expects a consistency problem. It's chief weapon is surprise. Surprise 
 and fear. It's two weapons are fear and surprise. And so forth 
 http://www.youtube.com/watch?v=Ixgc_FGam3s
  
 If you write at LOCAL QUORUM in DC 1 and DC 2 is down at the start of the 
 request, a hint will be stored in DC 1. Some time later when DC 2 comes back 
 that hint will be sent to DC 2. If in the mean time you read from DC 2 at CL 
 ONE you will not get that change. With Read Repair enabled it will repair in 
 the background and you may get a different response on the next read (Am 
 guessing here, cannot remember exactly how RR works cross DC) 
  
  Cheers
  
  
  
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
  
 On 15/09/2011, at 10:07 AM, Pierre Chalamet wrote:
 
 
 Thanks Aaron, didn't see your answer before mine.
 
 I do agree on 2/, I might have read errors. Good suggestion to use
 EACH_QUORUM - it could be a good trade-off to read at this level if ONE
 fails.
 
 Maybe using LOCAL_QUORUM might be a good answer and will avoid headache
 after all. Are you advising CL.ONE does not worth the game when considering
 read performance ?
 
 By the way, I do not have consistency problem at all - data is only written
 once (and if more it is always the same data) and read several times across
 DC. I only have replication problems. That's why I'm more inclined to use
 CL.ONE for read if possible.
 
 Thanks,
 - Pierre
 
 
 -Original Message-
 From: aaron morton [mailto:aa...@thelastpickle.com] 
 Sent: Wednesday, September 14, 2011 11:48 PM
 To: user@cassandra.apache.org; pie...@chalamet.net
 Subject: Re: Get CL ONE / NTS
 
 Your current approach to Consistency opens the door to some inconsistent
 behavior. 
 
 
 1/ Will I have an error because DC2 does not have any copy of the data ?
 If you read from DC2 at CL ONE and the data is not replicated it will not be
 returned. 
 
 
 2/ Will Cassandra try to get the data from DC1 if nothing is found in DC2
 ?
 Not at CL ONE. If you used CL EACH QUORUM then the read will go to all the
 DC's. If DC2 is behind DC1 then you will get the data form DC1. 
 
 
 3/ In case of partial replication to DC2, will I see sometimes errors
 about servers not holding the data in DC2 ?
 Depending on the API call and the client, working at CL ONE, you will see
 either errors or missing data. 
 
 
 4/ Does Get CL ONE fail as soon as the fastest server to answer tells it
 does not have the data, or does it wait until all servers say they do not
 have the data?
 yes
 
 Consider 
 
 using LOCAL_QUORUM for write and read will make things a bit more
 consistent but not add inter-DC overhead to the request latency. It is still
 possible to not get data in DC2 if it is totally disconnected from DC1.
 
 write at LOCAL_QUORUM and read at EACH_QUORUM, so you can always write;
 read requests in DC2 will fail if DC1 is not reachable.
 
 Hope that helps. 
 
 
 -

Re: Configuring multi DC cluster

2011-09-15 Thread aaron morton
Yes my bad.

http://wiki.apache.org/cassandra/Operations#Token_selection

Thanks

-
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 16/09/2011, at 6:50 AM, Anand Somani wrote:

 You are right, good catch, thanks!
 
 On Thu, Sep 15, 2011 at 8:28 AM, Konstantin Naryshkin konstant...@a-bb.net 
 wrote:
 Wait, his nodes are going SC, SC, AT, AT. Shouldn't they go SC, AT, SC, AT? 
 By which I mean that if he adds another node to the ring (or lowers the 
 replication factor), he will have a node that is under-utilized. The rings in 
 his data centers have the tokens:
 SC: 0, 1
 AT: 85070591730234615865843651857942052864, 
 85070591730234615865843651857942052865
 
 They should be:
 SC: 0, 85070591730234615865843651857942052864
 AT: 1, 85070591730234615865843651857942052865
 
 Or did I forget/misread something?
 
 - Original Message -
 From: aaron morton aa...@thelastpickle.com
 To: user@cassandra.apache.org
 Sent: Tuesday, September 13, 2011 6:19:16 PM
 Subject: Re: Configuring multi DC cluster
 
 Looks good to me. Last time I checked the Partitioner did not take the DC 
 into consideration https://issues.apache.org/jira/browse/CASSANDRA-3047
 
 
 Good luck.
 
 
 
 
 
 
 
 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 
 On 14/09/2011, at 8:41 AM, Anand Somani wrote:
 
 
 Hi,
 
 Just trying to setup a cluster of 4 nodes for multiDC scenario - with 2 nodes 
 in each DC. This is all on the same box just for testing the configuration 
 aspect. I have configured things as
 
• PropertyFile
 
 
• 127.0.0.4=SC:rack1 127.0.0.5=SC:rack2 127.0.0.6=AT:rack1 
 127.0.0.7=AT:rack2 # default for unknown nodes default=SC:rack1
•
 Setup initial tokens as - advised • configured keyspace with SC:2, AT:2
• ring looks like
 
 
• Address Status State Load Owns Token 
 85070591730234615865843651857942052865 127.0.0.4 Up Normal 464.98 KB 50.00% 0 
 127.0.0.5 Up Normal 464.98 KB 0.00% 1 127.0.0.6 Up Normal 464.99 KB 50.00% 
 85070591730234615865843651857942052864 127.0.0.7 Up Normal 464.99 KB 0.00% 
 85070591730234615865843651857942052865
 
 Is that what I should expect the ring to look like? Is there anything else I 
 should be testing/validating to make sure that things are configured 
 correctly for NTS?
 
 Thanks
 Anand
 
 



Re: LevelDB type compaction

2011-09-15 Thread Jonathan Ellis
On Thu, Sep 15, 2011 at 3:05 PM, mcasandra mohitanch...@gmail.com wrote:
 With Leveldb is it going to make reads slower

No.

Qualified: compared to major compaction under the tiered strategy,
leveled reads will usually be a little slower for update-heavy loads.
(For insert-mostly workloads compaction doesn't really matter.)  But
major compaction is not practical in production; you want something
that gives consistently good performance, rather than good performance
once a day or once a week and then degrades quickly.

 my understanding is it
 will create more smaller files

Yes.

 and updates could be scattered all over
 before compaction?

No, updates to a given row will still be in a single sstable.

 Also, when memtables are flushed, does it create small
 files too since memtable size would generally be bigger than 2-4MB?

Level0 (newly flushed) sstables are not size-limited.  This is one of
a handful of differences we have over leveldb itself, which remains a
good overview (http://leveldb.googlecode.com/svn/trunk/doc/impl.html).

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: [BETA RELEASE] Apache Cassandra 1.0.0-beta1 released

2011-09-15 Thread Jonathan Ellis
You should be able to update it, which will leave existing sstables
untouched but new ones will be generated compressed.  (You could issue
scrub to rewrite the existing ones compressed too, if you wanted to
force that.)
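
A sketch of what that update might look like from cassandra-cli, assuming the
1.0 attribute names (compression_options, compaction_strategy) and a made-up
keyspace / column family; the scrub afterwards is only needed if you want the
existing sstables rewritten:

    update column family MyCF
        with compression_options = {sstable_compression: SnappyCompressor}
        and compaction_strategy = 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy';

    nodetool -h <host> scrub MyKeyspace MyCF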

On Thu, Sep 15, 2011 at 3:44 PM, Jeremiah Jordan
jeremiah.jor...@morningstar.com wrote:
 Is it possible to update an existing column family with {sstable_compression:
  SnappyCompressor, compaction_strategy: LeveledCompactionStrategy}?  Or will I
  have to make a new column family and migrate my data to it?

 -Jeremiah

 On 09/15/2011 01:01 PM, Sylvain Lebresne wrote:

 The Cassandra team is pleased to announce the release of the first beta
 for
 the future Apache Cassandra 1.0.

 Let me first stress that this is beta software and as such is *not* ready
 for
 production use.

 The goal of this release is to give a preview of what will be Cassandra
 1.0
 and more importantly to get wider testing before the final release. So
 please
 help us make Cassandra 1.0 be the best it possibly could by testing this
 beta
 release and reporting any problem you may encounter[3,4]. You can have a
 look
 at the change log[1] and the release notes[2] to see where Cassandra 1.0
 differs from the 0.8 series.

 Apache Cassandra 1.0.0-beta1[5] is available as usual from the cassandra
 website:

  http://cassandra.apache.org/download/

 Thank you for your help in testing and have fun with it.

 [1]: http://goo.gl/evCW0 (CHANGES.txt)
 [2]: http://goo.gl/HbNsV (NEWS.txt)
 [3]: https://issues.apache.org/jira/browse/CASSANDRA
 [4]: user@cassandra.apache.org
 [5]: https://svn.apache.org/repos/asf/cassandra/tags/cassandra-1.0.0-beta1




-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: [BETA RELEASE] Apache Cassandra 1.0.0-beta1 released

2011-09-15 Thread Jonathan Ellis
Should, although we've only tested 0.8-to-1.0 directly.  That would be
a useful report to contribute!

On Thu, Sep 15, 2011 at 3:45 PM, Anand Somani meatfor...@gmail.com wrote:
 So I should be able to do rolling upgrade from 0.7 to 1.0 (not there in the
 release notes, but I assume that is work in progress).

 Thanks

 On Thu, Sep 15, 2011 at 1:36 PM, amulya rattan talk2amu...@gmail.com
 wrote:

 Isn't this levelDB implementation for Google's
 LevelDB? http://code.google.com/p/leveldb/
 From what I know, its quite fast..

 On Thu, Sep 15, 2011 at 4:04 PM, mcasandra mohitanch...@gmail.com wrote:

 This is a great new! Is it possible to do a write-up of main changes like
 Leveldb and explain it a little bit. I get lost reading JIRA and
 sometimes
 is difficult to follow the thread. It looks like there are some major
 changes in this release.


 --
 View this message in context:
 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/BETA-RELEASE-Apache-Cassandra-1-0-0-beta1-released-tp6797930p6798330.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at
 Nabble.com.






-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com


Re: Cassandra -f problem

2011-09-15 Thread Hernán Quevedo
Thanks a lot. The problem was that every terminal I open on Debian 6 lacks
JAVA_HOME and PATH; I have to export them every time I start the virtual
machine. By the way, I have Debian and Cassandra running inside VMware Workstation.

Thanks again. I'm following the README file.
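
For the per-terminal issue, the usual fix is to put the exports in a shell
startup file so every new terminal picks them up. A sketch, assuming the JDK
was unpacked under /opt/java (adjust the directory name to your install):

    # ~/.profile or ~/.bashrc
    export JAVA_HOME=/opt/java/jdk1.7.0
    export PATH="$JAVA_HOME/bin:$PATH"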

On Mon, Sep 12, 2011 at 11:37 PM, Roshan Dawrani roshandawr...@gmail.comwrote:

 Hi,

 Do you have JAVA_HOME exported? If not, can you export it and retry?

 Cheers.

 On Tue, Sep 13, 2011 at 8:59 AM, Hernán Quevedo 
 alexandros.c@gmail.com wrote:

 Hi, Roshan. This is great support, amazing support; not used to it :)

 Thanks for the reply.

 Well I think java is installed correctly, I mean, the java -version
 command works on a terminal, so the PATH env variable is correctly set,
 right?
  I downloaded JDK 7, put it in /opt/java/, and then set the path.


 But, the eclipse icon says it can't find any JRE or JDK, which is weird
 because of what I said above... but... but... what else could it be?

 Thanks!


 On Sun, Sep 11, 2011 at 10:05 PM, Roshan Dawrani roshandawr...@gmail.com
  wrote:

 Hi,

 Cassandra starts JVM as $JAVA -ea -cp $CLASSPATH

  Looks like $JAVA is coming up empty in your case, hence the error exec
  -ea not found.

 Do you not have java installed? Please install it and set JAVA_HOME
 appropriately and retry.

 Cheers.


 On Mon, Sep 12, 2011 at 8:23 AM, Hernán Quevedo 
 alexandros.c@gmail.com wrote:

 Hi, all.

 I'm new at this and haven't been able to install Cassandra on Debian
 6. After uncompressing the tar and creating the var/log and var/lib
 directories, the command bin/cassandra -f results in the message exec:
 357 -ea not found, preventing Cassandra from running the process the README
 file says it is supposed to start.

 Any help would be very appreciated.

 Thnx!




 --
 Roshan
 Blog: http://roshandawrani.wordpress.com/
 Twitter: @roshandawrani http://twitter.com/roshandawrani
 Skype: roshandawrani




 --
  Είναι η θέληση των Θεών. (It is the will of the Gods.)




 --
 Roshan
 Blog: http://roshandawrani.wordpress.com/
 Twitter: @roshandawrani http://twitter.com/roshandawrani
 Skype: roshandawrani




-- 
Είναι η θέληση των Θεών. (It is the will of the Gods.)