Re: Unable to remove dead node from cluster.

2015-09-25 Thread Jeff Jirsa
Apparently this was reported back in May: 
https://issues.apache.org/jira/browse/CASSANDRA-9510

- Jeff

From:  Dikang Gu
Reply-To:  "user@cassandra.apache.org"
Date:  Friday, September 25, 2015 at 11:31 AM
To:  cassandra
Subject:  Re: Unable to remove dead node from cluster.

The NPE is thrown when the node tries to handleStateLeft, because it cannot find 
the tokens associated with the node. Can we just ignore the NPE and continue to 
remove the endpoint from the ring?

On Fri, Sep 25, 2015 at 10:52 AM, Dikang Gu  wrote:
@Jeff, yeah, I ran the nodetool grep, and in my case some nodes return "301", 
and some nodes return "300". And 300 is the correct number of nodes in my 
cluster. 

So it does look like an inconsistency issue; can you open a jira for this? Also, 
I'm looking for a quick fix/patch for this.

On Fri, Sep 25, 2015 at 7:43 AM, Nate McCall  wrote:
A few other folks have reported issues with lingering dead nodes on large 
clusters - Jason Brown *just* gave an excellent gossip presentation at the 
summit regarding gossip optimizations for large clusters. 

Gossip is in the process of being refactored (here's at least one of the 
issues: https://issues.apache.org/jira/browse/CASSANDRA-9667), but it would be 
worth opening an issue with as much information as you can provide to, at the 
very least, have information available for others. 

On Fri, Sep 25, 2015 at 7:08 AM, Jeff Jirsa  wrote:
The stack trace is similar to one I recall seeing recently, but I don't have it 
in front of me. This is an outside chance that is not at all certain to be the 
case.

For EACH of the hundreds of nodes in your cluster, I suggest you run 

nodetool status | egrep "(^UN|^DN)" | wc -l 

and count to see if every node really has every other node in its ring 
properly. 
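
A rough sketch of automating that check from one machine (Scala shelling out to 
nodetool; the host list, and the assumption that each node's JMX port is 
reachable via nodetool -h from where this runs, are placeholders to adapt):

import scala.sys.process._

// Hypothetical list of node addresses; substitute your own cluster's hosts.
val hosts = Seq("10.210.165.1", "10.210.165.2", "10.210.165.55")

// Run `nodetool -h <host> status` against every node and count the UN/DN lines,
// mirroring the egrep | wc -l pipeline above.
val counts = hosts.map { host =>
  val out = Seq("nodetool", "-h", host, "status").!!
  host -> out.split("\n").count(l => l.startsWith("UN") || l.startsWith("DN"))
}

// A node whose count differs from the rest is a candidate for the gossip
// inconsistency being discussed in this thread.
counts.foreach { case (h, n) => println(s"$h sees $n UN/DN nodes") }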

I suspect, but am not at all sure, that you have inconsistencies you’re not yet 
aware of (for example, if you expect that you have 100 nodes in the cluster, 
I’m betting that the query above returns 99 on at least one of the nodes).  If 
this is the case, please reply so that you and I can submit a Jira and compare 
our stack traces and we can find the underlying root cause of this together. 

- Jeff

From: Dikang Gu
Reply-To: "user@cassandra.apache.org"
Date: Thursday, September 24, 2015 at 9:10 PM
To: cassandra 

Subject: Re: Unable to remove dead node from cluster.

@Jeff, I just use JMX to connect to one node, run unsafeAssassinateEndpoint, and 
pass in the "10.210.165.55" IP address.
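
(For reference, doing the same thing programmatically looks roughly like the 
sketch below, using plain JMX from Scala. The MBean name 
org.apache.cassandra.net:type=Gossiper and the single-String operation signature 
are assumptions about what 2.1 exposes; verify against your build before running.)

import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

// Connect to the JMX port (7199 by default) of any live node.
val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://live-node.example.com:7199/jmxrmi")
val connector = JMXConnectorFactory.connect(url)
val mbsc = connector.getMBeanServerConnection

// Invoke unsafeAssassinateEndpoint on the Gossiper MBean for the dead node's address.
val gossiper = new ObjectName("org.apache.cassandra.net:type=Gossiper")
mbsc.invoke(gossiper, "unsafeAssassinateEndpoint",
  Array[AnyRef]("10.210.165.55"), Array("java.lang.String"))

connector.close()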

Yes, we have hundreds of other nodes in the nodetool status output as well.

On Tue, Sep 22, 2015 at 11:31 PM, Jeff Jirsa  wrote:
When you run unsafeAssassinateEndpoint, to which host are you connected, and 
what argument are you passing?

Are there other nodes in the ring that you’re not including in the ‘nodetool 
status’ output?


From: Dikang Gu
Reply-To: "user@cassandra.apache.org"
Date: Tuesday, September 22, 2015 at 10:09 PM
To: cassandra
Cc: "d...@cassandra.apache.org"
Subject: Re: Unable to remove dead node from cluster.

ping.

On Mon, Sep 21, 2015 at 11:51 AM, Dikang Gu  wrote:
I have tried all of them; none of them worked. 
1. decommission: the host had a hardware issue, and I cannot connect to it.
2. remove: there is no Host ID, so removenode did not work.
3. unsafeAssassinateEndpoint: it throws the NPE I pasted before; can we fix it?

Thanks
Dikang.

On Mon, Sep 21, 2015 at 11:11 AM, Sebastian Estevez 
 wrote:

Order is decommission, remove, assassinate.

Which have you tried?

On Sep 21, 2015 10:47 AM, "Dikang Gu"  wrote:
Hi there, 

I have a dead node in our cluster, which is in a weird state right now and 
cannot be removed from the cluster.

The nodestatus shows:
Datacenter: DC1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load  Tokens  Owns  Host ID  Rack
DN  10.210.165.55  ?     256     ?     null     r1

I tried the unsafeAssassinateEndpoint, but got exception like:
2015-09-18_23:21:40.79760 INFO  23:21:40 InetAddress /10.210.165.55 is now DOWN
2015-09-18_23:21:40.80667 ERROR 23:21:40 Exception in thread 
Thread[GossipStage:1,5,main]
2015-09-18_23:21:40.80668 java.lang.NullPointerException: null
2015-09-18_23:21:40.80669   at 
org.apache.cassandra.service.StorageService.getApplicationStateValue(StorageService.java:1584)
 ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80669   at 
org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1592)
 ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80670   at 
org.apache.cassandra.service.StorageService.handleStateLeft(StorageService.java:1822)
 ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80671   at 
org.apache.cassandra.service.StorageService.onChange(StorageService.java:1

Re: Running Cassandra on Java 8 u60..

2015-09-25 Thread Jeff Jirsa
We saw no problems with 8u60.


From:   on behalf of Kevin Burton
Reply-To:  "user@cassandra.apache.org"
Date:  Friday, September 25, 2015 at 5:08 PM
To:  "user@cassandra.apache.org"
Subject:  Running Cassandra on Java 8 u60..

Any issues with running Cassandra 2.0.16 on Java 8? I remember there is 
long-standing advice about not changing the GC, but not about the underlying 
version of Java. 

Thoughts?

-- 
We’re hiring if you know of any awesome Java Devops or Linux Operations 
Engineers!

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile






Re: Running Cassandra on Java 8 u60..

2015-09-25 Thread Stefano Ortolani
I think those were referring to Java7 and G1GC (early versions were buggy).

Cheers,
Stefano


On Fri, Sep 25, 2015 at 5:08 PM, Kevin Burton  wrote:

> Any issues with running Cassandra 2.0.16 on Java 8? I remember there is
> long-standing advice about not changing the GC, but not about the underlying
> version of Java.
>
> Thoughts?
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>


Running Cassandra on Java 8 u60..

2015-09-25 Thread Kevin Burton
Any issues with running Cassandra 2.0.16 on Java 8? I remember there is
long-standing advice about not changing the GC, but not about the underlying
version of Java.

Thoughts?

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile



Re: memory usage problem of Metadata.tokenMap.tokenToHost

2015-09-25 Thread Robert Coli
On Sun, Sep 20, 2015 at 9:22 AM, joseph gao  wrote:

>My application uses 2000+ keyspaces, and will dynamically create
> keyspaces and tables.
>

While I agree with your observation (and think you should file a ticket at
issues.apache.org and let the list know the URL) that there is low hanging
optimization fruit here...

... dynamically creating 2000+ keyspaces and associated tables sounds like
setting yourself up for a world of hurt. I would not be excited about the
prospect.

=Rob


Re: How to tune Cassandra or Java Driver to get lower latency when there are a lot of writes?

2015-09-25 Thread Benyi Wang
Hi Ryan,

As I said, saveToCassandra doesn't support "DELETE". This is why I modified
the code of spark-cassandra-connector to allow me to issue DELETEs. What I
changed is how an RDD row is bound into a batch of CQL PreparedStatements.
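
Outside the connector, the core of that idea with the plain Java driver looks
roughly like the sketch below (the table name, columns, and the isDelete flag on
each row are invented placeholders, not the actual modified writer):

import com.datastax.driver.core.{BatchStatement, Cluster}

// Hypothetical row shape: a flag decides whether it becomes an INSERT or a DELETE.
case class Row(id: String, value: String, isDelete: Boolean)

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_ks")

val insert = session.prepare("INSERT INTO my_table (id, value) VALUES (?, ?)")
val delete = session.prepare("DELETE FROM my_table WHERE id = ?")

// Bind each row to the matching prepared statement and pack the group into an
// UNLOGGED batch, roughly what the modified writer does for a chunk of the RDD.
// Fire-and-forget here for brevity; in practice you would collect the futures.
def writeBatch(rows: Seq[Row]): Unit = {
  val batch = new BatchStatement(BatchStatement.Type.UNLOGGED)
  rows.foreach { r =>
    if (r.isDelete) batch.add(delete.bind(r.id))
    else batch.add(insert.bind(r.id, r.value))
  }
  session.executeAsync(batch)
}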



On Fri, Sep 25, 2015 at 7:22 AM, Ryan Svihla  wrote:

> Why aren’t you using saveToCassandra (
> https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md)?
> They have a number of locality aware optimizations that will probably
> exceed your by hand bulk loading (especially if you’re not doing it inside
> something like foreach partition).
>
> Also you can easily tune up and down the size of those tasks and therefore
> batches to minimize harm on the prod system.
>
> On Sep 24, 2015, at 5:37 PM, Benyi Wang  wrote:
>
> I use Spark and spark-cassandra-connector with a customized Cassandra
> writer (spark-cassandra-connector doesn’t support DELETE). Basically the
> writer works as follows:
>
>- Bind a row in Spark RDD with either INSERT/Delete PreparedStatement
>- Create a BatchStatement for multiple rows
>- Write to Cassandra.
>
> I knew using CQLBulkOutputFormat would be better, but it doesn't support
> DELETE.
> ​
>
> On Thu, Sep 24, 2015 at 1:27 PM, Gerard Maas 
> wrote:
>
>> How are you loading the data? I mean, what insert method are you using?
>>
>> On Thu, Sep 24, 2015 at 9:58 PM, Benyi Wang 
>> wrote:
>>
>>> I have a cassandra cluster provides data to a web service. And there is
>>> a daily batch load writing data into the cluster.
>>>
>>>- Without the batch loading, the service’s Latency 99thPercentile is
>>>3ms. But during the load, it jumps to 90ms.
>>>- I checked cassandra keyspace’s ReadLatency.99thPercentile, which
>>>jumps to 1ms from 600 microsec.
>>>- The service’s cassandra java driver request 99thPercentile was
>>>90ms during the load
>>>
>>> The java driver took the most time. I knew the Cassandra servers are
>>> busy in writing, but I want to know what kinds of metrics can identify
>>> where is the bottleneck so that I can tune it.
>>>
>>> I’m using Cassandra 2.1.8 and Cassandra Java Driver 2.1.5.
>>> ​
>>>
>>
>>
>
> Regards,
>
> Ryan Svihla
>
>


Re: High read latency

2015-09-25 Thread Jaydeep Chovatia
Please find histogram attached.

On Fri, Sep 25, 2015 at 12:20 PM, Ryan Svihla  wrote:

> if everything is in ram there could be a number of issues unrelated to
> Cassandra and there could be hardware limitations or contention problems.
> Otherwise cell count can really deeply impact reads, all ram or not, and
> some of this is because of the nature of GC and some of it is the age of
> the sstable format (which is due to be revamped in 3.0). Also partition
> size can matter just because of physics, if one of those is a 1gb
> partition, the network interface can only move that back across the wire so
> quickly not to mention the GC issues you’d run into.
>
> Anyway this is why I asked for the histograms, I wanted to get cell count
> and partition size. I’ve seen otherwise very stout hardware get slow on
> reads of large results because either a bottleneck was hit somewhere, or
> the CPU got slammed with GC, or other processes running on the machine were
> contending with Cassandra.
>
>
> On Sep 25, 2015, at 12:45 PM, Jaydeep Chovatia 
> wrote:
>
> I understand that but everything is in RAM (my data dir is tmpfs) and my
> row is not that wide approx. less than 5MB in size. So my question is if
> everything is in RAM then why does it take 43ms latency?
>
> On Fri, Sep 25, 2015 at 7:54 AM, Ryan Svihla  wrote:
>
>> if you run:
>>
>> nodetool cfhistograms  
>>
>> On the given table and that will tell you how wide your rows are getting.
>> At some point you can get wide enough rows that just the physics of
>> retrieving them all take some time.
>>
>>
>> On Sep 25, 2015, at 9:21 AM, sai krishnam raju potturi <
>> pskraj...@gmail.com> wrote:
>>
>> Jaydeep; since your primary key involves a clustering column, you may be
>> having pretty wide rows. The read would be sequential. The latency could be
>> acceptable, if the read were to involve really wide rows.
>>
>> If your primary key was like ((a,b)) without the clustering column, it's
>> like reading a key value pair, and 40ms latency may have been a concern.
>>
>> Bottom Line : The latency depends on how wide the row is.
>>
>> On Tue, Sep 22, 2015 at 1:27 PM, sai krishnam raju potturi <
>> pskraj...@gmail.com> wrote:
>>
>>> thanks for the information. Posting the query too would be of help.
>>>
>>> On Tue, Sep 22, 2015 at 11:56 AM, Jaydeep Chovatia <
>>> chovatia.jayd...@gmail.com> wrote:
>>>
 Please find required details here:

 -  Number of req/s

 2k reads/s

 -  Schema details

 create table test {

 a timeuuid,

 b bigint,

 c int,

 d int static,

 e int static,

 f int static,

 g int static,

 h int,

 i text,

 j text,

 k text,

 l text,

 m set

 n bigint

 o bigint

 p bigint

 q bigint

 r int

 s text

 t bigint

 u text

 v text

 w text

 x bigint

 y bigint

 z bigint,

 primary key ((a, b), c)

 };

 -  JVM settings about the heap

 Default settings

 -  Execution time of the GC

 Avg. 400ms. I do not see long pauses of GC anywhere in the log file.

 On Tue, Sep 22, 2015 at 5:34 AM, Leleu Eric 
 wrote:

> Hi,
>
>
>
>
>
> Before speaking about tuning, can you provide some additional
> information ?
>
>
>
> -  Number of req/s
>
> -  Schema details
>
> -  JVM settings about the heap
>
> -  Execution time of the GC
>
>
>
> 43ms for a read latency may be acceptable according to the number of
> request per second.
>
>
>
>
>
> Eric
>
>
>
> *De :* Jaydeep Chovatia [mailto:chovatia.jayd...@gmail.com]
> *Envoyé :* mardi 22 septembre 2015 00:07
> *À :* user@cassandra.apache.org
> *Objet :* High read latency
>
>
>
> Hi,
>
>
>
> My application issues more read requests than write, I do see that
> under load cfstats for one of the table is quite high around 43ms
>
>
>
> Local read count: 114479357
>
> Local read latency: 43.442 ms
>
> Local write count: 22288868
>
> Local write latency: 0.609 ms
>
>
>
>
>
> Here is my node configuration:
>
> RF=3, Read/Write with QUORUM, 64GB RAM, 48 CPU core. I have only 5 GB
> of data on each node (and for experiment purpose I stored data in tmpfs)
>
>
>
> I've tried increasing concurrent_read count upto 512 but no help in
> read latency. CPU/Memory/IO looks fine on system.
>
>
>
> Any idea what should I tune?
>
>
>
> Jaydeep
>
> --
>
> Ce message et le

Re: To batch or not to batch: A question for fast inserts

2015-09-25 Thread Eric Stevens
Yep, my approach is definitely naive to hotspotting.  If someone had that
trouble, they could exhaust the iterator out of getReplicas() and
distribute their writes more evenly (which might result in better statement
distribution, but wouldn't change the workload on the cluster).  In the end
they're going to get in trouble with hotspotting regardless of async single
statements or batches.  The single statement async code prefers the first
replica returned, so this logic is consistent with the default model.
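
A rough sketch of that variation, patterned on the groupByFirstReplica code
quoted further down (the round-robin index and the use of the plain driver
Session are illustration only, not the code we actually run):

import scala.collection.JavaConverters._
import scala.util.control.NonFatal
import com.datastax.driver.core.{Host, Session, Statement}

// Rotate through the full replica set instead of always taking the first
// replica, so a hot partition's writes are spread across all of its replicas.
def groupByAnyReplica(statements: Seq[Statement])(implicit session: Session): Map[Host, Seq[Statement]] = {
  val meta = session.getCluster.getMetadata
  var i = 0
  statements.groupBy { st =>
    try {
      val replicas = meta.getReplicas(st.getKeyspace, st.getRoutingKey).asScala.toIndexedSeq
      i += 1
      if (replicas.isEmpty) null else replicas(i % replicas.size)
    } catch { case NonFatal(_) => null }
  }
}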

> Lots of folks are still stuck on maximum utilization, ironically these
same people tend to focus on using spindles for storage and so will
ultimately end up having to throttle ingest to allow compaction to catch up

Yeah, otherwise known as cost sensitivity, with the unfortunate side effect of
making it easy for a new operator to accidentally overwhelm a cluster, since
the warning signs look different than they do for most other data stores.

Straying a bit far afield here, but I actually think it would be a nice
feature if *by default* Cassandra artificially throttled writes as
compaction starts getting behind as an early warning sign (a feature you
could turn off with a flag).  Cassandra does a great job of absorbing
bursty writes, but unfortunately that masks (for the new operator) the
warning signs that your sustained write rate is more than the cluster can
handle.  Writes are still fast so you assume the cluster is healthy, and by
the time there's backpressure to the client, you're already possibly past
the point of simple recovery (eg you no longer have enough excess IO to
support bootstrapping new nodes).  That would also actually free up some
I/O to keep the cluster from tipping over so hard.

On Fri, Sep 25, 2015 at 12:14 PM, Ryan Svihla  wrote:

>
> I think my main point is still: unlogged token-aware batches are great,
> but if your writes are large enough, they may actually hurt rather than
> help, and likewise if your writes are too small, async alone is likely only
> going to hurt. I'd say the average user I've had to help (with my selection
> bias) has individual writes already on the large side of optimal, so
> batching frequently hurts them. Also they tend not to do async in the first
> place.
>
> In summary, batch or not is IMO the wrong area to focus on; total write
> payload sizing for your cluster is the factor to focus on, and however you
> get there is fantastic. More replies inline:
>
> On Sep 25, 2015, at 1:24 PM, Eric Stevens  wrote:
>
> > compaction usually is the limiter for most clusters, so the difference
> between async versus unlogged batch ends up being minor or worse..non
> existent cause the hardware and data model combination result in compaction
> being the main throttle.
>
> If your number of records to load per second is predetermined (as would be
> the case in any production use case), then this doesn't make any difference
> on compaction whether loaded as batches vs as single statements, your
> cluster needs to support the same number and shape of mutates either way.
>
>
> Not everyone is as grown up about their cluster sizing. Lots of folks are
> still stuck on maximum utilization, ironically these same people tend to
> focus on using spindles for storage and so will ultimately end up having to
> throttle ingest to allow compaction to catch up. Anyway in these admittedly
> awful situations throttling of ingest is all too common as the commit log
> can basically easily outstrip compaction.
>
>
> > if you add in token awareness to your batch..you’ve basically
> eliminated the primary complaint of using unlogged batches so why not do
> that.
>
> This is absolutely the right idea if your driver supports it, but the gain
> is smaller than I would have expected based on the warnings
> of imminent doom when we've had this conversation before.  If your driver
> supports token awareness, use that to group statements by primary replica
> and concurrently execute those that way.  Here's the code we're using (in
> Scala using the Java driver):
>
> def groupByFirstReplica()(implicit session: CQLSession): Map[Host, CQLBatch] 
> = {
>   val meta = session.getCluster.getMetadata
>   statements.groupBy { st =>
> try {
>   meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
> } catch { case NonFatal(e) =>
>   null
> }
>   } mapValues { st => CQLBatch(st) }
> }
>
> We now have a map of primary host to sub-batch for all the statements in
> our logical batch.  We can now do either of these (depending on how greedy
> we want to be in our client; Future.traverse is preferred and nicer,
> Future.sequence is greedier and more resource intensive):
>
> Future.sequence(groupByFirstReplica().values.map(_.execute())).map(_.flatten)
> Future.traverse(groupByFirstReplica().values) { _.execute() }.map(_.flatten)
>
> We get back Future[Iterable[ResultSet]] - this future completes when the
> logical batch's sub-batches have all completed.
>
> Note that with the DSE Jav

Re: High read latency

2015-09-25 Thread Ryan Svihla
if everything is in ram there could be a number of issues unrelated to 
Cassandra and there could be hardware limitations or contention problems. 
Otherwise cell count can really deeply impact reads, all ram or not, and some 
of this is because of the nature of GC and some of it is the age of the sstable 
format (which is due to be revamped in 3.0). Also partition size can matter 
just because of physics, if one of those is a 1gb partition, the network 
interface can only move that back across the wire so quickly not to mention the 
GC issues you’d run into.

Anyway this is why I asked for the histograms, I wanted to get cell count and 
partition size. I’ve seen otherwise very stout hardware get slow on reads of 
large results because either a bottleneck was hit somewhere, or the CPU got 
slammed with GC, or other processes running on the machine were contending with 
Cassandra.

> On Sep 25, 2015, at 12:45 PM, Jaydeep Chovatia  
> wrote:
> 
> I understand that but everything is in RAM (my data dir is tmpfs) and my row 
> is not that wide approx. less than 5MB in size. So my question is if 
> everything is in RAM then why does it take 43ms latency? 
> 
> On Fri, Sep 25, 2015 at 7:54 AM, Ryan Svihla  > wrote:
> if you run:
> 
> nodetool cfhistograms  
> 
> On the given table and that will tell you how wide your rows are getting. At 
> some point you can get wide enough rows that just the physics of retrieving 
> them all take some time. 
> 
> 
>> On Sep 25, 2015, at 9:21 AM, sai krishnam raju potturi > > wrote:
>> 
>> Jaydeep; since your primary key involves a clustering column, you may be 
>> having pretty wide rows. The read would be sequential. The latency could be 
>> acceptable, if the read were to involve really wide rows.
>> 
>> If your primary key was like ((a,b)) without the clustering column, it's 
>> like reading a key value pair, and 40ms latency may have been a concern. 
>> 
>> Bottom Line : The latency depends on how wide the row is.
>> 
>> On Tue, Sep 22, 2015 at 1:27 PM, sai krishnam raju potturi 
>> mailto:pskraj...@gmail.com>> wrote:
>> thanks for the information. Posting the query too would be of help.
>> 
>> On Tue, Sep 22, 2015 at 11:56 AM, Jaydeep Chovatia 
>> mailto:chovatia.jayd...@gmail.com>> wrote:
>> Please find required details here:
>> 
>> -  Number of req/s
>> 
>> 2k reads/s
>> 
>> -  Schema details
>> 
>> create table test {
>> 
>> a timeuuid,
>> 
>> b bigint,
>> 
>> c int,
>> 
>> d int static,
>> 
>> e int static,
>> 
>> f int static,
>> 
>> g int static,
>> 
>> h int,
>> 
>> i text,
>> 
>> j text,
>> 
>> k text,
>> 
>> l text,
>> 
>> m set
>> 
>> n bigint
>> 
>> o bigint
>> 
>> p bigint
>> 
>> q bigint
>> 
>> r int
>> 
>> s text
>> 
>> t bigint
>> 
>> u text
>> 
>> v text
>> 
>> w text
>> 
>> x bigint
>> 
>> y bigint
>> 
>> z bigint,
>> 
>> primary key ((a, b), c)
>> 
>> };
>> 
>> -  JVM settings about the heap
>> 
>> Default settings
>> 
>> -  Execution time of the GC
>> 
>> Avg. 400ms. I do not see long pauses of GC anywhere in the log file.
>> 
>> 
>> On Tue, Sep 22, 2015 at 5:34 AM, Leleu Eric > > wrote:
>> Hi,
>> 
>>  
>> 
>>  
>> 
>> Before speaking about tuning, can you provide some additional information ?
>> 
>>  
>> 
>> -  Number of req/s
>> 
>> -  Schema details
>> 
>> -  JVM settings about the heap
>> 
>> -  Execution time of the GC
>> 
>>  
>> 
>> 43ms for a read latency may be acceptable according to the number of request 
>> per second.
>> 
>>  
>> 
>>  
>> 
>> Eric
>> 
>>  
>> 
>> De : Jaydeep Chovatia [mailto:chovatia.jayd...@gmail.com 
>> ] 
>> Envoyé : mardi 22 septembre 2015 00:07
>> À : user@cassandra.apache.org 
>> Objet : High read latency
>> 
>>  
>> 
>> Hi,
>> 
>>  
>> 
>> My application issues more read requests than write, I do see that under 
>> load cfstats for one of the table is quite high around 43ms
>> 
>>  
>> 
>> Local read count: 114479357
>> 
>> Local read latency: 43.442 ms
>> 
>> Local write count: 22288868
>> 
>> Local write latency: 0.609 ms
>> 
>>  
>> 
>>  
>> 
>> Here is my node configuration:
>> 
>> RF=3, Read/Write with QUORUM, 64GB RAM, 48 CPU core. I have only 5 GB of 
>> data on each node (and for experiment purpose I stored data in tmpfs)
>> 
>>  
>> 
>> I've tried increasing concurrent_read count upto 512 but no help in read 
>> latency. CPU/Memory/IO looks fine on system.
>> 
>>  
>> 
>> Any idea what should I tune?
>> 
>>  
>> 
>> Jaydeep
>> 
>> 
>> 

Re: To batch or not to batch: A question for fast inserts

2015-09-25 Thread Ryan Svihla

I think my main point is still: unlogged token-aware batches are great, but if 
your writes are large enough, they may actually hurt rather than help, and 
likewise if your writes are too small, async alone is likely only going to hurt. 
I'd say the average user I've had to help (with my selection bias) has 
individual writes already on the large side of optimal, so batching frequently 
hurts them. Also they tend not to do async in the first place.
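
As a crude illustration of sizing by payload rather than by statement count, the
sketch below packs bound statements into unlogged batches capped at a target
byte size; the per-row size estimate is an assumption the caller supplies from
their own data model:

import com.datastax.driver.core.{BatchStatement, BoundStatement}

// Pack bound statements into UNLOGGED batches whose estimated payload stays near
// a target size, rather than using a fixed statement count. `estimateBytes` is a
// caller-supplied guess at the mutation size for one row.
def batchByPayload(statements: Seq[BoundStatement],
                   estimateBytes: BoundStatement => Int,
                   targetBytes: Int = 5 * 1024): Seq[BatchStatement] = {
  val batches = Seq.newBuilder[BatchStatement]
  var current = new BatchStatement(BatchStatement.Type.UNLOGGED)
  var size = 0
  statements.foreach { st =>
    val s = estimateBytes(st)
    // Start a new batch once adding this row would overshoot the target payload.
    if (size > 0 && size + s > targetBytes) {
      batches += current
      current = new BatchStatement(BatchStatement.Type.UNLOGGED)
      size = 0
    }
    current.add(st)
    size += s
  }
  if (size > 0) batches += current
  batches.result()
}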

In summary, batch or not is IMO the wrong area to focus on; total write payload 
sizing for your cluster is the factor to focus on, and however you get there is 
fantastic. More replies inline:

> On Sep 25, 2015, at 1:24 PM, Eric Stevens  wrote:
> 
> > compaction usually is the limiter for most clusters, so the difference 
> > between async versus unlogged batch ends up being minor or worse..non 
> > existent cause the hardware and data model combination result in compaction 
> > being the main throttle.
> 
> If your number of records to load per second is predetermined (as would be 
> the case in any production use case), then this doesn't make any difference 
> on compaction whether loaded as batches vs as single statements, your cluster 
> needs to support the same number and shape of mutates either way.

Not everyone is as grown up about their cluster sizing. Lots of folks are still 
stuck on maximum utilization, ironically these same people tend to focus on 
using spindles for storage and so will ultimately end up having to throttle 
ingest to allow compaction to catch up. Anyway in these admittedly awful 
situations throttling of ingest is all too common as the commit log can 
basically easily outstrip compaction. 

> 
> > if you add in token awareness to your batch..you’ve basically eliminated 
> > the primary complaint of using unlogged batches so why not do that. 
> 
> This is absolutely the right idea if your driver supports it, but the gain is 
> smaller than I would have expected based on the warnings of imminent doom 
> when we've had this conversation before.  If your driver supports token 
> awareness, use that to group statements by primary replica and concurrently 
> execute those that way.  Here's the code we're using (in Scala using the Java 
> driver):
> def groupByFirstReplica()(implicit session: CQLSession): Map[Host, CQLBatch] 
> = {
>   val meta = session.getCluster.getMetadata
>   statements.groupBy { st =>
> try {
>   meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
> } catch { case NonFatal(e) =>
>   null
> }
>   } mapValues { st => CQLBatch(st) }
> }
> We now have a map of primary host to sub-batch for all the statements in our 
> logical batch.  We can now do either of these (depending on how greedy we 
> want to be in our client; Future.traverse is preferred and nicer, 
> Future.sequence is greedier and more resource intensive):
> Future.sequence(groupByFirstReplica().values.map(_.execute())).map(_.flatten)
> Future.traverse(groupByFirstReplica().values) { _.execute() }.map(_.flatten)
> We get back Future[Iterable[ResultSet]] - this future completes when the 
> logical batch's sub-batches have all completed.
> 
> Note that with the DSE Java driver, for the above to succeed in its intent, 
> the statements need to be prepared statements (for st.getRoutingKey to return 
> non-null), and either the keyspace has to be fully defined in the CQL, or you 
> have to have set the correct keyspace when you created the connection (for 
> st.getKeyspace to return non-null).  Otherwise the values given to 
> meta.getReplicas will fail to resolve a primary host which results in doing 
> token-unaware batches (i.e. you'll get back a Map(null -> allStatements)).  
> However those same criteria are required for single statements to be token 
> aware.
> 

This is excellent stuff, my only concern with primary replicas is for people 
with uneven partitions, and the occasionally stupidly fat one. I’d rather 
spread those writes around the other replicas instead of beating up the primary 
one. However, for a well modeled partition key the approach you outline is 
probably optimal.

> 
> 
> 
> On Fri, Sep 25, 2015 at 7:30 AM, Ryan Svihla  > wrote:
> Generally this is all correct but I cannot emphasize enough how much this 
> “just depends” and today I generally move people to async inserts first 
> before trying to micro-optimize some things to keep in mind.
> 
> compaction usually is the limiter for most clusters, so the difference 
> between async versus unlogged batch ends up being minor or worse..non 
> existent cause the hardware and data model combination result in compaction 
> being the main throttle.
> if you add in token awareness to your batch..you’ve basically eliminated the 
> primary complaint of using unlogged batches so why not do that. When I was at 
> DataStax I made some similar suggestions for token aware batch after seeing 
> the perf improvements with Spark writes using unl

How to remove huge files with all expired data sooner?

2015-09-25 Thread Dongfeng Lu
Hi I have a table where I set TTL to only 7 days for all records and we keep 
pumping records in every day. In general, I would expect all data files for 
that table to have timestamps less than, say 8 or 9 days old, giving the system 
some time to work its magic. However, I see some files more than 9 days old 
occasionally. Last Friday, I saw 4 large files, each about 10G in size, with 
timestamps about 5, 4, 3, 2 weeks old. Interestingly they are all gone this 
Monday, leaving 1 new file 9 GB in size.

The compaction strategy is SizeTieredCompactionStrategy, and I can understand 
why the above happened. It seems we have 10G of data every week and when 
SizeTieredCompactionStrategy works to create various tiers, it just happened 
the file size for the next tier is 10G, and all the data is packed into this 
huge file. Then it starts the next cycle. Another week goes by, and another 10G 
file is created. This process continues until the minimum number of files of 
the same size is reached, which I think is 4 by default. Then it started to 
compact this set of 4 10G files. At this time, all data in these 4 files have 
expired so we end up with nothing or much smaller file if there is still some 
records with TTL left.

I have many tables like this, and I'd like to reclaim those spaces sooner. What 
would be the best way to do it? Should I run "nodetool compact" when I see two 
large files that are 2 weeks old? Are there configuration parameters I can tune 
to achieve the same effect? I looked through all the CQL Compaction 
Subproperties for STCS, but I am not sure how they can help here. Any 
suggestion is welcome.

BTW, I am using Cassandra 2.0.6.
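
(For reference, the STCS subproperties most directly aimed at purging expired
data are the tombstone ones; a sketch of setting them from the driver follows.
The keyspace/table names and threshold values are placeholders, and whether
these options behave as hoped on 2.0.6 is something to verify on a test cluster
before touching production.)

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

// Lower min_threshold so equal-sized SSTables are merged sooner, and allow
// single-SSTable tombstone compactions once enough of a file has expired.
session.execute(
  "ALTER TABLE my_ks.my_table WITH compaction = { " +
    "'class': 'SizeTieredCompactionStrategy', " +
    "'min_threshold': '2', " +
    "'tombstone_threshold': '0.2', " +
    "'tombstone_compaction_interval': '86400' }")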


Re: Unable to remove dead node from cluster.

2015-09-25 Thread Dikang Gu
The NPE is thrown when the node tries to handleStateLeft, because it cannot
find the tokens associated with the node. Can we just ignore the NPE and
continue to remove the endpoint from the ring?

On Fri, Sep 25, 2015 at 10:52 AM, Dikang Gu  wrote:

> @Jeff, yeah, I ran the nodetool grep, and in my case some nodes return
> "301" and some nodes return "300". And 300 is the correct number of nodes
> in my cluster.
>
> So it does look like an inconsistency issue; can you open a jira for this?
> Also, I'm looking for a quick fix/patch for this.
>
> On Fri, Sep 25, 2015 at 7:43 AM, Nate McCall 
> wrote:
>
>> A few other folks have reported issues with lingering dead nodes on large
>> clusters - Jason Brown *just* gave an excellent gossip presentation at the
>> summit regarding gossip optimizations for large clusters.
>>
>> Gossip is in the process of being refactored (here's at least one of the
>> issues: https://issues.apache.org/jira/browse/CASSANDRA-9667), but it
>> would be worth opening an issue with as much information as you can provide
>> to, at the very least, have information available for others.
>>
>> On Fri, Sep 25, 2015 at 7:08 AM, Jeff Jirsa 
>> wrote:
>>
>>> The stack trace is similar to one I recall seeing recently, but I don't
>>> have it in front of me. This is an outside chance that is not at all
>>> certain to be the case.
>>>
>>> For EACH of the hundreds of nodes in your cluster, I suggest you run
>>>
>>> nodetool status | egrep "(^UN|^DN)" | wc -l
>>>
>>> and count to see if every node really has every other node in its ring
>>> properly.
>>>
>>> I suspect, but am not at all sure, that you have inconsistencies you’re
>>> not yet aware of (for example, if you expect that you have 100 nodes in the
>>> cluster, I’m betting that the query above returns 99 on at least one of the
>>> nodes).  If this is the case, please reply so that you and I can submit a
>>> Jira and compare our stack traces and we can find the underlying root cause
>>> of this together.
>>>
>>> - Jeff
>>>
>>> From: Dikang Gu
>>> Reply-To: "user@cassandra.apache.org"
>>> Date: Thursday, September 24, 2015 at 9:10 PM
>>> To: cassandra
>>>
>>> Subject: Re: Unable to remove dead node from cluster.
>>>
>>> @Jeff, I just use JMX to connect to one node, run
>>> unsafeAssassinateEndpoint, and pass in the "10.210.165.55" IP address.
>>>
>>> Yes, we have hundreds of other nodes in the nodetool status output as
>>> well.
>>>
>>> On Tue, Sep 22, 2015 at 11:31 PM, Jeff Jirsa >> > wrote:
>>>
 When you run unsafeAssassinateEndpoint, to which host are you
 connected, and what argument are you passing?

 Are there other nodes in the ring that you’re not including in the
 ‘nodetool status’ output?


 From: Dikang Gu
 Reply-To: "user@cassandra.apache.org"
 Date: Tuesday, September 22, 2015 at 10:09 PM
 To: cassandra
 Cc: "d...@cassandra.apache.org"
 Subject: Re: Unable to remove dead node from cluster.

 ping.

 On Mon, Sep 21, 2015 at 11:51 AM, Dikang Gu  wrote:

> I have tried all of them; none of them worked.
> 1. decommission: the host had a hardware issue, and I cannot connect to
> it.
> 2. remove: there is no Host ID, so removenode did not work.
> 3. unsafeAssassinateEndpoint: it throws the NPE I pasted before;
> can we fix it?
>
> Thanks
> Dikang.
>
> On Mon, Sep 21, 2015 at 11:11 AM, Sebastian Estevez <
> sebastian.este...@datastax.com> wrote:
>
>> Order is decommission, remove, assassinate.
>>
>> Which have you tried?
>> On Sep 21, 2015 10:47 AM, "Dikang Gu"  wrote:
>>
>>> Hi there,
>>>
>>> I have a dead node in our cluster, which is in a weird state right now,
>>> and cannot be removed from the cluster.
>>>
>>> The nodestatus shows:
>>> Datacenter: DC1
>>> ===
>>> Status=Up/Down
>>> |/ State=Normal/Leaving/Joining/Moving
>>> --  Address  Load   Tokens  OwnsHost
>>> ID   Rack
>>> DN  10.210.165.55?  256 ?   null
>>>  r1
>>>
>>> I tried the unsafeAssassinateEndpoint, but got exception like:
>>> 2015-09-18_23:21:40.79760 INFO  23:21:40 InetAddress /10.210.165.55
>>> is now DOWN
>>> 2015-09-18_23:21:40.80667 ERROR 23:21:40 Exception in thread
>>> Thread[GossipStage:1,5,main]
>>> 2015-09-18_23:21:40.80668 java.lang.NullPointerException: null
>>> 2015-09-18_23:21:40.80669   at
>>> org.apache.cassandra.service.StorageService.getApplicationStateValue(StorageService.java:1584)
>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>>> 2015-09-18_23:21:40.80669   at
>>> org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1592)
>>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>

Re: To batch or not to batch: A question for fast inserts

2015-09-25 Thread Eric Stevens
> compaction usually is the limiter for most clusters, so the difference
between async versus unlogged batch ends up being minor or worse..non
existent cause the hardware and data model combination result in compaction
being the main throttle.

If your number of records to load per second is predetermined (as would be
the case in any production use case), then this doesn't make any difference
on compaction whether loaded as batches vs as single statements, your
cluster needs to support the same number and shape of mutates either way.

> if you add in token awareness to your batch..you’ve basically eliminated
the primary complaint of using unlogged batches so why not do that.

This is absolutely the right idea if your driver supports it, but the gain
is smaller than I would have expected based on the warnings
of imminent doom when we've had this conversation before.  If your driver
supports token awareness, use that to group statements by primary replica
and concurrently execute those that way.  Here's the code we're using (in
Scala using the Java driver):

def groupByFirstReplica()(implicit session: CQLSession): Map[Host, CQLBatch] = {
  val meta = session.getCluster.getMetadata
  statements.groupBy { st =>
try {
  meta.getReplicas(st.getKeyspace, st.getRoutingKey).iterator().next
} catch { case NonFatal(e) =>
  null
}
  } mapValues { st => CQLBatch(st) }
}

We now have a map of primary host to sub-batch for all the statements in
our logical batch.  We can now do either of these (depending on how greedy
we want to be in our client; Future.traverse is preferred and nicer,
Future.sequence is greedier and more resource intensive):

Future.sequence(groupByFirstReplica().values.map(_.execute())).map(_.flatten)
Future.traverse(groupByFirstReplica().values) { _.execute() }.map(_.flatten)

We get back Future[Iterable[ResultSet]] - this future completes when the
logical batch's sub-batches have all completed.

Note that with the DSE Java driver, for the above to succeed in its intent,
the statements need to be prepared statements (for st.getRoutingKey to
return non-null), and either the keyspace has to be fully defined in the
CQL, or you have to have set the correct keyspace when you created the
connection (for st.getKeyspace to return non-null).  Otherwise the values
given to meta.getReplicas will fail to resolve a primary host which results
in doing token-unaware batches (i.e. you'll get back a Map(null ->
allStatements)).  However those same criteria are required for single
statements to be token aware.
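
CQLBatch and its execute() aren't shown in the post; a minimal sketch of what
such a wrapper could look like (an assumption, not the poster's actual class,
with their CQLSession replaced by the plain driver Session) is:

import scala.concurrent.{Future, Promise}
import com.datastax.driver.core.{BatchStatement, ResultSet, Session, Statement}
import com.google.common.util.concurrent.{FutureCallback, Futures}

// Hypothetical stand-in for the CQLBatch above: wrap a group of statements in an
// UNLOGGED batch and expose the driver's async execution as a Scala Future.
case class CQLBatch(statements: Seq[Statement])(implicit session: Session) {
  def execute(): Future[ResultSet] = {
    val batch = new BatchStatement(BatchStatement.Type.UNLOGGED)
    statements.foreach(s => batch.add(s))
    val p = Promise[ResultSet]()
    // Bridge the driver's ListenableFuture to a Scala Promise.
    Futures.addCallback(session.executeAsync(batch), new FutureCallback[ResultSet] {
      def onSuccess(rs: ResultSet): Unit = p.success(rs)
      def onFailure(t: Throwable): Unit = p.failure(t)
    })
    p.future
  }
}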




On Fri, Sep 25, 2015 at 7:30 AM, Ryan Svihla  wrote:

> Generally this is all correct but I cannot emphasize enough how much this
> “just depends” and today I generally move people to async inserts first
> before trying to micro-optimize some things to keep in mind.
>
>
>- compaction usually is the limiter for most clusters, so the
>difference between async versus unlogged batch ends up being minor or
>worse..non existent cause the hardware and data model combination result in
>compaction being the main throttle.
>- if you add in token awareness to your batch..you’ve basically
>eliminated the primary complaint of using unlogged batches so why not do
>that. When I was at DataStax I made some similar suggestions for token
>aware batch after seeing the perf improvements with Spark writes using
>unlogged batch. Several others did as well so I’m not the first one with
>this idea.
>- write size makes, in my experience, the largest difference BY FAR
>in which is faster, and the number is largely irrelevant compared to the
>total payload size. Depending on the hardware etc., a good rule of thumb
>is writes below 1k bytes tend to get really inefficient and writes that are
>over 100k tend to slow down total throughput. I'll re-emphasize: this magic
>number has been different on almost every cluster I've tuned.
>
>
> In summary all this means is, too small or too large of writes are slow,
> and unlogged batches may involve some extra hops, if you eliminate the
> extra hops by token awareness then it just comes down to write size
> optimization.
>
> On Sep 24, 2015, at 5:18 PM, Eric Stevens  wrote:
>
> > I side-tracked some punctual benchmarks and stumbled on the
> observations of unlogged inserts being *A LOT* faster than the async
> counterparts.
>
> My own testing agrees very strongly with this.  When this topic came up on
> this list before, there was a concern that batch coordination produces GC
> pressure in your cluster because you're involving nodes which aren't *strictly
> speaking* necessary to be involved.
>
> Our own testing shows some small impact on this front, but really
> lightweight GC tuning mitigated the effects by putting a little more room
> in Xmn (if you're still on CMS garbage collector).  On G1GC (which is what
> we run in production) we weren't able to measure a difference.
>
> Our testing shows data loads being

Re: Unable to remove dead node from cluster.

2015-09-25 Thread Dikang Gu
@Jeff, yeah, I ran the nodetool grep, and in my case some nodes return
"301" and some nodes return "300". And 300 is the correct number of nodes
in my cluster.

So it does look like an inconsistency issue; can you open a jira for this?
Also, I'm looking for a quick fix/patch for this.

On Fri, Sep 25, 2015 at 7:43 AM, Nate McCall  wrote:

> A few other folks have reported issues with lingering dead nodes on large
> clusters - Jason Brown *just* gave an excellent gossip presentation at the
> summit regarding gossip optimizations for large clusters.
>
> Gossip is in the process of being refactored (here's at least one of the
> issues: https://issues.apache.org/jira/browse/CASSANDRA-9667), but it
> would be worth opening an issue with as much information as you can provide
> to, at the very least, have information available for others.
>
> On Fri, Sep 25, 2015 at 7:08 AM, Jeff Jirsa 
> wrote:
>
>> The stack trace is similar to one I recall seeing recently, but I don't
>> have it in front of me. This is an outside chance that is not at all certain
>> to be the case.
>>
>> For EACH of the hundreds of nodes in your cluster, I suggest you run
>>
>> nodetool status | egrep "(^UN|^DN)" | wc -l
>>
>> and count to see if every node really has every other node in its ring
>> properly.
>>
>> I suspect, but am not at all sure, that you have inconsistencies you’re
>> not yet aware of (for example, if you expect that you have 100 nodes in the
>> cluster, I’m betting that the query above returns 99 on at least one of the
>> nodes).  If this is the case, please reply so that you and I can submit a
>> Jira and compare our stack traces and we can find the underlying root cause
>> of this together.
>>
>> - Jeff
>>
>> From: Dikang Gu
>> Reply-To: "user@cassandra.apache.org"
>> Date: Thursday, September 24, 2015 at 9:10 PM
>> To: cassandra
>>
>> Subject: Re: Unable to remove dead node from cluster.
>>
>> @Jeff, I just use JMX to connect to one node, run
>> unsafeAssassinateEndpoint, and pass in the "10.210.165.55" IP address.
>>
>> Yes, we have hundreds of other nodes in the nodetool status output as
>> well.
>>
>> On Tue, Sep 22, 2015 at 11:31 PM, Jeff Jirsa 
>> wrote:
>>
>>> When you run unsafeAssassinateEndpoint, to which host are you connected,
>>> and what argument are you passing?
>>>
>>> Are there other nodes in the ring that you’re not including in the
>>> ‘nodetool status’ output?
>>>
>>>
>>> From: Dikang Gu
>>> Reply-To: "user@cassandra.apache.org"
>>> Date: Tuesday, September 22, 2015 at 10:09 PM
>>> To: cassandra
>>> Cc: "d...@cassandra.apache.org"
>>> Subject: Re: Unable to remove dead node from cluster.
>>>
>>> ping.
>>>
>>> On Mon, Sep 21, 2015 at 11:51 AM, Dikang Gu  wrote:
>>>
 I have tried all of them; none of them worked.
 1. decommission: the host had a hardware issue, and I cannot connect to
 it.
 2. remove: there is no Host ID, so removenode did not work.
 3. unsafeAssassinateEndpoint: it throws the NPE I pasted before; can
 we fix it?

 Thanks
 Dikang.

 On Mon, Sep 21, 2015 at 11:11 AM, Sebastian Estevez <
 sebastian.este...@datastax.com> wrote:

> Order is decommission, remove, assassinate.
>
> Which have you tried?
> On Sep 21, 2015 10:47 AM, "Dikang Gu"  wrote:
>
>> Hi there,
>>
>> I have a dead node in our cluster, which is in a weird state right now,
>> and cannot be removed from the cluster.
>>
>> The nodestatus shows:
>> Datacenter: DC1
>> ===
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address  Load   Tokens  OwnsHost
>> ID   Rack
>> DN  10.210.165.55?  256 ?   null
>>  r1
>>
>> I tried the unsafeAssassinateEndpoint, but got exception like:
>> 2015-09-18_23:21:40.79760 INFO  23:21:40 InetAddress /10.210.165.55
>> is now DOWN
>> 2015-09-18_23:21:40.80667 ERROR 23:21:40 Exception in thread
>> Thread[GossipStage:1,5,main]
>> 2015-09-18_23:21:40.80668 java.lang.NullPointerException: null
>> 2015-09-18_23:21:40.80669   at
>> org.apache.cassandra.service.StorageService.getApplicationStateValue(StorageService.java:1584)
>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>> 2015-09-18_23:21:40.80669   at
>> org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1592)
>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>> 2015-09-18_23:21:40.80670   at
>> org.apache.cassandra.service.StorageService.handleStateLeft(StorageService.java:1822)
>> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
>> 2015-09-18_23:21:40.80671   at
>> org.apache.cassandra.service.StorageService.onChange(StorageService.java:1495)
>> ~[apache-cassandra-2

Re: High read latency

2015-09-25 Thread Jaydeep Chovatia
I understand that but everything is in RAM (my data dir is tmpfs) and my
row is not that wide approx. less than 5MB in size. So my question is if
everything is in RAM then why does it take 43ms latency?

On Fri, Sep 25, 2015 at 7:54 AM, Ryan Svihla  wrote:

> if you run:
>
> nodetool cfhistograms  
>
> On the given table and that will tell you how wide your rows are getting.
> At some point you can get wide enough rows that just the physics of
> retrieving them all take some time.
>
>
> On Sep 25, 2015, at 9:21 AM, sai krishnam raju potturi <
> pskraj...@gmail.com> wrote:
>
> Jaydeep; since your primary key involves a clustering column, you may be
> having pretty wide rows. The read would be sequential. The latency could be
> acceptable, if the read were to involve really wide rows.
>
> If your primary key was like ((a,b)) without the clustering column, it's
> like reading a key value pair, and 40ms latency may have been a concern.
>
> Bottom Line : The latency depends on how wide the row is.
>
> On Tue, Sep 22, 2015 at 1:27 PM, sai krishnam raju potturi <
> pskraj...@gmail.com> wrote:
>
>> thanks for the information. Posting the query too would be of help.
>>
>> On Tue, Sep 22, 2015 at 11:56 AM, Jaydeep Chovatia <
>> chovatia.jayd...@gmail.com> wrote:
>>
>>> Please find required details here:
>>>
>>> -  Number of req/s
>>>
>>> 2k reads/s
>>>
>>> -  Schema details
>>>
>>> create table test {
>>>
>>> a timeuuid,
>>>
>>> b bigint,
>>>
>>> c int,
>>>
>>> d int static,
>>>
>>> e int static,
>>>
>>> f int static,
>>>
>>> g int static,
>>>
>>> h int,
>>>
>>> i text,
>>>
>>> j text,
>>>
>>> k text,
>>>
>>> l text,
>>>
>>> m set
>>>
>>> n bigint
>>>
>>> o bigint
>>>
>>> p bigint
>>>
>>> q bigint
>>>
>>> r int
>>>
>>> s text
>>>
>>> t bigint
>>>
>>> u text
>>>
>>> v text
>>>
>>> w text
>>>
>>> x bigint
>>>
>>> y bigint
>>>
>>> z bigint,
>>>
>>> primary key ((a, b), c)
>>>
>>> };
>>>
>>> -  JVM settings about the heap
>>>
>>> Default settings
>>>
>>> -  Execution time of the GC
>>>
>>> Avg. 400ms. I do not see long pauses of GC anywhere in the log file.
>>>
>>> On Tue, Sep 22, 2015 at 5:34 AM, Leleu Eric 
>>> wrote:
>>>
 Hi,





 Before speaking about tuning, can you provide some additional
 information ?



 -  Number of req/s

 -  Schema details

 -  JVM settings about the heap

 -  Execution time of the GC



 43ms for a read latency may be acceptable according to the number of
 request per second.





 Eric



 *De :* Jaydeep Chovatia [mailto:chovatia.jayd...@gmail.com]
 *Envoyé :* mardi 22 septembre 2015 00:07
 *À :* user@cassandra.apache.org
 *Objet :* High read latency



 Hi,



 My application issues more read requests than write, I do see that
 under load cfstats for one of the table is quite high around 43ms



 Local read count: 114479357

 Local read latency: 43.442 ms

 Local write count: 22288868

 Local write latency: 0.609 ms





 Here is my node configuration:

 RF=3, Read/Write with QUORUM, 64GB RAM, 48 CPU core. I have only 5 GB
 of data on each node (and for experiment purpose I stored data in tmpfs)



 I've tried increasing concurrent_read count upto 512 but no help in
 read latency. CPU/Memory/IO looks fine on system.



 Any idea what should I tune?



 Jaydeep

 --


Re: memory usage problem of Metadata.tokenMap.tokenToHost

2015-09-25 Thread Ryan Svihla
In practice there are not many good reasons to use that many keyspaces and 
tables. If the use case is multi-tenancy then you're almost always better off 
just using a combination of version tables and tenantId to give you flexibility 
as well as separation of client data. If you have that many data types then it 
may be worth considering a blob or text value for the data and then a data type 
column so you can serialize and deserialize on the client. 
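
As a sketch of the table shape being described (keyspace, table, and column
names are invented for illustration): one table shared by all tenants, keyed by
tenant, with an opaque payload plus a type column, instead of per-tenant
keyspaces:

import com.datastax.driver.core.Cluster

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

// One table for every tenant: tenant_id is part of the partition key, payload is
// an opaque blob, and data_type tells the client how to deserialize it.
session.execute(
  "CREATE TABLE IF NOT EXISTS shared_ks.tenant_data (" +
    " tenant_id text, entity_id text, version int," +
    " data_type text, payload blob," +
    " PRIMARY KEY ((tenant_id, entity_id), version))")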




> On Sep 22, 2015, at 3:09 AM, horschi  wrote:
> 
> Hi Joseph,
> 
> I think 2000 keyspaces might be just too much. Fewer keyspaces (and CFs) will 
> probably work much better.
> 
> kind regards,
> Christian
> 
> 
> On Tue, Sep 22, 2015 at 9:29 AM, joseph gao  > wrote:
> Hi, anybody could help me?
> 
> 2015-09-21 0:47 GMT+08:00 joseph gao  >:
> ps : that's the code in java drive , in MetaData.TokenMap.build:
> for (KeyspaceMetadata keyspace : keyspaces)
> {
> ReplicationStrategy strategy = keyspace.replicationStrategy();
> Map> ksTokens = (strategy == null)
> ? makeNonReplicatedMap(tokenToPrimary)
> : strategy.computeTokenToReplicaMap(tokenToPrimary, ring);
> 
> tokenToHosts.put(keyspace.getName(), ksTokens);
> tokenToPrimary is all same, ring is all same, and if strategy is all same , 
> strategy.computeTokenToReplicaMap would return 'same' map but different 
> object( cause every calling returns a new HashMap
> 
> 2015-09-21 0:22 GMT+08:00 joseph gao  >:
> cassandra: 2.1.7
> java driver: datastax java driver 2.1.6
> 
> Here is the problem:
> My application uses 2000+ keyspaces, and will dynamically create keyspaces
> and tables. In the Java client, Metadata.tokenMap.tokenToHost then uses about
> 1 GB of memory, which causes a lot of full GC.
> As I see it, the key of tokenToHost is the keyspace, and the value is a
> tokenId_to_replicateNodes map.
> 
> While trying to solve this problem, I noticed something: all keyspaces have
> the same 'tokenId_to_replicateNodes' map.
> The replication strategy of all my keyspaces is SimpleStrategy with a
> replication factor of 3.
> 
> So would it be possible, if keyspaces use the same strategy, for the
> tokenToHost maps to share a single map? That would greatly reduce the memory usage.
> 
>  thanks a lot
> 
> -- 
> --
> Joseph Gao
> PhoneNum:15210513582
> QQ: 409343351
> 
> 
> 
> -- 
> --
> Joseph Gao
> PhoneNum:15210513582
> QQ: 409343351
> 
> 
> 
> -- 
> --
> Joseph Gao
> PhoneNum:15210513582
> QQ: 409343351
> 



Re: High read latency

2015-09-25 Thread Ryan Svihla
if you run:

nodetool cfhistograms  

On the given table and that will tell you how wide your rows are getting. At 
some point you can get wide enough rows that just the physics of retrieving 
them all take some time. 


> On Sep 25, 2015, at 9:21 AM, sai krishnam raju potturi  
> wrote:
> 
> Jaydeep; since your primary key involves a clustering column, you may be 
> having pretty wide rows. The read would be sequential. The latency could be 
> acceptable, if the read were to involve really wide rows.
> 
> If your primary key was like ((a,b)) without the clustering column, it's like 
> reading a key value pair, and 40ms latency may have been a concern. 
> 
> Bottom Line : The latency depends on how wide the row is.
> 
> On Tue, Sep 22, 2015 at 1:27 PM, sai krishnam raju potturi 
> mailto:pskraj...@gmail.com>> wrote:
> thanks for the information. Posting the query too would be of help.
> 
> On Tue, Sep 22, 2015 at 11:56 AM, Jaydeep Chovatia 
> mailto:chovatia.jayd...@gmail.com>> wrote:
> Please find required details here:
> 
> -  Number of req/s
> 
> 2k reads/s
> 
> -  Schema details
> 
> create table test {
> 
> a timeuuid,
> 
> b bigint,
> 
> c int,
> 
> d int static,
> 
> e int static,
> 
> f int static,
> 
> g int static,
> 
> h int,
> 
> i text,
> 
> j text,
> 
> k text,
> 
> l text,
> 
> m set
> 
> n bigint
> 
> o bigint
> 
> p bigint
> 
> q bigint
> 
> r int
> 
> s text
> 
> t bigint
> 
> u text
> 
> v text
> 
> w text
> 
> x bigint
> 
> y bigint
> 
> z bigint,
> 
> primary key ((a, b), c)
> 
> };
> 
> -  JVM settings about the heap
> 
> Default settings
> 
> -  Execution time of the GC
> 
> Avg. 400ms. I do not see long pauses of GC anywhere in the log file.
> 
> 
> On Tue, Sep 22, 2015 at 5:34 AM, Leleu Eric  > wrote:
> Hi,
> 
>  
> 
>  
> 
> Before speaking about tuning, can you provide some additional information ?
> 
>  
> 
> -  Number of req/s
> 
> -  Schema details
> 
> -  JVM settings about the heap
> 
> -  Execution time of the GC
> 
>  
> 
> 43ms for a read latency may be acceptable according to the number of request 
> per second.
> 
>  
> 
>  
> 
> Eric
> 
>  
> 
> De : Jaydeep Chovatia [mailto:chovatia.jayd...@gmail.com 
> ] 
> Envoyé : mardi 22 septembre 2015 00:07
> À : user@cassandra.apache.org 
> Objet : High read latency
> 
>  
> 
> Hi,
> 
>  
> 
> My application issues more read requests than write, I do see that under load 
> cfstats for one of the table is quite high around 43ms
> 
>  
> 
> Local read count: 114479357
> 
> Local read latency: 43.442 ms
> 
> Local write count: 22288868
> 
> Local write latency: 0.609 ms
> 
>  
> 
>  
> 
> Here is my node configuration:
> 
> RF=3, Read/Write with QUORUM, 64GB RAM, 48 CPU core. I have only 5 GB of data 
> on each node (and for experiment purpose I stored data in tmpfs)
> 
>  
> 
> I've tried increasing concurrent_read count upto 512 but no help in read 
> latency. CPU/Memory/IO looks fine on system.
> 
>  
> 
> Any idea what should I tune?
> 
>  
> 
> Jaydeep
> 
> 
> 
> 
> 
> 

Regards,

Ryan Svihla



Re: Unable to remove dead node from cluster.

2015-09-25 Thread Nate McCall
A few other folks have reported issues with lingering dead nodes on large
clusters - Jason Brown *just* gave an excellent gossip presentation at the
summit regarding gossip optimizations for large clusters.

Gossip is in the process of being refactored (here's at least one of the
issues: https://issues.apache.org/jira/browse/CASSANDRA-9667), but it would
be worth opening an issue with as much information as you can provide to,
at the very least, have information available for others.

On Fri, Sep 25, 2015 at 7:08 AM, Jeff Jirsa 
wrote:

> The stack trace is one similar to one I recall seeing recently, but don’t
> have in front of me. This is an outside chance that is not at all certain
> to be the case.
>
> For EACH of the hundreds of nodes in your cluster, I suggest you run
>
> nodetool status | egrep "(^UN|^DN)" | wc -l
>
> and count to see if every node really has every other node in its ring
> properly.
>
> I suspect, but am not at all sure, that you have inconsistencies you’re
> not yet aware of (for example, if you expect that you have 100 nodes in the
> cluster, I’m betting that the query above returns 99 on at least one of the
> nodes).  If this is the case, please reply so that you and I can submit a
> Jira and compare our stack traces and we can find the underlying root cause
> of this together.
>
> - Jeff
>
> From: Dikang Gu
> Reply-To: "user@cassandra.apache.org"
> Date: Thursday, September 24, 2015 at 9:10 PM
> To: cassandra
>
> Subject: Re: Unable to remove dead node from cluster.
>
> @Jeff, I just use JMX to connect to one node, run
> unsafeAssassinateEndpoint, and pass in the "10.210.165.55" IP address.
>
> Yes, we have hundreds of other nodes in the nodetool status output as well.
>
> On Tue, Sep 22, 2015 at 11:31 PM, Jeff Jirsa 
> wrote:
>
>> When you run unsafeAssassinateEndpoint, to which host are you connected,
>> and what argument are you passing?
>>
>> Are there other nodes in the ring that you’re not including in the
>> ‘nodetool status’ output?
>>
>>
>> From: Dikang Gu
>> Reply-To: "user@cassandra.apache.org"
>> Date: Tuesday, September 22, 2015 at 10:09 PM
>> To: cassandra
>> Cc: "d...@cassandra.apache.org"
>> Subject: Re: Unable to remove dead node from cluster.
>>
>> ping.
>>
>> On Mon, Sep 21, 2015 at 11:51 AM, Dikang Gu  wrote:
>>
>>> I have tried all of them; none of them worked.
>>> 1. decommission: the host had a hardware issue, and I cannot connect to
>>> it.
>>> 2. removenode: there is no Host ID, so removenode did not work.
>>> 3. unsafeAssassinateEndpoint: it throws the NPE I pasted before; can
>>> we fix it?
>>>
>>> Thanks
>>> Dikang.
>>>
>>> On Mon, Sep 21, 2015 at 11:11 AM, Sebastian Estevez <
>>> sebastian.este...@datastax.com> wrote:
>>>
 Order is decommission, remove, assassinate.

 Which have you tried?
 On Sep 21, 2015 10:47 AM, "Dikang Gu"  wrote:

> Hi there,
>
> I have a dead node in our cluster, which is in a weird state right now,
> and can not be removed from cluster.
>
> The nodestatus shows:
> Datacenter: DC1
> ===
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load  Tokens  Owns  Host ID  Rack
> DN  10.210.165.55  ?     256     ?     null     r1
>
> I tried the unsafeAssassinateEndpoint, but got exception like:
> 2015-09-18_23:21:40.79760 INFO  23:21:40 InetAddress /10.210.165.55
> is now DOWN
> 2015-09-18_23:21:40.80667 ERROR 23:21:40 Exception in thread
> Thread[GossipStage:1,5,main]
> 2015-09-18_23:21:40.80668 java.lang.NullPointerException: null
> 2015-09-18_23:21:40.80669   at
> org.apache.cassandra.service.StorageService.getApplicationStateValue(StorageService.java:1584)
> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80669   at
> org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1592)
> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80670   at
> org.apache.cassandra.service.StorageService.handleStateLeft(StorageService.java:1822)
> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80671   at
> org.apache.cassandra.service.StorageService.onChange(StorageService.java:1495)
> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80671   at
> org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2121)
> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
> 2015-09-18_23:21:40.80672   at
> org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1009)
> ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]

Re: Seeing null pointer exception 2.0.14 after purging gossip state

2015-09-25 Thread Ryan Svihla
Could it be related to CASSANDRA-9180, which was fixed in 2.0.15? Although it 
really behaves like CASSANDRA-10231, which I don't see any reference to being 
in 2.0.x.

> On Sep 24, 2015, at 12:57 PM, Robert Coli  wrote:
> 
> On Mon, Sep 14, 2015 at 7:53 PM, K F wrote:
> I have cassandra 2.0.14 deployed, and after following the method described in 
> the Apache Cassandra™ 2.0 documentation to clear the gossip state of the node 
> in one of the DCs of my cluster
> 
> Why did you need to do this?
> 
> I see weird exceptions on the nodes, not many but a few per hour, for nodes 
> that have already been successfully decommissioned from the cluster. You can 
> see from the exception below that 10.0.0.1 has already been decommissioned. 
> Below is the exception snippet.
> 
> Have you done :
> 
> nodetool gossipinfo |grep SCHEMA |sort | uniq -c | sort -n
> 
> and checked for schema agreement... ?
> 
> =Rob
>  

Regards,

Ryan Svihla



Re: To batch or not to batch: A question for fast inserts

2015-09-25 Thread Ryan Svihla
Generally this is all correct, but I cannot emphasize enough how much this “just 
depends”; today I generally move people to async inserts first before trying to 
micro-optimize. Some things to keep in mind:

- Compaction is usually the limiter for most clusters, so the difference between 
async and unlogged batch often ends up being minor or, worse, non-existent, 
because the hardware and data model combination makes compaction the main 
throttle.
- If you add token awareness to your batch, you’ve basically eliminated the 
primary complaint about unlogged batches, so why not do that. When I was at 
DataStax I made similar suggestions for token-aware batching after seeing the 
perf improvements with Spark writes using unlogged batch. Several others did as 
well, so I’m not the first one with this idea.
- Write size makes, in my experience, BY FAR the largest difference in which is 
faster, and the statement count is largely irrelevant compared to the total 
payload size. Depending on the hardware etc., a good rule of thumb is that 
writes below 1k bytes tend to get really inefficient and writes over 100k tend 
to slow down total throughput. I’ll re-emphasize that this magic number has been 
different on almost every cluster I’ve tuned.

In summary, all this means is: too small or too large writes are slow, and 
unlogged batches may involve some extra hops; if you eliminate the extra hops 
with token awareness, then it just comes down to write-size optimization.
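
For illustration, a minimal sketch of that approach with the DataStax Java 
driver 2.1 (the driver line discussed elsewhere in this digest): statements are 
grouped by one of the replicas that owns their partition key, and each 
per-replica unlogged batch is flushed asynchronously once it reaches a cap. The 
keyspace, table, columns, and the 100-statement cap below are placeholders to 
tune per cluster, not anything prescribed in the thread.

import java.nio.ByteBuffer;
import java.util.*;

import com.datastax.driver.core.*;

public class ReplicaGroupedBatchWriter {

    // Placeholder cap; the thread suggests diminishing returns around ~100
    // statements, but the right value depends on payload size and cluster.
    private static final int MAX_STATEMENTS_PER_BATCH = 100;

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_ks");              // hypothetical keyspace
        Metadata metadata = cluster.getMetadata();

        // Hypothetical table; the partition key decides which replicas own the row.
        PreparedStatement insert = session.prepare(
                "INSERT INTO my_table (a, b, c, v) VALUES (?, ?, ?, ?)");

        // Sample rows; in practice these come from the application.
        List<Object[]> rows = Arrays.asList(
                new Object[] { UUID.randomUUID(), 1L, 1, "v1" },
                new Object[] { UUID.randomUUID(), 1L, 2, "v2" });

        Map<Host, BatchStatement> batches = new HashMap<Host, BatchStatement>();
        List<ResultSetFuture> futures = new ArrayList<ResultSetFuture>();

        for (Object[] r : rows) {
            BoundStatement bs = insert.bind(r);
            // The routing key is derived from the bound partition key columns.
            ByteBuffer routingKey = bs.getRoutingKey();
            Set<Host> replicas = routingKey == null
                    ? Collections.<Host>emptySet()
                    : metadata.getReplicas("my_ks", routingKey);
            if (replicas.isEmpty()) {
                futures.add(session.executeAsync(bs));           // cannot group it; send it alone
                continue;
            }
            Host owner = replicas.iterator().next();             // group by one owning replica

            BatchStatement batch = batches.get(owner);
            if (batch == null) {
                batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
                batches.put(owner, batch);
            }
            batch.add(bs);

            if (batch.size() >= MAX_STATEMENTS_PER_BATCH) {      // flush once the batch hits the cap
                futures.add(session.executeAsync(batch));
                batches.remove(owner);
            }
        }

        for (BatchStatement batch : batches.values()) {          // flush the remainders
            futures.add(session.executeAsync(batch));
        }
        for (ResultSetFuture f : futures) {                      // wait before shutting down
            f.getUninterruptibly();
        }
        cluster.close();
    }
}

Grouping by the full replica set, or by token range, works just as well; and as 
noted above, the right flush threshold is really about total payload size rather 
than statement count.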

> On Sep 24, 2015, at 5:18 PM, Eric Stevens  wrote:
> 
> > I side-tracked some punctual benchmarks and stumbled on the observations of 
> > unlogged inserts being *A LOT* faster than the async counterparts.
> 
> My own testing agrees very strongly with this.  When this topic came up on 
> this list before, there was a concern that batch coordination produces GC 
> pressure in your cluster because you're involving nodes which aren't strictly 
> speaking necessary to be involved.  
> 
> Our own testing shows some small impact on this front, but really lightweight 
> GC tuning mitigated the effects by putting a little more room in Xmn (if 
> you're still on CMS garbage collector).  On G1GC (which is what we run in 
> production) we weren't able to measure a difference. 
> 
> Our testing shows data loads being as much as 5x to 8x faster when using 
> small concurrent batches over using single statements concurrently.  We tried 
> three different concurrency models.
> 
> To save on coordinator overhead, we group the statements in our "batch" by 
> replica (using the functionality exposed by the DataStax Java driver), and do 
> essentially token aware batching.  This still has a small amount of 
> additional coordinator overhead (since the data size of the unit of work is 
> larger, and sits in memory in the coordinator longer).  We've been running 
> this way successfully for months with sustained rates north of 50,000 mutates 
> per second.  We burst much higher.
> 
> Through trial and error we determined we got diminishing returns in the realm 
> of 100 statements per token-aware batch.  It looks like your own data bears 
> that out as well.  I'm sure that's workload dependent though.
> 
> I've been disagreed with on this topic in this list in the past despite the 
> numbers I was able to post.  Nobody has shown me numbers (nor anything else 
> concrete) that contradict my position though, so I stand by it.  There's no 
> question in my mind, if your mutates are of any significant volume and you 
> care about the performance of them, token aware unlogged batching is the 
> right strategy.  When we reduce our batch sizes or switch to single async 
> statements, we fall over immediately.  
> 
> On Tue, Sep 22, 2015 at 7:54 AM, Gerard Maas wrote:
> General advice advocates for individual async inserts as the fastest way to 
> insert data into Cassandra. Our insertion mechanism is based on that model 
> and recently we have been evaluating performance, looking to measure and 
> optimize our ingestion rate.
> 
> I side-tracked some punctual benchmarks and stumbled on the observations of 
> unlogged inserts being *A LOT* faster than the async counterparts.
> 
> In our tests, unlogged batch shows increased throughput and lower cluster CPU 
> usage, so I'm wondering where the tradeoff might be.
> 
> I compiled those observations in this document that I'm sharing and opening 
> up for comments.  Are we observing some artifact or should we set the record 
> straight for unlogged batches to achieve better insertion throughput?
> 
> https://docs.google.com/document/d/1qSIJ46cmjKggxm1yxboI-KhYJh1gnA6RK-FkfUg6FrI
>  
> 
> 
> Let me know.
> 
> Kind regards, 
> 
> Gerard.
> 

Regards,

Ryan Svihla



Re: How to tune Cassandra or Java Driver to get lower latency when there are a lot of writes?

2015-09-25 Thread Ryan Svihla
Why aren’t you using saveToCassandra 
(https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md)? 
It has a number of locality-aware optimizations that will probably exceed 
your by-hand bulk loading (especially if you’re not doing it inside something 
like foreachPartition).

Also you can easily tune up and down the size of those tasks and therefore 
batches to minimize harm on the prod system.
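
For reference, a minimal sketch of that saveToCassandra path via the connector’s 
Java API; the keyspace, table, bean, and property values are assumptions for 
illustration (property names as documented for the 1.x connector line), and this 
path still does not cover DELETE.

import java.io.Serializable;
import java.util.Arrays;
import java.util.UUID;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;
import static com.datastax.spark.connector.japi.CassandraJavaUtil.mapToRow;

public class ConnectorSaveExample {

    // Hypothetical bean mapped onto a my_ks.events (id, ts, payload) table.
    public static class Event implements Serializable {
        private UUID id;
        private long ts;
        private String payload;

        public Event() { }
        public Event(UUID id, long ts, String payload) { this.id = id; this.ts = ts; this.payload = payload; }
        public UUID getId() { return id; }
        public void setId(UUID id) { this.id = id; }
        public long getTs() { return ts; }
        public void setTs(long ts) { this.ts = ts; }
        public String getPayload() { return payload; }
        public void setPayload(String payload) { this.payload = payload; }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("connector-save-example")
                .setMaster("local[2]")                               // local master for illustration only
                .set("spark.cassandra.connection.host", "127.0.0.1")
                // Output knobs for shrinking batch size / write concurrency on a busy
                // cluster (values here are illustrative; check the connector docs).
                .set("spark.cassandra.output.batch.size.bytes", "4096")
                .set("spark.cassandra.output.concurrent.writes", "2");

        JavaSparkContext sc = new JavaSparkContext(conf);

        JavaRDD<Event> events = sc.parallelize(Arrays.asList(
                new Event(UUID.randomUUID(), System.currentTimeMillis(), "hello")));

        // The connector groups rows by partition key under the hood, which is the
        // locality-aware behaviour referred to above; no hand-rolled BatchStatements.
        javaFunctions(events)
                .writerBuilder("my_ks", "events", mapToRow(Event.class))
                .saveToCassandra();

        sc.stop();
    }
}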

> On Sep 24, 2015, at 5:37 PM, Benyi Wang  wrote:
> 
> I use Spark and spark-cassandra-connector with a customized Cassandra writer 
> (spark-cassandra-connector doesn’t support DELETE). Basically the writer 
> works as follows:
> 
> 1. Bind a row in the Spark RDD to either an INSERT or DELETE PreparedStatement
> 2. Create a BatchStatement for multiple rows
> 3. Write to Cassandra.
> 
> I knew using CQLBulkOutputFormat would be better, but it doesn't support 
> DELETE.
> 
> On Thu, Sep 24, 2015 at 1:27 PM, Gerard Maas wrote:
> How are you loading the data? I mean, what insert method are you using?
> 
> On Thu, Sep 24, 2015 at 9:58 PM, Benyi Wang wrote:
> I have a cassandra cluster provides data to a web service. And there is a 
> daily batch load writing data into the cluster.
> 
> Without the batch loading, the service’s latency 99thPercentile is 3ms, but 
> during the load it jumps to 90ms.
> - I checked the Cassandra keyspace’s ReadLatency 99thPercentile, which jumps 
> to 1ms from 600 microseconds.
> - The service’s Cassandra Java driver request 99thPercentile was 90ms during 
> the load.
> The Java driver took the most time. I know the Cassandra servers are busy 
> writing, but I want to know what kinds of metrics can identify where the 
> bottleneck is, so that I can tune it.
> 
> I’m using Cassandra 2.1.8 and Cassandra Java Driver 2.1.5.
> 
> 
> 
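
For the "which metrics" question quoted above, one hedged way to see whether the 
extra latency accumulates client-side (driver queueing, saturated connection 
pools) or server-side is to sample the driver's own metrics and pool state. A 
small sketch, assuming Java driver 2.1 with its default Dropwizard metrics 
enabled and a placeholder contact point:

import com.codahale.metrics.Snapshot;
import com.codahale.metrics.Timer;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Host;
import com.datastax.driver.core.Session;

public class DriverSideLatencyCheck {

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        // ... run the normal read workload here, then sample the driver's view of it.

        // Client-side request latency as measured by the driver: it includes time
        // spent queued on the client, not just server processing time.
        Timer requests = cluster.getMetrics().getRequestsTimer();
        Snapshot snap = requests.getSnapshot();
        System.out.printf("driver-side p99: %.1f ms%n",
                snap.get99thPercentile() / 1000000.0);   // the timer records nanoseconds

        // Per-host pool state: many in-flight queries on few open connections points
        // at client-side queueing rather than server-side latency.
        Session.State state = session.getState();
        for (Host host : state.getConnectedHosts()) {
            System.out.printf("%s open=%d inFlight=%d%n", host.getAddress(),
                    state.getOpenConnections(host), state.getInFlightQueries(host));
        }

        cluster.close();
    }
}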

Regards,

Ryan Svihla



Re: High read latency

2015-09-25 Thread sai krishnam raju potturi
Jaydeep; since your primary key involves a clustering column, you may have
pretty wide rows, and the read would be sequential. The latency could be
acceptable if the read involves really wide rows.

If your primary key were just ((a, b)) without the clustering column, it would
be like reading a key-value pair, and 40ms latency would be a concern.

Bottom line: the latency depends on how wide the row is.
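
To keep reads on a wide partition bounded, one option is to restrict the slice
on the clustering column rather than reading the whole partition. A minimal
sketch against the schema quoted below (partition key (a, b), clustering column
c); the keyspace name, slice bounds, and LIMIT are illustrative placeholders:

import com.datastax.driver.core.BoundStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.utils.UUIDs;

public class BoundedSliceRead {

    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("my_ks");   // hypothetical keyspace

        // Bound the slice on the clustering column c and cap the row count,
        // instead of pulling back the whole (potentially very wide) partition.
        PreparedStatement ps = session.prepare(
                "SELECT c, s, t FROM test WHERE a = ? AND b = ? AND c >= ? AND c < ? LIMIT 100");

        BoundStatement bs = ps.bind(UUIDs.timeBased(), 42L, 0, 1000);  // illustrative key and bounds
        ResultSet rs = session.execute(bs);
        for (Row row : rs) {
            System.out.println(row.getInt("c") + " " + row.getString("s") + " " + row.getLong("t"));
        }

        cluster.close();
    }
}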

On Tue, Sep 22, 2015 at 1:27 PM, sai krishnam raju potturi <
pskraj...@gmail.com> wrote:

> thanks for the information. Posting the query too would be of help.
>
> On Tue, Sep 22, 2015 at 11:56 AM, Jaydeep Chovatia <
> chovatia.jayd...@gmail.com> wrote:
>
>> Please find required details here:
>>
>> -  Number of req/s
>>
>> 2k reads/s
>>
>> -  Schema details
>>
>> create table test (
>>   a timeuuid,
>>   b bigint,
>>   c int,
>>   d int static,
>>   e int static,
>>   f int static,
>>   g int static,
>>   h int,
>>   i text,
>>   j text,
>>   k text,
>>   l text,
>>   m set,
>>   n bigint,
>>   o bigint,
>>   p bigint,
>>   q bigint,
>>   r int,
>>   s text,
>>   t bigint,
>>   u text,
>>   v text,
>>   w text,
>>   x bigint,
>>   y bigint,
>>   z bigint,
>>   primary key ((a, b), c)
>> );
>>
>> -  JVM settings about the heap
>>
>> Default settings
>>
>> -  Execution time of the GC
>>
>> Avg. 400ms. I do not see long pauses of GC anywhere in the log file.
>>
>> On Tue, Sep 22, 2015 at 5:34 AM, Leleu Eric 
>> wrote:
>>
>>> Hi,
>>>
>>>
>>>
>>>
>>>
>>> Before speaking about tuning, can you provide some additional
>>> information ?
>>>
>>>
>>>
>>> -  Number of req/s
>>>
>>> -  Schema details
>>>
>>> -  JVM settings about the heap
>>>
>>> -  Execution time of the GC
>>>
>>>
>>>
>>> 43ms for a read latency may be acceptable according to the number of
>>> request per second.
>>>
>>>
>>>
>>>
>>>
>>> Eric
>>>
>>>
>>>
>>> *From:* Jaydeep Chovatia [mailto:chovatia.jayd...@gmail.com]
>>> *Sent:* Tuesday, September 22, 2015 00:07
>>> *To:* user@cassandra.apache.org
>>> *Subject:* High read latency
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> My application issues more read requests than writes. I do see that under
>>> load, the cfstats read latency for one of the tables is quite high, around
>>> 43ms
>>>
>>>
>>>
>>> Local read count: 114479357
>>>
>>> Local read latency: 43.442 ms
>>>
>>> Local write count: 22288868
>>>
>>> Local write latency: 0.609 ms
>>>
>>>
>>>
>>>
>>>
>>> Here is my node configuration:
>>>
>>> RF=3, Read/Write with QUORUM, 64GB RAM, 48 CPU core. I have only 5 GB of
>>> data on each node (and for experiment purpose I stored data in tmpfs)
>>>
>>>
>>>
>>> I've tried increasing the concurrent_reads count up to 512, but no help in
>>> read latency. CPU/Memory/IO looks fine on the system.
>>>
>>>
>>>
>>> Any idea what I should tune?
>>>
>>>
>>>
>>> Jaydeep
>>>
>>> --
>>>
>>> This e-mail and the documents attached are confidential and intended
>>> solely for the addressee; it may also be privileged. If you receive this
>>> e-mail in error, please notify the sender immediately and destroy it. As
>>> its integrity cannot be secured on the Internet, the Worldline liability
>>> cannot be triggered for the message content. Although the sender endeavours
>>> to maintain a computer virus-free network, the sender does not warrant that
>>> this transmission is virus-free and will not be liable for any damages
>>> resulting from any virus transmitted.
>>>
>>
>>
>


Re: Unable to remove dead node from cluster.

2015-09-25 Thread Jeff Jirsa
The stack trace is one similar to one I recall seeing recently, but don’t have 
in front of me. This is an outside chance that is not at all certain to be the 
case.

For EACH of the hundreds of nodes in your cluster, I suggest you run 

nodetool status | egrep "(^UN|^DN)" | wc -l 

and count to see if every node really has every other node in its ring 
properly. 

I suspect, but am not at all sure, that you have inconsistencies you’re not yet 
aware of (for example, if you expect that you have 100 nodes in the cluster, 
I’m betting that the query above returns 99 on at least one of the nodes).  If 
this is the case, please reply so that you and I can submit a Jira and compare 
our stack traces and we can find the underlying root cause of this together. 

- Jeff

From:  Dikang Gu
Reply-To:  "user@cassandra.apache.org"
Date:  Thursday, September 24, 2015 at 9:10 PM
To:  cassandra
Subject:  Re: Unable to remove dead node from cluster.

@Jeff, I just use JMX to connect to one node, run unsafeAssassinateEndpoint, and 
pass in the "10.210.165.55" IP address.

Yes, we have hundreds of other nodes in the nodetool status output as well.

On Tue, Sep 22, 2015 at 11:31 PM, Jeff Jirsa  wrote:
When you run unsafeAssassinateEndpoint, to which host are you connected, and 
what argument are you passing?

Are there other nodes in the ring that you’re not including in the ‘nodetool 
status’ output?


From: Dikang Gu
Reply-To: "user@cassandra.apache.org"
Date: Tuesday, September 22, 2015 at 10:09 PM
To: cassandra
Cc: "d...@cassandra.apache.org"
Subject: Re: Unable to remove dead node from cluster.

ping.

On Mon, Sep 21, 2015 at 11:51 AM, Dikang Gu  wrote:
I have tried all of them; none of them worked.
1. decommission: the host had a hardware issue, and I cannot connect to it.
2. removenode: there is no Host ID, so removenode did not work.
3. unsafeAssassinateEndpoint: it throws the NPE I pasted before; can we fix
it?

Thanks
Dikang.

On Mon, Sep 21, 2015 at 11:11 AM, Sebastian Estevez 
 wrote:

Order is decommission, remove, assassinate.

Which have you tried?

On Sep 21, 2015 10:47 AM, "Dikang Gu"  wrote:
Hi there, 

I have a dead node in our cluster, which is in a weird state right now, and can 
not be removed from cluster.

The nodestatus shows:
Datacenter: DC1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load  Tokens  Owns  Host ID  Rack
DN  10.210.165.55  ?     256     ?     null     r1

I tried the unsafeAssassinateEndpoint, but got exception like:
2015-09-18_23:21:40.79760 INFO  23:21:40 InetAddress /10.210.165.55 is now DOWN
2015-09-18_23:21:40.80667 ERROR 23:21:40 Exception in thread 
Thread[GossipStage:1,5,main]
2015-09-18_23:21:40.80668 java.lang.NullPointerException: null
2015-09-18_23:21:40.80669   at 
org.apache.cassandra.service.StorageService.getApplicationStateValue(StorageService.java:1584)
 ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80669   at 
org.apache.cassandra.service.StorageService.getTokensFor(StorageService.java:1592)
 ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80670   at 
org.apache.cassandra.service.StorageService.handleStateLeft(StorageService.java:1822)
 ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80671   at 
org.apache.cassandra.service.StorageService.onChange(StorageService.java:1495) 
~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80671   at 
org.apache.cassandra.service.StorageService.onJoin(StorageService.java:2121) 
~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80672   at 
org.apache.cassandra.gms.Gossiper.handleMajorStateChange(Gossiper.java:1009) 
~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80673   at 
org.apache.cassandra.gms.Gossiper.applyStateLocally(Gossiper.java:1113) 
~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80673   at 
org.apache.cassandra.gms.GossipDigestAck2VerbHandler.doVerb(GossipDigestAck2VerbHandler.java:49)
 ~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80673   at 
org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:62) 
~[apache-cassandra-2.1.8+git20150804.076b0b1.jar:2.1.8+git20150804.076b0b1]
2015-09-18_23:21:40.80674   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
~[na:1.7.0_45]
2015-09-18_23:21:40.80674   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
~[na:1.7.0_45]
2015-09-18_23:21:40.80674   at java.lang.Thread.run(Thread.java:744) 
~[na:1.7.0_45]
2015-09-18_2