Re: nodetool repair failure

2017-06-29 Thread Balaji Venkatesan
It did not help much. But another error I saw when I repaired the
keyspace was:

"Sync failed between /xx.xx.xx.93 and /xx.xx.xx.94" this was run from .91
node.



On Thu, Jun 29, 2017 at 4:44 PM, Akhil Mehra  wrote:

> Run the following query and see if it gives you more information:
>
> select * from system_distributed.repair_history;
>
> Also, is there any additional logging on the nodes where the error is
> coming from? It seems to be xx.xx.xx.94 for your last run.
>
>
> On 30/06/2017, at 9:43 AM, Balaji Venkatesan 
> wrote:
>
> The verify and scrub completed without any error on the keyspace. I ran it
> again in trace mode and still hit the same issue:
>
>
> [2017-06-29 21:37:45,578] Parsing UPDATE 
> system_distributed.parent_repair_history
> SET finished_at = toTimestamp(now()), successful_ranges = {'} WHERE
> parent_id=f1f10af0-5d12-11e7-8df9-59d19ef3dd23
> [2017-06-29 21:37:45,580] Preparing statement
> [2017-06-29 21:37:45,580] Determining replicas for mutation
> [2017-06-29 21:37:45,580] Sending MUTATION message to /xx.xx.xx.95
> [2017-06-29 21:37:45,580] Sending MUTATION message to /xx.xx.xx.94
> [2017-06-29 21:37:45,580] Sending MUTATION message to /xx.xx.xx.93
> [2017-06-29 21:37:45,581] REQUEST_RESPONSE message received from
> /xx.xx.xx.93
> [2017-06-29 21:37:45,581] REQUEST_RESPONSE message received from
> /xx.xx.xx.94
> [2017-06-29 21:37:45,581] Processing response from /xx.xx.xx.93
> [2017-06-29 21:37:45,581] /xx.xx.xx.94: MUTATION message received from
> /xx.xx.xx.91
> [2017-06-29 21:37:45,582] Processing response from /xx.xx.xx.94
> [2017-06-29 21:37:45,582] /xx.xx.xx.93: MUTATION message received from
> /xx.xx.xx.91
> [2017-06-29 21:37:45,582] /xx.xx.xx.95: MUTATION message received from
> /xx.xx.xx.91
> [2017-06-29 21:37:45,582] /xx.xx.xx.94: Appending to commitlog
> [2017-06-29 21:37:45,582] /xx.xx.xx.94: Adding to parent_repair_history
> memtable
> [2017-06-29 21:37:45,582] Some repair failed
> [2017-06-29 21:37:45,582] Repair command #3 finished in 1 minute 44 seconds
> error: Repair job has failed with the error message: [2017-06-29
> 21:37:45,582] Some repair failed
> -- StackTrace --
> java.lang.RuntimeException: Repair job has failed with the error message:
> [2017-06-29 21:37:45,582] Some repair failed
> at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
> at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
> at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
> at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
> at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
> at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
>
>
>
> On Thu, Jun 29, 2017 at 1:36 PM, Subroto Barua <
> sbarua...@yahoo.com.invalid> wrote:
>
>> Balaji,
>>
>> Are you repairing a specific keyspace/table? If the failure is tied to a
>> table, try 'verify' and 'scrub' options on .91... see if you get any errors.
>>
>>
>>
>>
>> On Thursday, June 29, 2017, 12:12:14 PM PDT, Balaji Venkatesan <
>> venkatesan.bal...@gmail.com> wrote:
>>
>>
>> Thanks. I tried with trace option and there is not much info. Here are
>> the few log lines just before it failed.
>>
>>
>> [2017-06-29 19:01:54,969] /xx.xx.xx.93: Sending REPAIR_MESSAGE message to
>> /xx.xx.xx.91
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
>> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message
>> to /xx.xx.xx.91
>> [2017-06-29 19:01:54,969] /xx.x

Re: timeoutexceptions with UDF causing cassandra forceful exits

2017-06-29 Thread Akhil Mehra
By default user_function_timeout_policy is set to die, i.e. warn the client and
kill the JVM. Please find below a source-code snippet that outlines the possible settings.

    /**
     * Defines what to do when a UDF ran longer than user_defined_function_fail_timeout.
     * Possible options are:
     * - 'die' - i.e. it is able to emit a warning to the client before the Cassandra Daemon will shut down.
     * - 'die_immediate' - shut down the C* daemon immediately (effectively prevents the chance that the client will receive a warning).
     * - 'ignore' - just log - the most dangerous option.
     * (Only valid if enable_user_defined_functions_threads == true)
     */
    public UserFunctionTimeoutPolicy user_function_timeout_policy = UserFunctionTimeoutPolicy.die;
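The same knobs are exposed through cassandra.yaml. A hedged sketch of the relevant settings as of Cassandra 3.x (the timeout values shown are the shipped defaults, in milliseconds, not tuning advice):

```yaml
# UDFs must run on separate threads for the timeout policy to apply
enable_user_defined_functions_threads: true

# Log a warning when a UDF exceeds this runtime
user_defined_function_warn_timeout: 500

# Apply user_function_timeout_policy when a UDF exceeds this runtime
user_defined_function_fail_timeout: 1500

# 'die', 'die_immediate', or 'ignore', as described above
user_function_timeout_policy: die
```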

To answer your question: yes, it is normal for Cassandra to shut down due to a
rogue UDF.

Warm Regards,
Akhil Mehra

> On 30/06/2017, at 11:17 AM, Gopal, Dhruva  wrote:
> 
> Hi –
>   Is it normal for Cassandra to be shut down forcefully on timeout exceptions
> when using UDFs? We are admittedly trying some load tests on our dev 
> environments which may be somewhat constrained, but didn’t expect to see 
> forceful shutdowns such as these when we ran our tests. We’re running 
> Cassandra 3.10. Dev environment is an MBP (Core i7/16GB RAM). Sample error 
> below – feedback will be much appreciated.
>  
> ERROR [NonPeriodicTasks:1] 2017-06-29 10:48:54,476 
> JVMStabilityInspector.java:142 - JVM state determined to be unstable.  
> Exiting forcefully due to:
> java.util.concurrent.TimeoutException: User defined function reporting.latest 
> : (map>>, bigint, timestamp, boolean) -> 
> map>> ran longer than 1500ms - will stop 
> Cassandra VM
> at org.apache.cassandra.cql3.functions.UDFunction.async(UDFunction.java:483) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.cql3.functions.UDFunction.executeAsync(UDFunction.java:398) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.cql3.functions.UDFunction.execute(UDFunction.java:298) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.cql3.selection.ScalarFunctionSelector.getOutput(ScalarFunctionSelector.java:61) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.cql3.selection.Selection$SelectionWithProcessing$1.getOutputRow(Selection.java:592) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.cql3.selection.Selection$ResultSetBuilder.getOutputRow(Selection.java:430) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.cql3.selection.Selection$ResultSetBuilder.build(Selection.java:417) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.cql3.statements.SelectStatement.process(SelectStatement.java:763) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.cql3.statements.SelectStatement.processResults(SelectStatement.java:400) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:378) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:251) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:79) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:217) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:523) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:500) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.transport.messages.ExecuteMessage.execute(ExecuteMessage.java:146) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:517) ~[apache-cassandra-3.10.jar:3.10]
> at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:410) ~[apache-cassandra-3.10.jar:3.10]
> at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
> at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
> at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
> at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) ~[netty

Re: nodetool repair failure

2017-06-29 Thread Akhil Mehra
Run the following query and see if it gives you more information:

select * from system_distributed.repair_history;

Also, is there any additional logging on the nodes where the error is coming
from? It seems to be xx.xx.xx.94 for your last run.
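If repair_history does not surface the cause, a couple of follow-up checks may help. This is a hedged sketch: the keyspace name and log path are placeholders, so adjust them for your cluster.

```shell
# Pull repair outcomes (status, exception) for the keyspace being repaired
cqlsh -e "SELECT keyspace_name, columnfamily_name, status, exception_message
          FROM system_distributed.repair_history
          WHERE keyspace_name = 'my_keyspace' ALLOW FILTERING;"

# On the node the failure points at (xx.xx.xx.94 here), scan the server log
# around the repair window for validation/sync errors
grep -iE 'repair|validation|sync' /var/log/cassandra/system.log | tail -n 50
```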


> On 30/06/2017, at 9:43 AM, Balaji Venkatesan  
> wrote:
> 
> The verify and scrub completed without any error on the keyspace. I ran it again
> in trace mode and still hit the same issue:
> 
> 
> [2017-06-29 21:37:45,578] Parsing UPDATE 
> system_distributed.parent_repair_history SET finished_at = 
> toTimestamp(now()), successful_ranges = {'} WHERE 
> parent_id=f1f10af0-5d12-11e7-8df9-59d19ef3dd23
> [2017-06-29 21:37:45,580] Preparing statement
> [2017-06-29 21:37:45,580] Determining replicas for mutation
> [2017-06-29 21:37:45,580] Sending MUTATION message to /xx.xx.xx.95
> [2017-06-29 21:37:45,580] Sending MUTATION message to /xx.xx.xx.94
> [2017-06-29 21:37:45,580] Sending MUTATION message to /xx.xx.xx.93
> [2017-06-29 21:37:45,581] REQUEST_RESPONSE message received from /xx.xx.xx.93
> [2017-06-29 21:37:45,581] REQUEST_RESPONSE message received from /xx.xx.xx.94
> [2017-06-29 21:37:45,581] Processing response from /xx.xx.xx.93
> [2017-06-29 21:37:45,581] /xx.xx.xx.94: MUTATION message received from 
> /xx.xx.xx.91
> [2017-06-29 21:37:45,582] Processing response from /xx.xx.xx.94
> [2017-06-29 21:37:45,582] /xx.xx.xx.93: MUTATION message received from 
> /xx.xx.xx.91
> [2017-06-29 21:37:45,582] /xx.xx.xx.95: MUTATION message received from 
> /xx.xx.xx.91
> [2017-06-29 21:37:45,582] /xx.xx.xx.94: Appending to commitlog
> [2017-06-29 21:37:45,582] /xx.xx.xx.94: Adding to parent_repair_history 
> memtable
> [2017-06-29 21:37:45,582] Some repair failed
> [2017-06-29 21:37:45,582] Repair command #3 finished in 1 minute 44 seconds
> error: Repair job has failed with the error message: [2017-06-29 
> 21:37:45,582] Some repair failed
> -- StackTrace --
> java.lang.RuntimeException: Repair job has failed with the error message: 
> [2017-06-29 21:37:45,582] Some repair failed
>   at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
>   at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
>   at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
>   at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
>   at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
>   at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
> 
> 
> 
> On Thu, Jun 29, 2017 at 1:36 PM, Subroto Barua  wrote:
> Balaji,
> 
> Are you repairing a specific keyspace/table? If the failure is tied to a
> table, try 'verify' and 'scrub' options on .91... see if you get any errors.
> 
> 
> 
> 
> On Thursday, June 29, 2017, 12:12:14 PM PDT, Balaji Venkatesan 
> venkatesan.bal...@gmail.com> wrote:
> 
> 
> Thanks. I tried with trace option and there is not much info. Here are the 
> few log lines just before it failed.
> 
> 
> [2017-06-29 19:01:54,969] /xx.xx.xx.93: Sending REPAIR_MESSAGE message to 
> /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to 
> /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to 
> /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to 
> /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending 

timeoutexceptions with UDF causing cassandra forceful exits

2017-06-29 Thread Gopal, Dhruva
Hi –
  Is it normal for Cassandra to be shut down forcefully on timeout exceptions when
using UDFs? We are admittedly trying some load tests on our dev environments 
which may be somewhat constrained, but didn’t expect to see forceful shutdowns 
such as these when we ran our tests. We’re running Cassandra 3.10. Dev 
environment is an MBP (Core i7/16GB RAM). Sample error below – feedback will be 
much appreciated.

ERROR [NonPeriodicTasks:1] 2017-06-29 10:48:54,476 
JVMStabilityInspector.java:142 - JVM state determined to be unstable.  Exiting 
forcefully due to:
java.util.concurrent.TimeoutException: User defined function reporting.latest : 
(map>>, bigint, timestamp, boolean) -> 
map>> ran longer than 1500ms - will stop 
Cassandra VM
at org.apache.cassandra.cql3.functions.UDFunction.async(UDFunction.java:483) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.cql3.functions.UDFunction.executeAsync(UDFunction.java:398) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.cql3.functions.UDFunction.execute(UDFunction.java:298) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.cql3.selection.ScalarFunctionSelector.getOutput(ScalarFunctionSelector.java:61) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.cql3.selection.Selection$SelectionWithProcessing$1.getOutputRow(Selection.java:592) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.cql3.selection.Selection$ResultSetBuilder.getOutputRow(Selection.java:430) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.cql3.selection.Selection$ResultSetBuilder.build(Selection.java:417) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.cql3.statements.SelectStatement.process(SelectStatement.java:763) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.cql3.statements.SelectStatement.processResults(SelectStatement.java:400) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:378) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:251) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.cql3.statements.SelectStatement.execute(SelectStatement.java:79) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.cql3.QueryProcessor.processStatement(QueryProcessor.java:217) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:523) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.cql3.QueryProcessor.processPrepared(QueryProcessor.java:500) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.transport.messages.ExecuteMessage.execute(ExecuteMessage.java:146) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:517) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.transport.Message$Dispatcher.channelRead0(Message.java:410) ~[apache-cassandra-3.10.jar:3.10]
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:366) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext.access$600(AbstractChannelHandlerContext.java:35) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
at io.netty.channel.AbstractChannelHandlerContext$7.run(AbstractChannelHandlerContext.java:357) ~[netty-all-4.0.39.Final.jar:4.0.39.Final]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_101]
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:162) ~[apache-cassandra-3.10.jar:3.10]
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:109) ~[apache-cassandra-3.10.jar:3.10]
at java.lang.Thread.run(Thread.java:745) ~[na:1.8.0_101]

Regards,
DHRUVA GOPAL
sr. MANAGER, ENGINEERING
REPORTING, ANALYTICS AND BIG DATA
+1 408.325.2011 WORK
+1 408.219.1094 MOBILE
UNITED STATES
dhruva.go...@aspect.com
aspect.com


Re: nodetool removenode causing the schema out of sync

2017-06-29 Thread Jai Bheemsen Rao Dhanwada
Thanks Jeff,

Can you please suggest what value to tweak from the Cassandra side?

On Thu, Jun 29, 2017 at 2:53 PM, Jeff Jirsa  wrote:

>
>
> On 2017-06-29 13:45 (-0700), Jai Bheemsen Rao Dhanwada <
> jaibheem...@gmail.com> wrote:
> > Hello Jeff,
> >
> > Sorry the Version I am using 2.1.16, my first email had typo.
> > When I say schema out of sync
> >
> > 1. nodetool describecluster shows the same schema version on all nodes.
>
> Ok got it, this is what I was most concerned with.
>
> > 2. nodetool removenode, shows the node down messages in the logs
> > 3. nodetool describecluster during this 1-2 mins shows several nodes as
> > UNREACHABLE and recovers within a minute or two.
>
> This is likely due to overhead of streaming - you're probably running
> pretty close to your tipping point, and your streaming throughput creates
> enough GC pressure on the destinations to make them flap a bit. If you use
> the streaming throughput throttle, you may be able to help mitigate that
> somewhat (at the cost of speed).
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: nodetool removenode causing the schema out of sync

2017-06-29 Thread Jeff Jirsa


On 2017-06-29 13:45 (-0700), Jai Bheemsen Rao Dhanwada  
wrote: 
> Hello Jeff,
> 
> Sorry the Version I am using 2.1.16, my first email had typo.
> When I say schema out of sync
> 
> 1. nodetool describecluster shows the same schema version on all nodes.

Ok got it, this is what I was most concerned with. 

> 2. nodetool removenode, shows the node down messages in the logs
> 3. nodetool describecluster during this 1-2 mins shows several nodes as
> UNREACHABLE and recovers within a minute or two.

This is likely due to overhead of streaming - you're probably running pretty 
close to your tipping point, and your streaming throughput creates enough GC 
pressure on the destinations to make them flap a bit. If you use the streaming 
throughput throttle, you may be able to help mitigate that somewhat (at the 
cost of speed).
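The throttle mentioned above is adjustable at runtime with nodetool; a hedged sketch (the 100 MB/s figure is an arbitrary example, and the setting is per node, so apply it on each node):

```shell
# Check the current streaming throttle (MB/s; 0 means unthrottled)
nodetool getstreamthroughput

# Lower the outbound streaming rate before running removenode,
# trading removal speed for less GC pressure on receiving nodes
nodetool setstreamthroughput 100
```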



-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: nodetool repair failure

2017-06-29 Thread Balaji Venkatesan
The verify and scrub completed without any error on the keyspace. I ran it again
in trace mode and still hit the same issue:


[2017-06-29 21:37:45,578] Parsing UPDATE
system_distributed.parent_repair_history SET finished_at =
toTimestamp(now()), successful_ranges = {'} WHERE
parent_id=f1f10af0-5d12-11e7-8df9-59d19ef3dd23
[2017-06-29 21:37:45,580] Preparing statement
[2017-06-29 21:37:45,580] Determining replicas for mutation
[2017-06-29 21:37:45,580] Sending MUTATION message to /xx.xx.xx.95
[2017-06-29 21:37:45,580] Sending MUTATION message to /xx.xx.xx.94
[2017-06-29 21:37:45,580] Sending MUTATION message to /xx.xx.xx.93
[2017-06-29 21:37:45,581] REQUEST_RESPONSE message received from
/xx.xx.xx.93
[2017-06-29 21:37:45,581] REQUEST_RESPONSE message received from
/xx.xx.xx.94
[2017-06-29 21:37:45,581] Processing response from /xx.xx.xx.93
[2017-06-29 21:37:45,581] /xx.xx.xx.94: MUTATION message received from
/xx.xx.xx.91
[2017-06-29 21:37:45,582] Processing response from /xx.xx.xx.94
[2017-06-29 21:37:45,582] /xx.xx.xx.93: MUTATION message received from
/xx.xx.xx.91
[2017-06-29 21:37:45,582] /xx.xx.xx.95: MUTATION message received from
/xx.xx.xx.91
[2017-06-29 21:37:45,582] /xx.xx.xx.94: Appending to commitlog
[2017-06-29 21:37:45,582] /xx.xx.xx.94: Adding to parent_repair_history
memtable
[2017-06-29 21:37:45,582] Some repair failed
[2017-06-29 21:37:45,582] Repair command #3 finished in 1 minute 44 seconds
error: Repair job has failed with the error message: [2017-06-29
21:37:45,582] Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message:
[2017-06-29 21:37:45,582] Some repair failed
at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)



On Thu, Jun 29, 2017 at 1:36 PM, Subroto Barua 
wrote:

> Balaji,
>
> Are you repairing a specific keyspace/table? If the failure is tied to a
> table, try 'verify' and 'scrub' options on .91... see if you get any errors.
>
>
>
>
> On Thursday, June 29, 2017, 12:12:14 PM PDT, Balaji Venkatesan <
> venkatesan.bal...@gmail.com> wrote:
>
>
> Thanks. I tried with trace option and there is not much info. Here are the
> few log lines just before it failed.
>
>
> [2017-06-29 19:01:54,969] /xx.xx.xx.93: Sending REPAIR_MESSAGE message to
> /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message
> to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message
> to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message
> to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message
> to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message
> to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message
> to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message
> to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message
> to /xx.xx.xx.91
> [2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message
> to /xx.xx.xx.91
> [2017

Re: nodetool removenode causing the schema out of sync

2017-06-29 Thread Jai Bheemsen Rao Dhanwada
Hello Jeff,

Sorry the Version I am using 2.1.16, my first email had typo.
When I say schema out of sync

1. nodetool describecluster shows the same schema version on all nodes.
2. nodetool removenode, shows the node down messages in the logs
3. nodetool describecluster during this 1-2 mins shows several nodes as
UNREACHABLE and recovers within a minute or two.

On Thu, Jun 29, 2017 at 12:51 PM, Jeff Jirsa  wrote:

>
> 2.1.16 is old, but it's not as old as 2.1.6, which is what you originally
> put, and would be much more concerning.
>
> It is true, however, that 'removenode' involves streaming data, and
> streaming data can be GC intensive (especially with compression enabled),
> which means if your cluster is on the edge of health you may cause it to
> teeter over the edge during streaming, causing nodes to flap (the DOWN
> messages in the logs). That doesn't really explain the schema change,
> though - how confident are you that the schema was properly in sync prior
> to the removenode?
>
> - Jeff
>
> On 2017-06-29 09:49 (-0700), Jai Bheemsen Rao Dhanwada <
> jaibheem...@gmail.com> wrote:
> > Hello Jeff,
> >
> > Yes 2.1.16 is old version, and we are planning to upgrade in few months.
> >
> > Only the gossiper info is logged stating that it marked several nodes
> down
> > and nothing else.
> >
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Cassandra Cluster Expansion Criteria

2017-06-29 Thread Jeff Jirsa

50% disk free is really only required with STCS (in size tiered compaction, if 
you have 4 files of a similar size, they'll be joined together - there are 
theoretically times when all of your data is in 4 files of the same size, and 
to join them together you'll temporarily double your disk space). With LCS (and 
TWCS), you should be able to go to 70% or so, most of the time, because that 
"join everything together" compaction never happens in those strategies.
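The STCS worst case described above is easy to sanity-check with a little arithmetic. A minimal sketch (the SSTable sizes are made-up numbers, not measurements from any real cluster):

```python
def stcs_worst_case_headroom(sstable_sizes_gb):
    """Worst-case temporary disk need under size-tiered compaction:
    the 4 largest similar-size SSTables are rewritten into one, so
    their combined size exists twice (inputs + new output) until the
    inputs are deleted."""
    biggest_tier = sorted(sstable_sizes_gb, reverse=True)[:4]
    return sum(biggest_tier)

# Hypothetical node: four 400 GB SSTables plus some smaller ones.
sizes = [400, 400, 400, 400, 50, 20]
extra = stcs_worst_case_headroom(sizes)
print(extra)               # 1600 GB of temporary headroom needed
print(extra / sum(sizes))  # ~0.96, i.e. close to 100% of current data
```

This is why the 50% free-disk rule of thumb applies to STCS but can be relaxed under LCS/TWCS, where no single compaction joins the biggest files together.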

Keep an eye on CPU load and latencies - if you see it trending in the wrong 
direction beyond what you can tolerate in your SLA, you may want to consider 
scaling.


On 2017-06-29 06:48 (-0700), Nitan Kainth  wrote: 
> Ideally you should maintain 50% free disk space.
> SLA and Node load is also very important to make the decision.
> 
> > On Jun 29, 2017, at 6:45 AM, ZAIDI, ASAD A  wrote:
> > 
> > Hello Folks,
> >  
> > I’m on a Cassandra 2.2.8 cluster with 14 nodes, each with around 2TB of
> > data volume. I’m looking for criteria or data points that can help me
> > decide when, or if, I should add more nodes to the cluster, and by how
> > many nodes.
> >  
> > I’ll really appreciate if you guys can share your insights.
> >  
> > Thanks/Asad
> 
> 

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: nodetool repair failure

2017-06-29 Thread Subroto Barua
Balaji,
Are you repairing a specific keyspace/table? If the failure is tied to a table,
try 'verify' and 'scrub' options on .91... see if you get any errors.
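Those checks can be scoped to the suspect table. A hedged sketch (keyspace and table names are placeholders):

```shell
# Verify on-disk checksums for the table the repair keeps failing on
nodetool verify my_keyspace my_table

# Rebuild any broken SSTables; scrub snapshots the table first by default
nodetool scrub my_keyspace my_table
```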



On Thursday, June 29, 2017, 12:12:14 PM PDT, Balaji Venkatesan 
 wrote:

Thanks. I tried with trace option and there is not much info. Here are the few 
log lines just before it failed.

[2017-06-29 19:01:54,969] /xx.xx.xx.93: Sending REPAIR_MESSAGE message to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to /xx.xx.xx.91
[2017-06-29 19:02:04,842] Some repair failed
[2017-06-29 19:02:04,848] Repair command #1 finished in 1 minute 2 seconds
error: Repair job has failed with the error message: [2017-06-29 19:02:04,842] Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2017-06-29 19:02:04,842] Some repair failed
at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)


FYI, I am running the repair from the xx.xx.xx.91 node, and it's a 5-node
cluster (xx.xx.xx.91-xx.xx.xx.95).
On Wed, Jun 28, 2017 at 5:16 PM, Akhil Mehra  wrote:

nodetool repair has a trace option 
nodetool repair -tr yourkeyspacename
see if that provides you with additional information.
Regards,
Akhil

On 28/06/2017, at 2:25 AM, Balaji Venkatesan  
wrote:

We use Apache Cassandra 3.10-13 

On Jun 26, 2017 8:41 PM, "Michael Shuler"  wrote:

What version of Cassandra?

--
Michael

On 06/26/2017 09:53 PM, Balaji Venkatesan wrote:
> Hi All,
>
> When I run nodetool repair on a keyspace I constantly get a "Some repair
> failed" error; there is not sufficient info to debug further. Any help?
>
> Here is the stacktrace
>
> ==========
> [2017-06-27 02:44:34,275] Some repair failed
> [2017-06-27 02:44:34,279] Repair command #3 finished in 33 seconds
> error: Repair job has failed with the error message: [2017-06-27
> 02:44:34,275] Some repair failed
> -- StackTrace --
> java.lang.RuntimeException: Repair job has failed with the error
> message: [2017-06-27 02:44:34,275] Some repair failed
> at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
> at
> org.apache.cassandra.utils.progress.jmx.JMXNotificationProgre

Re: Repairing question

2017-06-29 Thread Javier Canillas
Thanks for all the responses. It's much clearer now.

2017-06-26 0:59 GMT-03:00 Paulo Motta :

> > Not sure since what version, but in 3.10 at least (I think it's since 3.x
> > started) full repair does do anti-compactions and marks sstables as
> > repaired.
>
> Thanks for the correction; anti-compaction after full repairs was
> added in 2.2 (CASSANDRA-7586) but removed in 4.0 by CASSANDRA-9143. Just
> for completeness, anti-compaction is not run when the following
> options are specified:
> -st/-et
> --local or --dc
> --full on 4.0+
>
> 2017-06-25 16:35 GMT-05:00 Cameron Zemek :
> >> When you perform a non-incremental repair, data is repaired but not
> >> marked as repaired, since this requires anti-compaction to be run.
> >
> > Not sure since what version, but in 3.10 at least (I think it's since 3.x
> > started) full repair does do anti-compactions and marks sstables as
> > repaired.
> >
> > On 23 June 2017 at 06:30, Paulo Motta  wrote:
> >>
> >> > This attribute seems to be only modified when executing "nodetool
> >> > repair [keyspace] [table]", but not when executing with other options
> >> > like --in-local-dc or --pr.
> >>
> >> This is correct behavior because this metric actually represents the
> >> percentage of SSTables incrementally repaired - and marked as repaired
> >> - which doesn't happen when you execute a non-incremental repair
> >> (--full, --in-local-dc, --pr). When you perform a non-incremental
> >> repair, data is repaired but not marked as repaired, since this
> >> requires anti-compaction to be run.
> >>
> >> Actually this "percent repaired" display name is a bit misleading,
> >> since it sounds like data needs to be repaired while you could be
> >> running non-incremental repairs and still have data 100% repaired, so
> >> we should probably open a ticket to rename that to "Percent
> >> incrementally repaired" or similar.
> >>
> >>
> >> 2017-06-22 14:38 GMT-05:00 Javier Canillas :
> >> > Hi,
> >> >
> >> > I have been thinking about scheduling a daily routine to force repairs
> >> > on a
> >> > cluster to maintain its health.
> >> >
> >> > I saw that by running a nodetool tablestats [keyspace] there is an
> >> > attribute
> >> > called "Percent repaired" that show the percentage of data repaired on
> >> > the
> >> > each table.
> >> >
> >> > This attribute seems to be only modified when executing "nodetool
> >> > repair [keyspace] [table]", but not when executing with other options
> >> > like --in-local-dc or --pr.
> >> >
> >> > My main concern is about building the whole Merkle tree for a big
> >> > table. I have also checked repairing by token ranges, but this also
> >> > seems not to modify this attribute of the table.
> >> >
> >> > Is this expected behavior? Or is there something missing in the
> >> > code that needs to be fixed?
> >> >
> >> > My "maintenance" script would be calling nodetool tablestats per each
> >> > keyspace that has replication_factor > 0 to check for the value of the
> >> > "Percent repaired" of each table and, in case it is below some
> >> > threshold, I
> >> > would execute a repair on it.
> >> >
> >> > Any ideas?
> >> >
> >> > Thanks in advance.
> >> >
> >> > Javier.
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> >> For additional commands, e-mail: dev-h...@cassandra.apache.org
> >>
> >
>
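Javier's idea above — parse `nodetool tablestats` for "Percent repaired" and repair any table below a threshold — can be sketched in a few lines. This is only an illustrative sketch under assumptions: the exact `nodetool tablestats` output layout varies between Cassandra versions (the sample format below is assumed), and the 90% threshold is arbitrary:

```python
def tables_needing_repair(tablestats_output: str, threshold: float = 90.0):
    """Scan `nodetool tablestats` output and return (keyspace, table, pct)
    tuples for tables whose 'Percent repaired' falls below the threshold."""
    flagged = []
    keyspace = table = None
    for raw in tablestats_output.splitlines():
        line = raw.strip()
        if line.startswith("Keyspace"):
            # e.g. "Keyspace : my_ks" (spacing varies by version)
            keyspace = line.split(":", 1)[1].strip()
        elif line.startswith("Table:"):
            table = line.split(":", 1)[1].strip()
        elif line.startswith("Percent repaired:") and keyspace and table:
            pct = float(line.split(":", 1)[1])
            if pct < threshold:
                flagged.append((keyspace, table, pct))
    return flagged
```

A wrapper could then run `nodetool repair <keyspace> <table>` for each flagged table — but, as discussed in the thread, whether that is meaningful depends on whether you run incremental repairs, since only those update the metric.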


Re: nodetool removenode causing the schema out of sync

2017-06-29 Thread Jeff Jirsa

2.1.16 is old, but it's not as old as 2.1.6, which is what you originally put, 
and would be much more concerning.

It is true, however, that 'removenode' involves streaming data, and streaming 
data can be GC intensive (especially with compression enabled), which means if 
your cluster is on the edge of health you may cause it to teeter over the edge 
during streaming, causing nodes to flap (the DOWN messages in the logs). That 
doesn't really explain the schema change, though - how confident are you that 
the schema was properly in sync prior to the removenode?

- Jeff

On 2017-06-29 09:49 (-0700), Jai Bheemsen Rao Dhanwada  
wrote: 
> Hello Jeff,
> 
> Yes, 2.1.16 is an old version, and we are planning to upgrade in a few months.
> 
> Only the gossiper info is logged stating that it marked several nodes down
> and nothing else.
> 
> 

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: nodetool repair failure

2017-06-29 Thread Balaji Venkatesan
Thanks. I tried with trace option and there is not much info. Here are the
few log lines just before it failed.


[2017-06-29 19:01:54,969] /xx.xx.xx.93: Sending REPAIR_MESSAGE message to
/xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Appending to commitlog
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Adding to repair_history memtable
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Enqueuing response to /xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to
/xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to
/xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to
/xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to
/xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to
/xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to
/xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to
/xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to
/xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to
/xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to
/xx.xx.xx.91
[2017-06-29 19:01:54,969] /xx.xx.xx.92: Sending REQUEST_RESPONSE message to
/xx.xx.xx.91
[2017-06-29 19:02:04,842] Some repair failed
[2017-06-29 19:02:04,848] Repair command #1 finished in 1 minute 2 seconds
error: Repair job has failed with the error message: [2017-06-29
19:02:04,842] Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message:
[2017-06-29 19:02:04,842] Some repair failed
at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarder.java:583)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:533)
at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)



FYI, I am running the repair from the xx.xx.xx.91 node, and it's a 5-node
cluster (xx.xx.xx.91-xx.xx.xx.95).
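When trace output is as repetitive as the above, grouping events by source node can make it easier to spot which replica goes quiet before the "Some repair failed" line. A minimal sketch — illustrative only, not an official tool; it just assumes the `[timestamp] /ip: message` line shape seen in the trace above:

```python
from collections import Counter

def events_per_node(trace_lines):
    """Count repair-trace events per source node. Lines look like:
      [2017-06-29 19:01:54,969] /10.0.0.92: Appending to commitlog
    Coordinator-local lines carry no '/ip:' prefix and are bucketed
    under 'coordinator'."""
    counts = Counter()
    for line in trace_lines:
        rest = line.split("] ", 1)[-1]          # drop the timestamp
        if rest.startswith("/") and ": " in rest:
            counts[rest.split(":", 1)[0]] += 1  # e.g. '/10.0.0.92'
        else:
            counts["coordinator"] += 1
    return counts
```

Comparing the per-node counts across runs can hint at which replica stops participating (here, whichever of .93/.94/.95 is under-represented); the IPs in the docstring are placeholders.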

On Wed, Jun 28, 2017 at 5:16 PM, Akhil Mehra  wrote:

> nodetool repair has a trace option
>
> nodetool repair -tr yourkeyspacename
>
> see if that provides you with additional information.
>
> Regards,
> Akhil
>
> On 28/06/2017, at 2:25 AM, Balaji Venkatesan 
> wrote:
>
>
> We use Apache Cassandra 3.10-13
>
> On Jun 26, 2017 8:41 PM, "Michael Shuler"  wrote:
>
> What version of Cassandra?
>
> --
> Michael
>
> On 06/26/2017 09:53 PM, Balaji Venkatesan wrote:
> > Hi All,
> >
> > When I run nodetool repair on a keyspace I constantly get a "Some repair
> > failed" error; there is not sufficient info to debug further. Any help?
> >
> > Here is the stacktrace
> >
> > ==
> > [2017-06-27 02:44:34,275] Some repair failed
> > [2017-06-27 02:44:34,279] Repair command #3 finished in 33 seconds
> > error: Repair job has failed with the error message: [2017-06-27
> > 02:44:34,275] Some repair failed
> > -- StackTrace --
> > java.lang.RuntimeException: Repair job has failed with the error
> > message: [2017-06-27 02:44:34,275] Some repair failed
> > at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
> > at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
> > at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(ClientNotifForwarde

Re: nodetool removenode causing the schema out of sync

2017-06-29 Thread Jai Bheemsen Rao Dhanwada
Hello Jeff,

Yes, 2.1.16 is an old version, and we are planning to upgrade in a few months.

Only the gossiper info is logged stating that it marked several nodes down
and nothing else.


On Wed, Jun 28, 2017 at 8:15 PM, Jeff Jirsa  wrote:

>
>
> On 2017-06-28 18:51 (-0700), Jai Bheemsen Rao Dhanwada <
> jaibheem...@gmail.com> wrote:
> > Hello,
> >
> > We are using C* version 2.1.6, and lately we are seeing an issue where
> > nodetool removenode causes the schema to go out of sync and causes
> > clients to fail for 2-3 minutes.
> >
> > C* cluster is in 8 Datacenters with RF=3 and has 50 nodes.
> > We have 130 Keyspaces and 500 CF in the cluster.
> >
> > Here are the sequence of actions that were performed.
> >
> > 1. One node failed abruptly in the cluster due to hardware issue
> > 2. Remove the node from the cluster using nodetool removenode from a live
> > node.
> > 3. Immediately I see all the nodes' schema go out of sync, and in the
> > logs of all the C* nodes I see them mark a few other (random) nodes as
> > down, eventually recovering after 2 minutes.
> >
> > Logs in the nodes:
> >
> > INFO  [GossipTasks:1] 2017-06-27 20:34:39,707 Gossiper.java:1008 -
> > InetAddress /10.10.10.20 is now DOWN
> > INFO  [GossipTasks:1] 2017-06-27 20:34:39,714 Gossiper.java:1008 -
> > InetAddress /10.10.11.14 is now DOWN
> >
> > Anyone have an idea why removenode causes the cluster to go out of
> > sync?
> >
>
> That's not really expected - I've never seen behavior like that. However,
> 2.1.6 is pretty old (just about 2 years, give or take), there have been
> hundreds or (more likely) thousands of fixes since then.
>
> Is the gossiper line the only thing logged? Anything about invalid
> generations?
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
> For additional commands, e-mail: user-h...@cassandra.apache.org
>
>


Re: Cassandra Cluster Expansion Criteria

2017-06-29 Thread Anuj Wadehra
Hi Asad,
First, you need to understand the factors impacting cluster capacity. Some of 
the important factors to be considered while doing capacity planning of 
Cassandra are:
1.  Compaction strategy: It impacts disk space requirements and IO/CPU/memory 
overhead for compactions.
2. Replication Factor: Impacts disk space.
3. Business SLAs and Data Access patterns (read/write)
4. Type of storage: SSD will ensure that IO is rarely a problem. You may become 
CPU bound first.
Some trigger points for expanding your cluster:
1. Disk crunch. Unable to meet free disk requirements for various compaction 
strategies.
2. Overloaded nodes: tpstats/logs show frequent dropped mutations;
consistently high CPU load.
3. Business SLAs not being met due to an increase in reads/writes per second.
Please note that this is not an exhaustive list.
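The disk-crunch trigger above can be turned into a rough back-of-the-envelope estimate. The sketch below is illustrative only: the 50% free-disk default reflects the common size-tiered-compaction rule of thumb (a compaction may temporarily need about as much space as the data it compacts), and the example numbers are made up, not taken from Asad's cluster:

```python
import math

def nodes_needed(raw_data_tb: float, replication_factor: int,
                 node_disk_tb: float, max_disk_utilization: float = 0.5) -> int:
    """Rough node count so each node stays under a target disk utilization.
    The 0.5 default keeps roughly half the disk free as compaction headroom
    (a common STCS rule of thumb)."""
    usable_per_node = node_disk_tb * max_disk_utilization
    return math.ceil(raw_data_tb * replication_factor / usable_per_node)
```

For example, 10 TB of raw data at RF=3 on 4 TB disks gives ceil(30 / 2) = 15 nodes. Real sizing also has to account for the SLA, storage-type, and compaction-strategy factors listed above.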
Thanks,
Anuj


Sent from Yahoo Mail on Android
  On Thu, Jun 29, 2017 at 7:15 PM, ZAIDI, ASAD A wrote:

Hello Folks,

I’m on Cassandra 2.2.8 cluster with 14 nodes, each with around 2TB of data
volume. I’m looking for a criteria /or data points that can help me decide
when or if I should add more nodes to the cluster and by how many nodes.

I’ll really appreciate if you guys can share your insights.

Thanks/Asad


jbod disk usage unequal

2017-06-29 Thread Micha
Hi,

I use a jbod setup (2 * 1TB) and the distribution is a little bit
unequal on my three nodes:
270GB and 540GB
150GB and 580GB
290GB and 500GB

SSTable size varies between 2GB and 130GB.

Is it possible to move sstables from one disk to another to balance the
disk usage?
Or is a RAID-0 setup the only option for balanced disk usage?
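If it helps to put a number on the imbalance while deciding, here is a trivial sketch (assuming the per-disk figures above are in GB — an assumption, since the units are not fully stated):

```python
def disk_skew(per_disk_gb):
    """Return (fullest/emptiest ratio, fraction of data on the fullest disk)."""
    total = sum(per_disk_gb)
    return max(per_disk_gb) / min(per_disk_gb), max(per_disk_gb) / total

# The three nodes reported above, units assumed to be GB:
skews = {name: disk_skew(disks)
         for name, disks in {"node1": [270, 540],
                             "node2": [150, 580],
                             "node3": [290, 500]}.items()}
```

On node2, for instance, the fuller disk holds nearly four times as much data as the emptier one, which is the kind of skew worth tracking over time before committing to a RAID-0 rebuild.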


Thanks,
 Michael

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org



Re: Cassandra Cluster Expansion Criteria

2017-06-29 Thread Nitan Kainth
Ideally you should maintain at least 50% free disk space.
SLA and Node load is also very important to make the decision.

> On Jun 29, 2017, at 6:45 AM, ZAIDI, ASAD A  wrote:
> 
> Hello Folks,
>  
> I’m on Cassandra 2.2.8 cluster with 14 nodes , each with around 2TB of data 
> volume. I’m looking for a criteria /or data points that can help me decide 
> when or  if I should add more nodes to the cluster and by how many nodes.
>  
> I’ll really appreciate if you guys can share your insights.
>  
> Thanks/Asad



Cassandra Cluster Expansion Criteria

2017-06-29 Thread ZAIDI, ASAD A
Hello Folks,

I’m on a Cassandra 2.2.8 cluster with 14 nodes, each with around 2TB of data
volume. I’m looking for criteria or data points that can help me decide when
or if I should add more nodes to the cluster, and by how many nodes.

I’d really appreciate it if you could share your insights.

Thanks/Asad






Re: ALL range query monitors failing frequently

2017-06-29 Thread Matthew O'Riordan
Thanks Kurt, I appreciate that feedback.

I’ll investigate the metrics more fully and come back with my findings.

In terms of logs, I did look in the logs of the nodes and found nothing, I
am afraid.

On Wed, Jun 28, 2017 at 11:33 PM, kurt greaves  wrote:

> I'd say that no, a range query probably isn't the best for monitoring, but
> it really depends on how important it is that the range you select is
> consistent.
>
> From those traces it does seem that the bulk of the time spent was waiting
> for responses from the replicas, which may indicate a network issue, but
> it's not conclusive evidence.
>
> For SSTables you could check the SSTables per read of the query, but it's
> unnecessary as the traces indicate that's not the issue. Might be worth
> trying to debug potential network issues. Might be worth looking into
> metrics like CoordinatorReadLatency and CoordinatorScanLatency at the table
> level: https://cassandra.apache.org/doc/latest/operating/metrics.html#table-metrics
> Also if you have any network traffic metrics between nodes would be a good
> place to look.
>
> Other than that I'd look in the logs on each node when you run the trace
> and try and identify any errors that could be causing problems.
>



-- 

Regards,

Matthew O'Riordan
CEO who codes
Ably - simply better realtime 

Ably News: Ably push notifications have gone live


CASSANDRA-12849

2017-06-29 Thread Jean Carlo
Hello

CASSANDRA-12849 already has a patch available. Could someone take a look
at this JIRA?


https://issues.apache.org/jira/browse/CASSANDRA-12849

Regards,

Jean Carlo

"The best way to predict the future is to invent it" Alan Kay