[jira] [Updated] (CASSANDRA-6747) MessagingService should handle failures on remote nodes.

2014-04-07 Thread Yuki Morishita (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuki Morishita updated CASSANDRA-6747:
--

Attachment: 6747-v3.txt

Thanks [~kohlisankalp], I updated your patch with the following:

* A MessageOut object is immutable and MessageOut#withParameter returns a new 
object, so we have to use the returned object instead of the original.
* An RTE thrown from ActiveRepairService#prepareForRepair has to be caught and 
reported to the client so that the repair command does not hang.
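
To illustrate the first point, here is a minimal sketch of the pitfall. The Message class and the "example-flag" key are hypothetical stand-ins, not Cassandra's MessageOut or any real parameter name; the only thing taken from the comment above is that withParameter returns a new object and leaves the original untouched.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ImmutableMessageDemo {
    /** Hypothetical immutable message; NOT Cassandra's MessageOut. */
    static final class Message {
        final String payload;
        final Map<String, String> parameters;

        Message(String payload, Map<String, String> parameters) {
            this.payload = payload;
            // Defensive copy wrapped as unmodifiable, so the instance cannot change.
            this.parameters = Collections.unmodifiableMap(new HashMap<>(parameters));
        }

        /** Returns a NEW message carrying the extra parameter; this instance is unchanged. */
        Message withParameter(String key, String value) {
            Map<String, String> copy = new HashMap<>(parameters);
            copy.put(key, value);
            return new Message(payload, copy);
        }
    }

    public static void main(String[] args) {
        Message original = new Message("prepare", Collections.<String, String>emptyMap());

        // Bug pattern: the return value is discarded, so the parameter is lost.
        original.withParameter("example-flag", "1");
        System.out.println(original.parameters.containsKey("example-flag")); // prints false

        // Correct pattern: keep (and send) the object that withParameter returned.
        Message tagged = original.withParameter("example-flag", "1");
        System.out.println(tagged.parameters.containsKey("example-flag")); // prints true
    }
}
```

The same applies to any immutable "with" style method: the caller has to send the returned instance, otherwise the receiver never sees the parameter.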

For a remote snapshot failure, the patch does catch the error on the coordinator 
side, but repair still hangs (marked as TODO in RepairJob#sendTreeRequest). This is 
handled in CASSANDRA-6455.

 MessagingService should handle failures on remote nodes.
 

 Key: CASSANDRA-6747
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6747
 Project: Cassandra
  Issue Type: Improvement
Reporter: sankalp kohli
Assignee: sankalp kohli
Priority: Minor
  Labels: Core
 Fix For: 2.1 beta2

 Attachments: 6747-v3.txt, CASSANDRA-6747-v2.diff, CASSANDRA-6747.diff


 While going through the code of MessagingService, I discovered that we don't 
 handle callbacks on failure very well. If a verb handler on the remote 
 machine throws an exception, it goes straight to the uncaught exception 
 handler. The machine that sent the message keeps waiting until it 
 times out. On timeout, it does some things hard-coded in the MS, such as hints 
 and adding to latency. There is no way in IAsyncCallback to specify what to do 
 on timeouts or on failures. 
 Here are some examples that would benefit if we enhance this system to 
 also propagate failures back, so that IAsyncCallback has methods like 
 onFailure.
 1) From ActiveRepairService.prepareForRepair
     IAsyncCallback callback = new IAsyncCallback()
     {
         @Override
         public void response(MessageIn msg)
         {
             prepareLatch.countDown();
         }

         @Override
         public boolean isLatencyForSnitch()
         {
             return false;
         }
     };
     List<UUID> cfIds = new ArrayList<>(columnFamilyStores.size());
     for (ColumnFamilyStore cfs : columnFamilyStores)
         cfIds.add(cfs.metadata.cfId);
     for (InetAddress neighbour : endpoints)
     {
         PrepareMessage message = new PrepareMessage(parentRepairSession, cfIds, ranges);
         MessageOut<RepairMessage> msg = message.createMessage();
         MessagingService.instance().sendRR(msg, neighbour, callback);
     }
     try
     {
         prepareLatch.await(1, TimeUnit.HOURS);
     }
     catch (InterruptedException e)
     {
         parentRepairSessions.remove(parentRepairSession);
         throw new RuntimeException("Did not get replies from all endpoints.", e);
     }
 2) During snapshot phase in repair, if SnapshotVerbHandler throws an 
 exception, we will wait forever. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (CASSANDRA-6747) MessagingService should handle failures on remote nodes.

2014-04-04 Thread Yuki Morishita (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuki Morishita updated CASSANDRA-6747:
--

Fix Version/s: 2.1 beta2

 MessagingService should handle failures on remote nodes.
 

 Key: CASSANDRA-6747
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6747
 Project: Cassandra
  Issue Type: Improvement
Reporter: sankalp kohli
Assignee: sankalp kohli
Priority: Minor
  Labels: Core
 Fix For: 2.1 beta2

 Attachments: CASSANDRA-6747.diff




[jira] [Updated] (CASSANDRA-6747) MessagingService should handle failures on remote nodes.

2014-04-04 Thread Yuki Morishita (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuki Morishita updated CASSANDRA-6747:
--

Reviewer: Yuki Morishita

[~kohlisankalp] I like your approach. 
One thing you need to change: in SnapshotTask's callback#onFailure, you can't 
just throw a RuntimeException; you have to call task.setException so that repair 
knows an exception occurred during snapshotting.
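
A rough sketch of why this matters: an exception thrown inside a callback only unwinds the thread that ran the callback, while an exception set on the task is seen by whoever waits on it. The sketch below uses the JDK's CompletableFuture#completeExceptionally as a stand-in for task.setException; the FailureCallback interface, the method names, and the error message are all hypothetical and not taken from the patch.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;

public class SnapshotFailureDemo {
    /** Hypothetical callback shape; the name and signature are illustrative only. */
    interface FailureCallback {
        void onFailure(String from);
    }

    /** Builds a callback that completes the task exceptionally instead of throwing. */
    static FailureCallback reportTo(CompletableFuture<String> task) {
        return from -> task.completeExceptionally(
                new RuntimeException("Could not create snapshot at " + from));
    }

    /** Returns the failure message observed by a waiter, or null if the task succeeded. */
    static String waitFor(CompletableFuture<String> task) {
        try {
            task.join();
            return null;
        } catch (CompletionException e) {
            // join() wraps the original exception; unwrap it for the caller.
            return e.getCause().getMessage();
        }
    }

    public static void main(String[] args) {
        CompletableFuture<String> task = new CompletableFuture<>();
        // Simulate the messaging layer invoking the failure callback.
        reportTo(task).onFailure("10.0.0.2");
        System.out.println(waitFor(task)); // prints Could not create snapshot at 10.0.0.2
    }
}
```

Had onFailure simply thrown, the waiter in waitFor would have kept blocking, because nothing would ever complete the task.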

 MessagingService should handle failures on remote nodes.
 

 Key: CASSANDRA-6747
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6747
 Project: Cassandra
  Issue Type: Improvement
Reporter: sankalp kohli
Assignee: sankalp kohli
Priority: Minor
  Labels: Core
 Fix For: 2.1 beta2

 Attachments: CASSANDRA-6747.diff




[jira] [Updated] (CASSANDRA-6747) MessagingService should handle failures on remote nodes.

2014-04-04 Thread sankalp kohli (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sankalp kohli updated CASSANDRA-6747:
-

Attachment: CASSANDRA-6747-v2.diff

 MessagingService should handle failures on remote nodes.
 

 Key: CASSANDRA-6747
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6747
 Project: Cassandra
  Issue Type: Improvement
Reporter: sankalp kohli
Assignee: sankalp kohli
Priority: Minor
  Labels: Core
 Fix For: 2.1 beta2

 Attachments: CASSANDRA-6747-v2.diff, CASSANDRA-6747.diff




[jira] [Updated] (CASSANDRA-6747) MessagingService should handle failures on remote nodes.

2014-04-03 Thread sankalp kohli (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sankalp kohli updated CASSANDRA-6747:
-

Attachment: CASSANDRA-6747.diff

I am adding a new interface which can be used if we want a negative ack back 
from the remote node or need a callback on timeout. 

The new interface is used in two places so far, SnapshotTask.java and 
ActiveRepairService.java. We can use it in other places as well, but I have 
only updated these two. I can create a new JIRA for the others if that makes 
more sense. 

This also fixes the problem in the previous comment for these use cases. 
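
For readers without the diff at hand, the following is a minimal sketch of the idea, not the attached patch. It assumes a callback interface with an extra onFailure method and a latch-based waiter similar to the prepareForRepair example in the issue description. CallbackWithFailure, PrepareTracker, and the String endpoint arguments are hypothetical names chosen for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class PrepareCallbackDemo {
    /** Hypothetical interface; name and signatures are assumptions, not the patch's API. */
    interface CallbackWithFailure {
        void response(String from);
        void onFailure(String from);
    }

    /** Tracks replies from a fixed number of peers and remembers whether any failed. */
    static final class PrepareTracker implements CallbackWithFailure {
        private final CountDownLatch latch;
        private final AtomicBoolean failed = new AtomicBoolean(false);

        PrepareTracker(int peers) {
            latch = new CountDownLatch(peers);
        }

        @Override
        public void response(String from) {
            latch.countDown();
        }

        @Override
        public void onFailure(String from) {
            failed.set(true);
            // Drain the latch so the waiter is released now rather than at the timeout.
            while (latch.getCount() > 0)
                latch.countDown();
        }

        /** Returns true only if every peer answered successfully within the timeout. */
        boolean awaitSuccess(long timeout, TimeUnit unit) {
            try {
                return latch.await(timeout, unit) && !failed.get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        PrepareTracker tracker = new PrepareTracker(2);
        tracker.response("10.0.0.1");  // first peer acknowledges
        tracker.onFailure("10.0.0.2"); // second peer sends a negative ack
        System.out.println(tracker.awaitSuccess(1, TimeUnit.SECONDS)); // prints false
    }
}
```

With only response available, the waiter in this sketch would sit out the full timeout whenever a peer failed; the negative ack lets it return as soon as the failure is known.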

 MessagingService should handle failures on remote nodes.
 

 Key: CASSANDRA-6747
 URL: https://issues.apache.org/jira/browse/CASSANDRA-6747
 Project: Cassandra
  Issue Type: Improvement
Reporter: sankalp kohli
Priority: Minor
  Labels: Core
 Attachments: CASSANDRA-6747.diff

