[ https://issues.apache.org/jira/browse/CASSANDRA-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sankalp kohli reassigned CASSANDRA-6747:
----------------------------------------

    Assignee: sankalp kohli

> MessagingService should handle failures on remote nodes.
> --------------------------------------------------------
>
>                 Key: CASSANDRA-6747
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6747
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: sankalp kohli
>            Assignee: sankalp kohli
>            Priority: Minor
>              Labels: Core
>         Attachments: CASSANDRA-6747.diff
>
>
> While going through the code of MessagingService, I discovered that we don't
> handle callbacks on failure very well. If a Verb Handler on the remote
> machine throws an exception, it goes straight to the uncaught exception
> handler. The machine which sent the message will keep waiting and will
> eventually time out. On timeout, it will do some things that are hard coded
> in the MS, like writing hints and adding to latency. There is no way in
> IAsyncCallback to specify what to do on timeouts or on failures.
> Here are some examples which I found would benefit if we enhance this system
> to also propagate failures back, so that IAsyncCallback has methods like
> onFailure.
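The onFailure idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch and not Cassandra's actual API: the names FailureAwareCallback, PrepareCallback and awaitSuccess are hypothetical, and plain strings stand in for MessageIn and InetAddress. It shows a callback that counts down its latch on both replies and failures, so the waiting side can stop early instead of running into the timeout.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

/** Sketch only: every name here is hypothetical, not Cassandra's actual API. */
public class FailureCallbackSketch
{
    /** Hypothetical callback with an explicit failure path next to the success path. */
    interface FailureAwareCallback<T>
    {
        void response(T msg);
        void onFailure(String from, Throwable cause);
    }

    /** Counts down on both success and failure and records the first failure seen. */
    static final class PrepareCallback implements FailureAwareCallback<String>
    {
        final CountDownLatch latch;
        final AtomicReference<Throwable> firstFailure = new AtomicReference<>();

        PrepareCallback(int expectedReplies)
        {
            latch = new CountDownLatch(expectedReplies);
        }

        @Override
        public void response(String msg)
        {
            latch.countDown();
        }

        @Override
        public void onFailure(String from, Throwable cause)
        {
            // Keep only the first failure; later ones still release the latch.
            firstFailure.compareAndSet(null, new RuntimeException("Failure on " + from, cause));
            latch.countDown();
        }

        /** Returns true only if every endpoint replied successfully within the timeout. */
        boolean awaitSuccess(long timeout, TimeUnit unit)
        {
            try
            {
                return latch.await(timeout, unit) && firstFailure.get() == null;
            }
            catch (InterruptedException e)
            {
                // Restore the interrupt flag and report the wait as unsuccessful.
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args)
    {
        PrepareCallback callback = new PrepareCallback(2);
        callback.response("reply from node1");
        callback.onFailure("node2", new IllegalStateException("verb handler threw"));
        System.out.println("all endpoints succeeded: " + callback.awaitSuccess(1, TimeUnit.SECONDS));
    }
}
```

In this sketch the caller learns about the remote failure as soon as it is reported, rather than after the full wait. How the failure would actually be sent back over the wire is left open here.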
> 1) From ActiveRepairService.prepareForRepair
>    IAsyncCallback callback = new IAsyncCallback()
>    {
>        @Override
>        public void response(MessageIn msg)
>        {
>            prepareLatch.countDown();
>        }
>        @Override
>        public boolean isLatencyForSnitch()
>        {
>            return false;
>        }
>    };
>    List<UUID> cfIds = new ArrayList<>(columnFamilyStores.size());
>    for (ColumnFamilyStore cfs : columnFamilyStores)
>        cfIds.add(cfs.metadata.cfId);
>    for(InetAddress neighbour : endpoints)
>    {
>        PrepareMessage message = new PrepareMessage(parentRepairSession, cfIds, ranges);
>        MessageOut<RepairMessage> msg = message.createMessage();
>        MessagingService.instance().sendRR(msg, neighbour, callback);
>    }
>    try
>    {
>        prepareLatch.await(1, TimeUnit.HOURS);
>    }
>    catch (InterruptedException e)
>    {
>        parentRepairSessions.remove(parentRepairSession);
>        throw new RuntimeException("Did not get replies from all endpoints.", e);
>    }
> 2) During snapshot phase in repair, if SnapshotVerbHandler throws an
> exception, we will wait forever.

--
This message was sent by Atlassian JIRA
(v6.2#6252)