[ 
https://issues.apache.org/jira/browse/CASSANDRA-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13960085#comment-13960085
 ] 

sankalp kohli commented on CASSANDRA-6747:
------------------------------------------

Please review v2, which incorporates your suggestions.

> MessagingService should handle failures on remote nodes.
> --------------------------------------------------------
>
>                 Key: CASSANDRA-6747
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6747
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: sankalp kohli
>            Assignee: sankalp kohli
>            Priority: Minor
>              Labels: Core
>             Fix For: 2.1 beta2
>
>         Attachments: CASSANDRA-6747-v2.diff, CASSANDRA-6747.diff
>
>
> While going through the code of MessagingService, I discovered that we don't 
> handle callbacks on failure very well. If a verb handler on the remote 
> machine throws an exception, it goes straight to the uncaught exception 
> handler. The machine that sent the message keeps waiting until it times 
> out. On timeout, it does some things hard-coded in the MS, like hints 
> and adding to latency. There is no way in IAsyncCallback to specify what to 
> do on timeouts or on failures. 
> Here are some examples that I found would benefit if we enhance this system 
> to also propagate failures back, so that IAsyncCallback has methods like 
> onFailure.
> 1) From ActiveRepairService.prepareForRepair
>    IAsyncCallback callback = new IAsyncCallback()
>        {
>            @Override
>            public void response(MessageIn msg)
>            {
>                prepareLatch.countDown();
>            }
>            @Override
>            public boolean isLatencyForSnitch()
>            {
>                return false;
>            }
>        };
>        List<UUID> cfIds = new ArrayList<>(columnFamilyStores.size());
>        for (ColumnFamilyStore cfs : columnFamilyStores)
>            cfIds.add(cfs.metadata.cfId);
>        for(InetAddress neighbour : endpoints)
>        {
>            PrepareMessage message = new PrepareMessage(parentRepairSession, cfIds, ranges);
>            MessageOut<RepairMessage> msg = message.createMessage();
>            MessagingService.instance().sendRR(msg, neighbour, callback);
>        }
>        try
>        {
>            prepareLatch.await(1, TimeUnit.HOURS);
>        }
>        catch (InterruptedException e)
>        {
>            parentRepairSessions.remove(parentRepairSession);
>            throw new RuntimeException("Did not get replies from all endpoints.", e);
>        }
> 2) During snapshot phase in repair, if SnapshotVerbHandler throws an 
> exception, we will wait forever. 
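
For illustration only, here is a minimal, self-contained sketch of the idea described in the issue: a callback that also receives failures, so the waiting thread is released as soon as every endpoint has either replied or failed. All names in it (FailureAwareCallback, PrepareCallback, awaitAllSucceeded, the string endpoints) are hypothetical and are not taken from the attached patch or from the Cassandra code base.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class FailureAwareCallbackSketch
{
    // Hypothetical callback shape: response() as in the quoted example,
    // plus an onFailure() hook. This is NOT the actual Cassandra interface.
    interface FailureAwareCallback<T>
    {
        void response(T msg);
        void onFailure(String endpoint);
    }

    // Counts down on success or failure, so the waiter is released either way,
    // and records which endpoints reported a failure.
    static final class PrepareCallback implements FailureAwareCallback<String>
    {
        final CountDownLatch latch;
        final List<String> failed = new CopyOnWriteArrayList<>();

        PrepareCallback(int expectedReplies)
        {
            latch = new CountDownLatch(expectedReplies);
        }

        @Override
        public void response(String msg)
        {
            latch.countDown();
        }

        @Override
        public void onFailure(String endpoint)
        {
            failed.add(endpoint);
            latch.countDown();
        }

        // True only if every endpoint replied successfully within the timeout.
        boolean awaitAllSucceeded(long timeout, TimeUnit unit)
        {
            try
            {
                return latch.await(timeout, unit) && failed.isEmpty();
            }
            catch (InterruptedException e)
            {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
    }

    public static void main(String[] args)
    {
        PrepareCallback callback = new PrepareCallback(2);
        callback.response("reply from 10.0.0.1"); // simulated successful reply
        callback.onFailure("10.0.0.2");           // simulated remote handler failure
        boolean ok = callback.awaitAllSucceeded(1, TimeUnit.SECONDS);
        System.out.println("all succeeded: " + ok + ", failed: " + callback.failed);
    }
}
```

In this sketch the caller checks the return value of the wait instead of relying only on InterruptedException, so a remote failure or a timeout can be told apart from a full set of successful replies.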



--
This message was sent by Atlassian JIRA
(v6.2#6252)
