[ 
https://issues.apache.org/jira/browse/CASSANDRA-6747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuki Morishita updated CASSANDRA-6747:
--------------------------------------

    Attachment: 6747-v3.txt

Thanks [~kohlisankalp], I updated your patch with the following changes:

* A MessageOut object is immutable and MessageOut#withParameter returns a new 
object, so we have to use the returned object instead of the original.
* The RuntimeException (RTE) thrown from ActiveRepairService#prepareForRepair 
has to be caught and reported to the client so that the repair command does 
not hang.
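
To illustrate the first point, here is a minimal sketch of the immutable "with" pattern. ImmutableMessage and the parameter key are hypothetical stand-ins invented for this example; this is not Cassandra's MessageOut and makes no claim about its actual fields or signatures.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an immutable message type (NOT Cassandra's MessageOut).
final class ImmutableMessage {
    private final String payload;
    private final Map<String, byte[]> parameters;

    ImmutableMessage(String payload) {
        this(payload, Collections.<String, byte[]>emptyMap());
    }

    private ImmutableMessage(String payload, Map<String, byte[]> parameters) {
        this.payload = payload;
        this.parameters = Collections.unmodifiableMap(parameters);
    }

    // Returns a NEW message carrying the extra parameter; the receiver is unchanged.
    ImmutableMessage withParameter(String key, byte[] value) {
        Map<String, byte[]> copy = new HashMap<>(parameters);
        copy.put(key, value);
        return new ImmutableMessage(payload, copy);
    }

    boolean hasParameter(String key) {
        return parameters.containsKey(key);
    }
}
```

With a type like this, calling `msg.withParameter(key, value)` and discarding the result has no effect on `msg`; the caller has to send the returned object, for example `ImmutableMessage tagged = msg.withParameter(key, value);`.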

For a remote snapshot failure, the patch does catch the error on the 
coordinator side, but the repair still hangs (marked as a TODO in 
RepairJob#sendTreeRequest). That case is handled in CASSANDRA-6455.

> MessagingService should handle failures on remote nodes.
> --------------------------------------------------------
>
>                 Key: CASSANDRA-6747
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6747
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: sankalp kohli
>            Assignee: sankalp kohli
>            Priority: Minor
>              Labels: Core
>             Fix For: 2.1 beta2
>
>         Attachments: 6747-v3.txt, CASSANDRA-6747-v2.diff, CASSANDRA-6747.diff
>
>
> While going through the code of MessagingService, I discovered that we don't 
> handle callbacks on failure very well. If a verb handler on the remote 
> machine throws an exception, it goes straight to the uncaught exception 
> handler. The machine that sent the message keeps waiting and eventually 
> times out. On timeout, it does some things that are hard-coded in the MS, 
> like hints and adding to latency. There is no way in IAsyncCallback to 
> specify what to do on timeouts or on failures. 
> Here are some examples that I found would benefit if we enhance this system 
> to also propagate failures back, so that IAsyncCallback has methods like 
> onFailure.
> 1) From ActiveRepairService.prepareForRepair
>    IAsyncCallback callback = new IAsyncCallback()
>        {
>            @Override
>            public void response(MessageIn msg)
>            {
>                prepareLatch.countDown();
>            }
>            @Override
>            public boolean isLatencyForSnitch()
>            {
>                return false;
>            }
>        };
>        List<UUID> cfIds = new ArrayList<>(columnFamilyStores.size());
>        for (ColumnFamilyStore cfs : columnFamilyStores)
>            cfIds.add(cfs.metadata.cfId);
>        for(InetAddress neighbour : endpoints)
>        {
>            PrepareMessage message = new PrepareMessage(parentRepairSession, cfIds, ranges);
>            MessageOut<RepairMessage> msg = message.createMessage();
>            MessagingService.instance().sendRR(msg, neighbour, callback);
>        }
>        try
>        {
>            prepareLatch.await(1, TimeUnit.HOURS);
>        }
>        catch (InterruptedException e)
>        {
>            parentRepairSessions.remove(parentRepairSession);
>            throw new RuntimeException("Did not get replies from all endpoints.", e);
>        }
> 2) During snapshot phase in repair, if SnapshotVerbHandler throws an 
> exception, we will wait forever. 
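
The proposal in the description can be sketched roughly as follows. FailureAwareCallback and PrepareTracker are hypothetical names invented for this example, with String standing in for both the message and the endpoint; this is not the actual IAsyncCallback interface or the code in the attached patch. The sketch assumes that a failure reported by any endpoint should release the waiting thread immediately, and that the result of the timed wait is checked instead of being ignored.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical callback shape with a failure hook (NOT the real IAsyncCallback).
interface FailureAwareCallback<T> {
    void response(T msg);
    void onFailure(String fromEndpoint);
}

// Hypothetical helper that waits for replies from a fixed number of endpoints.
final class PrepareTracker implements FailureAwareCallback<String> {
    private final CountDownLatch latch;
    private final AtomicBoolean failed = new AtomicBoolean(false);

    PrepareTracker(int expectedReplies) {
        this.latch = new CountDownLatch(expectedReplies);
    }

    @Override
    public void response(String msg) {
        latch.countDown();
    }

    @Override
    public void onFailure(String fromEndpoint) {
        failed.set(true);
        // Drain the latch so the waiter wakes up now instead of at the timeout.
        while (latch.getCount() > 0)
            latch.countDown();
    }

    // True only if every endpoint replied successfully within the timeout.
    boolean awaitSuccess(long timeout, TimeUnit unit) {
        try {
            boolean completed = latch.await(timeout, unit);
            return completed && !failed.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }
}
```

A caller would then treat a `false` result from `awaitSuccess` as a failed prepare phase and report it to the client, rather than continuing or blocking until the timeout expires.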



--
This message was sent by Atlassian JIRA
(v6.2#6252)
