[ https://issues.apache.org/jira/browse/IMPALA-8904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933845#comment-16933845 ]
ASF subversion and git services commented on IMPALA-8904:
---------------------------------------------------------

Commit b96b3b0b1ca97e5d756392a159e22dfcd8bcae71 in impala's branch refs/heads/master from Sahil Takiar
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b96b3b0 ]

IMPALA-8634: Catalog client should retry RPCs

Add retries to catalogd RPCs. Previously, connection failures triggered
a retry, but failures of the RPC itself did not.

This change replaces all usages of ClientCache::DoRpc() in the
CatalogOpExecutor with ClientCache::DoRpcWithRetry(). It also moves the
connection retry loop into DoRpcWithRetry(), instead of relying on the
ClientCache to retry the connection.

This patch is based on IMPALA-8904, which adds similar functionality to
statestore RPCs.

Testing:
* Renamed test_statestore_rpc_errors.py to test_services_rpc_errors.py
  and added new tests for catalogd RPC errors
* Added new tests to test_restart_services.py
* Ran core tests

Change-Id: I7f33ad2b36d301fb64e70a939e71decab0ca993c
Reviewed-on: http://gerrit.cloudera.org:8080/14246
Reviewed-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Tested-by: Impala Public Jenkins <impala-public-jenk...@cloudera.com>


> Daemons fails fast when statestore has not started up
> -----------------------------------------------------
>
>                 Key: IMPALA-8904
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8904
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>    Affects Versions: Impala 3.1.0, Impala 3.2.0, Impala 3.3.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Major
>             Fix For: Impala 3.4.0
>
>
> If you start the statestored and the other services at the same time, there
> is a race between the statestore starting and the other services trying to
> register with it. If the other services "win" the race, they abort startup
> because they can't register with the statestore.
> The log looks like this:
> {noformat}
> I0828 00:19:10.460000     1 statestore-subscriber.cc:219] Starting statestore subscriber
> I0828 00:19:10.461310     1 thrift-server.cc:451] ThriftServer 'StatestoreSubscriber' started on port: 23000
> I0828 00:19:10.461320     1 statestore-subscriber.cc:247] Registering with statestore
> I0828 00:19:10.461309   299 TAcceptQueueServer.cpp:314] connection_setup_thread_pool_size is set to 2
> I0828 00:19:10.462744     1 statestore-subscriber.cc:253] statestore registration unsuccessful: RPC Error: Client for statestored:24000 hit an unexpected exception: No more data to read., type: N6apache6thrift9transport19TTransportExceptionE, rpc: N6impala27TRegisterSubscriberResponseE, send: done
> E0828 00:19:10.462818     1 impalad-main.cc:90] Impalad services did not start correctly, exiting. Error: RPC Error: Client for statestored:24000 hit an unexpected exception: No more data to read., type: N6apache6thrift9transport19TTransportExceptionE, rpc: N6impala27TRegisterSubscriberResponseE, send: done
> Statestore subscriber did not start up.
> {noformat}
> Most management systems will automatically restart failed processes, so
> typically the impalads will come back up and find the statestore, but the
> crash loop is unnecessary.
> I propose that the services should retry for a while before giving up (we
> still want the services to fail when there genuinely isn't a statestore
> available).
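
The "retry for a while before giving up" behaviour proposed in the issue can be sketched roughly as below. This is only an illustration in Python, not Impala's C++ implementation of ClientCache::DoRpcWithRetry(). The names `call_with_retry`, `RpcError`, `max_attempts` and `retry_interval_s` are hypothetical and do not correspond to real Impala functions or flags. The sketch assumes that any failed RPC is retried a bounded number of times with a fixed pause, and that the last error is surfaced once the budget is exhausted, so a service still fails when no statestore is genuinely available.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RpcError(Exception):
    """Hypothetical stand-in for a failed RPC (e.g. a Thrift transport error)."""


def call_with_retry(
    rpc: Callable[[], T],
    max_attempts: int = 10,
    retry_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call rpc() until it succeeds or max_attempts calls have failed.

    Hypothetical helper: the parameter names and defaults are assumptions
    made for this sketch, not Impala configuration.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[RpcError] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc()
        except RpcError as err:
            last_error = err
            # Pause only if another attempt will follow.
            if attempt < max_attempts:
                sleep(retry_interval_s)
    # Retry budget exhausted: fail, as the issue says services still should.
    raise RpcError(
        f"giving up after {max_attempts} attempts: {last_error}"
    ) from last_error


if __name__ == "__main__":
    # Simulate a statestore that only becomes reachable on the third call.
    state = {"calls": 0}

    def register() -> str:
        state["calls"] += 1
        if state["calls"] < 3:
            raise RpcError("No more data to read.")
        return "registered"

    print(call_with_retry(register, max_attempts=5, retry_interval_s=0.0))
```

Passing `sleep` as a parameter is a choice made for the sketch so that the pause can be replaced in tests; a bounded attempt count keeps the "fail when there genuinely isn't a statestore" behaviour.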