[ https://issues.apache.org/jira/browse/IMPALA-9295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sahil Takiar resolved IMPALA-9295.
----------------------------------
    Fix Version/s: Impala 3.4.0
       Resolution: Fixed

> RPC failures don't always trigger a blacklist
> ---------------------------------------------
>
>                 Key: IMPALA-9295
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9295
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Sahil Takiar
>            Assignee: Sahil Takiar
>            Priority: Major
>             Fix For: Impala 3.4.0
>
>
> There is a race condition in IMPALA-9137. It is possible for the
> aux_error_info and the failure status to arrive in separate exec status
> reports.
>
> IMPALA-9137 added AuxErrorInfoPB to FragmentInstanceExecStatusPB (which
> contains per-fragment info for a ReportExecStatusRequestPB). The idea is
> that if a query fails, the Coordinator uses the AuxErrorInfoPB to
> potentially blacklist any nodes that caused the failure. The Coordinator
> only looks for AuxErrorInfoPB if the query has failed (e.g. if
> ReportExecStatusRequestPB::overall_status is set to an error).
>
> The issue is that it is possible for the AuxErrorInfoPB to be set even
> though overall_status == OK. There is a race condition on the executor
> side: the setting of aux_error_info and the setting of overall_status are
> not synchronized. So if a fragment fails due to an RPC error, it is
> possible for report number "x" to include the aux_error_info with
> overall_status == OK, and for report number "x + 1" to include no
> aux_error_info with overall_status == [some-RPC-failure-message].
>
> Report number "x + 1" won't include the aux_error_info because the
> fragment has finished and its last FragmentInstanceExecStatusPB was sent
> in report number "x".
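
The report ordering described in the issue can be replayed in a small Python sketch. It is an illustration only, not Impala code and not the change that resolved this issue: the classes `InstanceStatus`, `Report`, `NaiveCoordinator`, and `BufferingCoordinator`, the field `aux_error_node`, and the address `node-b:27000` are all hypothetical stand-ins for the protobuf messages and coordinator logic named above. The naive coordinator reads aux error info only from a report whose overall status is an error, so it never blacklists the node. The buffering variant shows one conceivable mitigation, assumed here for illustration: remember aux error info from every report and act on it once a failure status arrives.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class InstanceStatus:
    """Hypothetical stand-in for FragmentInstanceExecStatusPB."""
    instance_id: str
    # Hypothetical stand-in for AuxErrorInfoPB: the node blamed for an RPC error.
    aux_error_node: Optional[str] = None


@dataclass
class Report:
    """Hypothetical stand-in for ReportExecStatusRequestPB."""
    overall_status: str = "OK"
    instances: List[InstanceStatus] = field(default_factory=list)


class NaiveCoordinator:
    """Looks at aux error info only when the same report carries an error status."""

    def __init__(self) -> None:
        self.blacklist: Set[str] = set()

    def handle(self, report: Report) -> None:
        if report.overall_status == "OK":
            return  # aux error info in an OK report is ignored
        for inst in report.instances:
            if inst.aux_error_node is not None:
                self.blacklist.add(inst.aux_error_node)


class BufferingCoordinator:
    """Assumed mitigation: remember aux error info until a failure status arrives."""

    def __init__(self) -> None:
        self.blacklist: Set[str] = set()
        self._pending: Set[str] = set()

    def handle(self, report: Report) -> None:
        # Record aux error info regardless of the report's overall status.
        for inst in report.instances:
            if inst.aux_error_node is not None:
                self._pending.add(inst.aux_error_node)
        # Act on everything recorded so far once the query is known to have failed.
        if report.overall_status != "OK":
            self.blacklist |= self._pending
            self._pending.clear()


def replay(coordinator) -> Set[str]:
    """Replay the two reports from the issue description in order."""
    # Report "x": aux error info is present, but overall_status is still OK.
    report_x = Report("OK", [InstanceStatus("frag-1", aux_error_node="node-b:27000")])
    # Report "x + 1": failure status, but the finished fragment sends no more info.
    report_x_plus_1 = Report("RPC failure", [])
    coordinator.handle(report_x)
    coordinator.handle(report_x_plus_1)
    return coordinator.blacklist
```

Calling `replay(NaiveCoordinator())` returns an empty set, which matches the symptom in the issue title, while `replay(BufferingCoordinator())` returns a set containing the hypothetical node address. Synchronizing the two fields on the executor side would be another way to close the race; the sketch does not model that option.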