Paul Guo created HAWQ-1335: ------------------------------ Summary: Need to refactor some QD error handling code. Key: HAWQ-1335 URL: https://issues.apache.org/jira/browse/HAWQ-1335 Project: Apache HAWQ Issue Type: Bug Components: Dispatcher Reporter: Paul Guo Assignee: Ed Espino
While recently I worked on QD related issues I found the QD error handling code is really confusing. At least: 1) dispmgt_thread_func_run() /* * Cleanup rules: * 1. query cancel, result error, and poll error: mark the executor stop. * 2. connection error: mark the gang error. Set by workermgr_mark_executor_error(). */ I do not think the code is handling like this. Maybe I did not understand related details. Besides workermgr_mark_executor_error() does not exist also. 2) In executormgr_cancel() #if 0 if (success) { executor->state = QES_STOP; executor->health = QEH_CANCEL; } else { /* TODO: log error? how to deal with connection error. */ executormgr_catch_error(executor); } #endif { write_log("function executormgr_cancel calling executormgr_catch_error"); executormgr_catch_error(executor); } Why executormgr_catch_error() is called for all cases? and whether "success" is not enough to judge whether executormgr_catch_error() should be called or not. 3) cdbdisp_seterrcode() if (!dispatchResult->errcode) { dispatchResult->errcode = (errcode == 0) ? ERRCODE_INTERNAL_ERROR : errcode; if (resultIndex >= 0) dispatchResult->errindex = resultIndex; } Why need to set ERRCODE_INTERNAL_ERROR while leave meleeResults->err code alone. This piece of code seems to be totally redundant. 4) It is splitting general errors with normal cancel kinda of interrupts. However I still see error code below related to cancellation. if (errcode == ERRCODE_GP_OPERATION_CANCELED || errcode == ERRCODE_QUERY_CANCELED) Is it possible to combine them to just error code only. 5) dispmgt_thread_func_run() should really set error code for each error_cleanup cases. 6) dispmgt_thread_func_run() could quit earlier not just because QE, e.g. lack of memory on QD so QD fails in some system calls with ENOMEM, while cdbdisp_seterrcode() seems to need a related executor. 7) Probably need to rewrite some of the logs and the comments. -- This message was sent by Atlassian JIRA (v6.3.15#6346)