[jira] [Updated] (TS-3104) traffic_cop can't restart traffic_manager properly
[ https://issues.apache.org/jira/browse/TS-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Phil Sorber updated TS-3104:
---------------------------
    Fix Version/s:     (was: 5.3.0)
                       6.0.0

> traffic_cop can't restart traffic_manager properly
> --------------------------------------------------
>
>                 Key: TS-3104
>                 URL: https://issues.apache.org/jira/browse/TS-3104
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Cop
>            Reporter: Victor
>            Assignee: James Peach
>             Fix For: 6.0.0
>
>         Attachments: ts-0022-fix-lockfile-killgroup.patch, ts-0023-cop-reinit-mgr-api-on-failure.patch
>
>
> In some cases traffic_cop can't restart traffic_manager properly. We met these issues at "Ashmanov and partners" (http://en.ashmanov.com/). There are two places in code which in my opinion need corrections:
> 1) The logic which decides whether to kill process or group.
> 2) The main traffic_cop loop: it doesn't reinitialize manager API in case of failure and this fact leads to constant attempts to connect to manager using socket id == -1.
> I have prepared patches for both issues. Please kindly take a look at them and let me know your thoughts.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (TS-3104) traffic_cop can't restart traffic_manager properly
[ https://issues.apache.org/jira/browse/TS-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Susan Hinrichs updated TS-3104:
-------------------------------
    Assignee: James Peach
[jira] [Updated] (TS-3104) traffic_cop can't restart traffic_manager properly
[ https://issues.apache.org/jira/browse/TS-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Susan Hinrichs updated TS-3104:
-------------------------------
    Fix Version/s:     (was: 5.2.0)
                       5.3.0
[jira] [Updated] (TS-3104) traffic_cop can't restart traffic_manager properly
[ https://issues.apache.org/jira/browse/TS-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leif Hedstrom updated TS-3104:
------------------------------
    Fix Version/s: 5.2.0
[jira] [Updated] (TS-3104) traffic_cop can't restart traffic_manager properly
[ https://issues.apache.org/jira/browse/TS-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Victor updated TS-3104:
----------------------
    Description:
In some cases traffic_cop can't restart traffic_manager properly. We met these issues at "Ashmanov and partners" (http://en.ashmanov.com/). There are two places in code which in my opinion need corrections:
1) The logic which decides whether to kill process or group.
2) The main traffic_cop loop: it doesn't reinitialize manager API in case of failure and this fact leads to constant attempts to connect to manager using socket id == -1.
I have prepared patches for both issues. Please kindly take a look at them and let me know your thoughts.

  was:
In some cases traffic_cop can't restart traffic_manager properly. We met these issues at "Ashmanov and partners" (http://en.ashmanov.com/). There are two places in code which in my opinion need corrections:
1) The logic which decides whether to kill process or group.
2) The main traffic_cop loop: it doesn't reinitialize manager API in case of failure and this fact leads to constant attempts to connect to manager using socket id == -1.
I have prepared patches for both issues. Please kindly take a look at them and let me know your thoughts.

diff --git lib/ts/lockfile.cc lib/ts/lockfile.cc
index f6e9587..dbd7394 100644
--- lib/ts/lockfile.cc
+++ lib/ts/lockfile.cc
@@ -241,6 +241,7 @@ Lockfile::KillGroup(int sig, int initial_sig, const char *pname)
   int err;
   pid_t pid;
   pid_t holding_pid;
+  pid_t self = getpid();
 
   err = Open(&holding_pid);
   if (err == 1) // success getting the lock file
@@ -252,12 +253,20 @@ Lockfile::KillGroup(int sig, int initial_sig, const char *pname)
       pid = getpgid(holding_pid);
     } while ((pid < 0) && (errno == EINTR));
 
-    if ((pid < 0) || (pid == getpid()))
+    if ((pid < 0) || (pid == self)) {
+      // Error getting process group,
+      // or we are the group's owner.
+      // Let's kill just holding_pid
       pid = holding_pid;
-
-    if (pid != 0) {
+    }
+    else if (pid != self) {
+      // We managed to get holding_pid's process group
+      // and this group is not ours.
       // This way, we kill the process_group:
       pid = -pid;
+    }
+
+    if (pid != 0) {
       // In order to get core files from each process, please
       // set your core_pattern appropriately.
       lockfile_kill_internal(holding_pid, initial_sig, pid, pname, sig);
diff --git cop/TrafficCop.cc cop/TrafficCop.cc
index 307270e..56bc6d2 100644
--- cop/TrafficCop.cc
+++ cop/TrafficCop.cc
@@ -59,6 +59,7 @@ static const char COP_TRACE_FILE[] = "/tmp/traffic_cop.trace";
 
 #define COP_FATAL    LOG_ALERT
 #define COP_WARNING  LOG_ERR
+#define COP_INFO     LOG_INFO
 #define COP_DEBUG    LOG_DEBUG
 
 Diags * g_diags; // link time dependency
@@ -131,6 +132,9 @@ static int child_pid = 0;
 static int child_status = 0;
 static int sem_id = 11452;
 
+// manager API is initialized
+static bool mgmt_init = false;
+
 AppVersionInfo appVersionInfo;
 
 static char const localhost[] = "127.0.0.1";
@@ -1142,6 +1146,7 @@ test_mgmt_cli_port()
 
   if (TSRecordGetString("proxy.config.manager_binary", &val) != TS_ERR_OKAY) {
     cop_log(COP_WARNING, "(cli test) unable to retrieve manager_binary\n");
+    mgmt_init = false;
     ret = -1;
   } else {
     if (strcmp(val, manager_binary) != 0) {
@@ -1544,7 +1549,6 @@ check_no_run()
 static void*
 check(void *arg)
 {
-  bool mgmt_init = false;
   cop_log_trace("Entering check()\n");
 
   for (;;) {
@@ -1593,6 +1597,7 @@ check(void *arg)
 
     // We do this after the first round of checks, since the first "check" will spawn traffic_manager
     if (!mgmt_init) {
+      cop_log(COP_INFO, "Initializing manager API\n");
       TSInit(Layout::get()->runtimedir, static_cast<TSInitOptionT>(TS_MGMT_OPT_NO_EVENTS | TS_MGMT_OPT_NO_SOCK_TESTS));
       mgmt_init = true;
     }
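Editorial note on issue 1: the lockfile.cc hunk separates "which target do we signal" from the signalling itself. A minimal sketch of that decision follows, assuming only standard POSIX behaviour (kill(2) treats a negative pid as a process group). The names `choose_kill_target` and `holder_pgid_of` are hypothetical and do not appear in the patch or in Traffic Server.

```cpp
#include <sys/types.h>
#include <unistd.h>
#include <cerrno>

// Hypothetical helper, not from the patch: picks the pid argument for kill(2).
//   > 0   signal only the lock holder
//   < 0   signal the holder's whole process group
//   == 0  nothing sensible to signal
inline pid_t
choose_kill_target(pid_t holding_pid, pid_t holder_pgid, pid_t self)
{
  if (holder_pgid < 0 || holder_pgid == self) {
    // The lookup failed, or the holder's group id equals our own pid
    // (we lead that group): signalling the group would hit us as well,
    // so target the holder alone.
    return holding_pid;
  }
  // The holder belongs to some other group; a negative pid addresses the group.
  return -holder_pgid;
}

// Hypothetical wrapper: getpgid() retried on EINTR, mirroring the loop
// shown in the diff context.
inline pid_t
holder_pgid_of(pid_t holding_pid)
{
  pid_t pgid;
  do {
    pgid = getpgid(holding_pid);
  } while (pgid < 0 && errno == EINTR);
  return pgid;
}
```

A caller would compute `choose_kill_target(holding_pid, holder_pgid_of(holding_pid), getpid())` and skip signalling when the result is 0. Keeping the decision in a pure function makes the "process or group" cases easy to check without sending any signals.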
[jira] [Updated] (TS-3104) traffic_cop can't restart traffic_manager properly
[ https://issues.apache.org/jira/browse/TS-3104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Victor updated TS-3104:
----------------------
    Attachment: ts-0023-cop-reinit-mgr-api-on-failure.patch
                ts-0022-fix-lockfile-killgroup.patch

Patches for described issues.
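Editorial note on issue 2: the report says the cop loop never reinitializes the manager API after a failure. The TrafficCop.cc change addresses this by making the "initialized" flag visible to the check that can fail, so a failure clears it and the loop initializes again. Below is a simplified model of that pattern, not the Traffic Server code: `CopState` and `cop_pass` are hypothetical names, and the two callbacks stand in for the real checks and for `TSInit`.

```cpp
#include <functional>

// Hypothetical state holder; in the patch this is a file-scope
// `static bool mgmt_init`.
struct CopState {
  bool mgmt_init = false;
};

// One pass of a simplified cop loop.
//   run_checks: stands in for the periodic checks; returns false on failure
//   init_api:   stands in for the manager API initialization call
// Returns the result of the checks for this pass.
inline bool
cop_pass(CopState &st, const std::function<bool()> &run_checks,
         const std::function<void()> &init_api)
{
  const bool ok = run_checks();
  if (!ok) {
    // A failed check invalidates the API handle, so forget that we initialized.
    st.mgmt_init = false;
  }
  if (!st.mgmt_init) {
    // First pass, or a pass following a failure: (re)initialize.
    init_api();
    st.mgmt_init = true;
  }
  return ok;
}
```

With a flag local to the loop, as before the patch, the failing check has no way to request re-initialization, which matches the repeated connection attempts on socket id -1 described in the report.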