empiredan commented on issue #2149:
URL: https://github.com/apache/incubator-pegasus/issues/2149#issuecomment-2490312691
There are two problems to solve:
1. Why did the primary meta server fail with a `segfault` while dropping tables?
2. Why could none of the meta servers restart normally after the primary meta server failed?
To make the causes of both problems clearer, let me first describe how the metadata is updated. A Pegasus cluster periodically flushes security policies to remote meta storage (every `update_ranger_policy_interval_sec` seconds) in the form of environment variables, via `server_state::set_app_envs()`. However, after the metadata has been updated on the remote meta storage (namely ZooKeeper), the callback does not check whether the table still exists before updating the environment variables held in local memory. See the following code:
```C++
void server_state::set_app_envs(const app_env_rpc &env_rpc)
{
    ...
    do_update_app_info(app_path, ainfo, [this, app_name, keys, values, env_rpc](error_code ec) {
        CHECK_EQ_MSG(ec, ERR_OK, "update app info to remote storage failed");

        zauto_write_lock l(_lock);
        // NOTE: `app` is never checked against nullptr, although the table may
        // have been dropped while the remote update was in flight.
        std::shared_ptr<app_state> app = get_app(app_name);
        std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
        for (int idx = 0; idx < keys.size(); idx++) {
            app->envs[keys[idx]] = values[idx];
        }
        std::string new_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');
        LOG_INFO("app envs changed: old_envs = {}, new_envs = {}", old_envs, new_envs);
    });
}
```
In `std::string old_envs = dsn::utils::kv_map_to_string(app->envs, ',', '=');`, `app` is `nullptr` because the table has already been removed, so `app->envs` dereferences an invalid address, which produces the `segfault` inside `libdsn_utils.so`, where `dsn::utils::kv_map_to_string` is defined.

The cause of the 1st problem is therefore clear: the callback for updating metadata on remote storage runs right after the table has been removed, and the null pointer returned by `get_app()` is dereferenced.
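For illustration only, here is a minimal, self-contained sketch of the same pattern outside Pegasus (the `Registry`/`AppState` types and all other names below are made up, not the actual Pegasus classes). It shows why the callback has to re-check the lookup result after the remote update completes, since the table may have been dropped in the meantime:

```C++
#include <iostream>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins for app_state / server_state.
struct AppState {
    std::map<std::string, std::string> envs;
};

struct Registry {
    std::map<std::string, std::shared_ptr<AppState>> apps;

    // Returns nullptr once the table has been dropped, just like get_app().
    std::shared_ptr<AppState> get_app(const std::string &name) const {
        auto it = apps.find(name);
        return it == apps.end() ? nullptr : it->second;
    }
};

// Models the callback that runs after the remote-storage update succeeds.
void on_remote_update_done(Registry &reg, const std::string &app_name) {
    std::shared_ptr<AppState> app = reg.get_app(app_name);
    if (app == nullptr) {
        // Without this guard, `app->envs` below would dereference a null
        // pointer, the same crash seen inside kv_map_to_string().
        std::cout << "table " << app_name << " was dropped, skip env update\n";
        return;
    }
    app->envs["some_policy_key"] = "some_policy_value";
    std::cout << "updated envs of " << app_name << '\n';
}

int main() {
    Registry reg;
    reg.apps["t1"] = std::make_shared<AppState>();

    reg.apps.erase("t1");             // the table is dropped first...
    on_remote_update_done(reg, "t1"); // ...then the callback fires, safely.
    return 0;
}
```

In the real `set_app_envs()` callback, an equivalent null check on the result of `get_app(app_name)` (logging and returning early when the table is gone) would avoid the crash.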
As for the 2nd problem: after a meta server restarts, it loads the metadata from remote storage. All metadata of a table is stored as a single `json` object, so the whole object, including the current status, is written to remote storage whenever any property (such as the security-policy environment variables) is updated. This is how the intermediate status `AS_DROPPING` ended up on remote storage. When the metadata is loaded back, `AS_DROPPING` is not an expected status and fails the assertion below, so the meta servers crash again and again on every restart. See the following code:
```C++
server_state::sync_apps_from_remote_storage()
{
    ...
    std::shared_ptr<app_state> app = app_state::create(info);
    {
        zauto_write_lock l(_lock);
        _all_apps.emplace(app->app_id, app);
        if (app->status == app_status::AS_AVAILABLE) {
            app->status = app_status::AS_CREATING;
            _exist_apps.emplace(app->app_name, app);
            _table_metric_entities.create_entity(app->app_id, app->partition_count);
        } else if (app->status == app_status::AS_DROPPED) {
            app->status = app_status::AS_DROPPING;
        } else {
            // Any other status loaded from remote storage, including
            // AS_DROPPING, hits this assertion and aborts the meta server.
            CHECK(false,
                  "invalid status({}) for app({}) in remote storage",
                  enum_to_string(app->status),
                  app->get_logname());
        }
    }
    ...
}
```
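As a standalone sketch only (this is not the Pegasus implementation, and treating a persisted `AS_DROPPING` as a drop to be resumed is my assumption, not necessarily the fix adopted upstream), the status mapping during reload could tolerate `AS_DROPPING` instead of asserting:

```C++
#include <iostream>
#include <stdexcept>

// Mirrors the relevant statuses from the snippet above.
enum class app_status { AS_AVAILABLE, AS_CREATING, AS_DROPPED, AS_DROPPING, AS_INVALID };

// Decide which in-memory status an app loaded from remote storage should get.
app_status status_after_reload(app_status on_storage) {
    switch (on_storage) {
    case app_status::AS_AVAILABLE:
        return app_status::AS_CREATING;  // re-create partitions, then become available
    case app_status::AS_DROPPED:
        return app_status::AS_DROPPING;  // finish the drop
    case app_status::AS_DROPPING:
        // The drop was in flight when set_app_envs() persisted the whole json;
        // resume the drop instead of crashing the meta server.
        return app_status::AS_DROPPING;
    default:
        throw std::invalid_argument("unexpected app status on remote storage");
    }
}

int main() {
    // AS_DROPPING loaded from remote storage no longer aborts the process.
    std::cout << static_cast<int>(status_after_reload(app_status::AS_DROPPING)) << "\n";
    return 0;
}
```

The two fixes are complementary: a null check prevents the segfault during `set_app_envs()`, while a more tolerant reload path keeps a cluster that has already persisted `AS_DROPPING` from crash-looping.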