Good afternoon, everyone.

This morning we hit "too many clients" on the database. I was not at the
office, and the network admin went in and killed a bunch of postgres
connections that he thought were old...

The database became inaccessible, so he restarted it, and it did not come
back up. He had to delete the PID file by hand, and then the database started...
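
Before postmaster.pid is deleted by hand, it helps to confirm that no
postmaster is still running, since removing the file under a live server can
let a second postmaster start on the same data directory. A minimal sketch of
that check, assuming a Unix-like host and that the first line of the file
holds the postmaster PID (the function names are illustrative, not part of
PostgreSQL):

```python
import os


def read_postmaster_pid(pid_file):
    """Return the PID stored on the first line of a postmaster.pid file."""
    with open(pid_file) as fh:
        return int(fh.readline().strip())


def process_is_alive(pid):
    """Return True if a process with this PID currently exists."""
    try:
        # Signal 0 performs the existence/permission check only;
        # no signal is actually delivered to the process.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

Called as process_is_alive(read_postmaster_pid(path)), with path pointing at
postmaster.pid in the data directory, a False result suggests the file is
stale. A True result is not proof that the process is a postmaster, because
PIDs can be reused.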

After that, when I arrived, I noticed the database was crashing and
restarting by itself, showing these messages:

* 2013-09-26 12:09:25 BRT [18539]: [1-1] db=,user= LOG:  server process
(PID 23040) was terminated by signal 6*
* 2013-09-26 12:09:25 BRT [18539]: [2-1] db=,user= LOG:  terminating any
other active server processes*
*10.11.0.2 2013-09-26 12:09:25 BRT [23043]: [3-1] db=cimed,user=postgres
WARNING:  terminating connection because of crash of another server process*
*10.11.0.2 2013-09-26 12:09:25 BRT [23043]: [4-1] db=cimed,user=postgres
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.*
*10.11.0.2 2013-09-26 12:09:25 BRT [23043]: [5-1] db=cimed,user=postgres
HINT:  In a moment you should be able to reconnect to the database and
repeat your command.*


When it came back up, this was the log:

* 2013-09-26 12:09:25 BRT [18539]: [3-1] db=,user= LOG:  all server
processes terminated; reinitializing*
* 2013-09-26 12:09:26 BRT [23047]: [1-1] db=,user= LOG:  database system
was interrupted at 2013-09-26 12:04:32 BRT*
* 2013-09-26 12:09:26 BRT [23047]: [2-1] db=,user= LOG:  checkpoint record
is at 160/370DEA58*
* 2013-09-26 12:09:26 BRT [23047]: [3-1] db=,user= LOG:  redo record is at
160/370D18C8; undo record is at 0/0; shutdown FALSE*
* 2013-09-26 12:09:26 BRT [23047]: [4-1] db=,user= LOG:  next transaction
ID: 499844432; next OID: 572978777*
* 2013-09-26 12:09:26 BRT [23047]: [5-1] db=,user= LOG:  next MultiXactId:
15762; next MultiXactOffset: 37493*
* 2013-09-26 12:09:26 BRT [23047]: [6-1] db=,user= LOG:  database system
was not properly shut down; automatic recovery in progress*
* 2013-09-26 12:09:26 BRT [23047]: [7-1] db=,user= LOG:  redo starts at
160/370D18C8*
* 2013-09-26 12:09:26 BRT [23047]: [8-1] db=,user= LOG:  record with zero
length at 160/3768AD90*
* 2013-09-26 12:09:26 BRT [23047]: [9-1] db=,user= LOG:  redo done at
160/3768AD60*
* 2013-09-26 12:09:33 BRT [23047]: [10-1] db=,user= LOG:  database system
is ready*
* 2013-09-26 12:09:33 BRT [23047]: [11-1] db=,user= LOG:  transaction ID
wrap limit is 1073777089, limited by database "cimed"*

and it was always preceded by these messages (note that there were several occurrences):

*10.11.0.2 2013-09-26 12:09:24 BRT [23040]: [3-1] db=cimed,user=postgres
PANIC:  right sibling's left-link doesn't match*
*10.11.0.2 2013-09-26 12:25:35 BRT [23843]: [1-1] db=cimed,user=postgres
PANIC:  right sibling's left-link doesn't match*
*10.11.0.2 2013-09-26 12:29:26 BRT [24116]: [1-1] db=cimed,user=postgres
PANIC:  right sibling's left-link doesn't match*
*10.11.0.2 2013-09-26 12:44:34 BRT [25066]: [1-1] db=cimed,user=postgres
PANIC:  right sibling's left-link doesn't match*
*10.11.0.2 2013-09-26 13:34:21 BRT [28222]: [1-1] db=cimed,user=postgres
PANIC:  right sibling's left-link doesn't match*
*10.11.0.2 2013-09-26 13:47:51 BRT [29590]: [1-1] db=cimed,user=postgres
PANIC:  right sibling's left-link doesn't match*
*10.11.0.2 2013-09-26 14:03:48 BRT [30643]: [1-1] db=cimed,user=postgres
PANIC:  right sibling's left-link doesn't match*
*10.11.0.2 2013-09-26 14:26:33 BRT [31689]: [206-1] db=cimed,user=postgres
PANIC:  right sibling's left-link doesn't match*
*10.11.0.2 2013-09-26 14:30:27 BRT [31902]: [9-1] db=cimed,user=postgres
PANIC:  right sibling's left-link doesn't match*
*10.11.0.2 2013-09-26 14:50:00 BRT [924]: [127-1] db=cimed,user=postgres
PANIC:  right sibling's left-link doesn't match*
*10.11.0.2 2013-09-26 14:55:26 BRT [1985]: [5-1] db=cimed,user=postgres
PANIC:  right sibling's left-link doesn't match*
*10.11.0.2 2013-09-26 15:05:46 BRT [3063]: [15-1] db=cimed,user=postgres
PANIC:  right sibling's left-link doesn't match*

*in addition to the notice about transactions being rolled back:*
*
10.11.0.2 2013-09-26 17:25:11 BRT [11831]: [4-1] db=nutracom,user=visao
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
10.35.0.2 2013-09-26 17:25:11 BRT [11982]: [2-1] db=cimed,user=postgres
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
10.11.0.2 2013-09-26 17:25:11 BRT [13011]: [2-1] db=cimed,user=postgres
DETAIL:  The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.

*
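
The PANIC entries quoted above can be pulled out of the server log to see how
often the backend crashes. A minimal sketch, assuming each log entry sits on a
single line in the actual log file (in this email the entries are wrapped) and
uses the same prefix layout as the excerpts above; the function names are
illustrative:

```python
import re
from datetime import datetime

# Matches entries shaped like the ones quoted in this post:
# "<host> <date> <time> <tz> [<pid>]: [<n>-1] db=<db>,user=<user> PANIC:  <msg>"
PANIC_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \w+ \[(\d+)\]: "
    r".*?db=(\w*),user=(\w*)\s+PANIC:\s+(.*)"
)


def panic_events(lines):
    """Return (timestamp, pid, database, message) for every PANIC line."""
    events = []
    for line in lines:
        m = PANIC_RE.search(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((ts, int(m.group(2)), m.group(3), m.group(5).strip()))
    return events


def gaps_in_minutes(events):
    """Return the time between consecutive PANIC events, in minutes."""
    times = [e[0] for e in events]
    return [round((b - a).total_seconds() / 60, 1) for a, b in zip(times, times[1:])]
```

Feeding the log file line by line into panic_events() and then into
gaps_in_minutes() shows whether the crashes track the load or a fixed
schedule, and which database each crashing backend was connected to.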

While researching, I saw that it could be index corruption...

I took the database down, restricted user access, and ran a reindex of
all tables in batch, with a script.

During this process I hit the same problem twice, when the reindex reached
a particular table. Instead of running the batch script, I then went
table by table, and it got past the point where it was failing.
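
A minimal sketch of the table-by-table approach, assuming the list of tables
is already known (for example, taken from pg_tables). It emits one REINDEX
TABLE statement per table, so the statements can be run one at a time and the
table that triggers the failure is easy to identify. The helper names are
illustrative:

```python
def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def reindex_statements(tables):
    """Return one REINDEX TABLE statement per (schema, table) pair."""
    return [
        f"REINDEX TABLE {quote_ident(schema)}.{quote_ident(table)};"
        for schema, table in tables
    ]
```

Each returned string can be written to its own line of a script and fed to
psql, logging the statement before it runs so the last one logged before a
crash points at the suspect table.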

The reindex of all tables finished, and I brought the database back up...

Two hours later, as the load increased, the same thing happened:
*10.11.0.2 2013-09-26 17:10:37 BRT [11516]: [1-1] db=cimed,user=postgres
PANIC:  right sibling's left-link doesn't match*
*10.11.0.2 2013-09-26 17:25:10 BRT [13163]: [1-1] db=cimed,user=postgres
PANIC:  right sibling's left-link doesn't match*

*This message also worried me, and I have no idea what it could be:*
*
10.35.0.2 2013-09-26 18:01:28 BRT [16124]: [1-1] db=nutracom,user=postgres
WARNING:  could not remove relation 1663/105809227/572937579: Arquivo ou
diretório não encontrado
10.35.0.2 2013-09-26 18:01:28 BRT [16124]: [2-1] db=nutracom,user=postgres
WARNING:  could not remove relation 1663/105809227/572937581: Arquivo ou
diretório não encontrado
10.35.0.2 2013-09-26 18:01:28 BRT [16124]: [3-1] db=nutracom,user=postgres
WARNING:  could not remove relation 1663/105809227/572937583: Arquivo ou
diretório não encontrado
10.35.0.2 2013-09-26 18:01:28 BRT [16124]: [4-1] db=nutracom,user=postgres
WARNING:  could not remove relation 1663/105809227/572937585: Arquivo ou
diretório não encontrado
10.35.0.2 2013-09-26 18:01:28 BRT [16124]: [5-1] db=nutracom,user=postgres
WARNING:  could not remove relation 1663/105809227/572937586: Arquivo ou
diretório não encontrado
10.35.0.2 2013-09-26 18:01:28 BRT [16124]: [6-1] db=nutracom,user=postgres
WARNING:  could not remove relation 1663/105809227/572937588: Arquivo ou
diretório não encontrado
10.35.0.2 2013-09-26 18:01:28 BRT [16124]: [7-1] db=nutracom,user=postgres
WARNING:  could not remove relation 1663/105809227/572937590: Arquivo ou
diretório não encontrado
10.35.0.2 2013-09-26 18:01:28 BRT [16124]: [8-1] db=nutracom,user=postgres
WARNING:  could not remove relation 1663/105809227/572937597: Arquivo ou
diretório não encontrado

*
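
The three numbers in each "could not remove relation" warning can be mapped
back to catalog objects. A minimal sketch, assuming they are tablespace OID /
database OID / relfilenode in that order (1663 is the OID of the default
pg_default tablespace); the function names are illustrative, and the pg_class
query has to be run while connected to the database identified by the second
number:

```python
import re

REL_RE = re.compile(r"could not remove relation (\d+)/(\d+)/(\d+)")


def parse_relation(line):
    """Split the path in a 'could not remove relation' warning into its parts."""
    m = REL_RE.search(line)
    if m is None:
        return None
    spc, db, rel = (int(g) for g in m.groups())
    return {"tablespace_oid": spc, "database_oid": db, "relfilenode": rel}


def lookup_queries(ids):
    """Return catalog queries that name the tablespace, database and relation."""
    return [
        f"SELECT spcname FROM pg_tablespace WHERE oid = {ids['tablespace_oid']};",
        f"SELECT datname FROM pg_database WHERE oid = {ids['database_oid']};",
        # Run this one inside the database found by the previous query.
        f"SELECT relname, relkind FROM pg_class WHERE relfilenode = {ids['relfilenode']};",
    ]
```

If the pg_class query returns no row, the relfilenode no longer belongs to any
live relation in that database, which is one possibility to check here.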
I am not sure how to proceed: dump/restore of the database, drop/create of
the indexes, or some other approach.

Linux server, PostgreSQL 8.1.18. This is the production server, which is
otherwise stable, with disk space and memory to spare.
I have a single postgres instance with several databases.

Could you help me?

Looking forward to your replies,
_______________________________________________
pgbr-geral mailing list
pgbr-geral@listas.postgresql.org.br
https://listas.postgresql.org.br/cgi-bin/mailman/listinfo/pgbr-geral
