Further to what I wrote before:
gluster server overload; recovers, now "Transport endpoint is not
connected" for some files

I'm getting conflicting information here.  On one hand, the peer whose
glusterfsd locked up appears to be back in the cluster, according to
the frequently referenced 'gluster peer status':

Thu Aug 02 15:48:46 [1.00 0.89 0.92]  root@pbs1:~
729 $ gluster peer status
Number of Peers: 3

Hostname: pbs4ib
Uuid: 2a593581-bf45-446c-8f7c-212c53297803
State: Peer in Cluster (Connected)

Hostname: pbs2ib
Uuid: 26de63bd-c5b7-48ba-b81d-5d77a533d077
State: Peer in Cluster (Connected)

Hostname: pbs3ib
Uuid: c79c4084-d6b9-4af9-b975-40dd6aa99b42
State: Peer in Cluster (Connected)

On the other hand, there are errors like this one, which I posted yesterday:
[2012-08-01 18:07:26.104910] W [dht-selfheal.c:875:dht_selfheal_directory] 0-gli-dht: 1 subvolumes down -- not fixing

as well as this information:
$ gluster volume status all detail

[top 2 brick stanzas trimmed; they're online]
Brick                : Brick pbs3ib:/bducgl
Port                 : 24018
Online               : N                   <<=====================
Pid                  : 20953
File System          : xfs
Device               : /dev/md127
Mount Options        : rw
Inode Size           : 256
Disk Space Free      : 6.1TB
Total Disk Space     : 8.2TB
Inode Count          : 1758158080
Free Inodes          : 1752326373
Brick                : Brick pbs4ib:/bducgl
Port                 : 24009
Online               : Y
Pid                  : 20948
File System          : xfs
Device               : /dev/sda
Mount Options        : rw
Inode Size           : 256
Disk Space Free      : 4.6TB
Total Disk Space     : 6.4TB
Inode Count          : 1367187392
Free Inodes          : 1361305613

The above implies fairly strongly that the brick on pbs3ib did not
re-establish its connection to the volume, although 'gluster peer
status' reports that peer as connected.
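
Since 'gluster peer status' can look healthy while a brick is still
offline, it may help to check the brick state directly.  The following
is a minimal sketch, not a tested tool: offline_bricks is a hypothetical
helper name, and it assumes the 'gluster volume status all detail'
output keeps the "Brick : ..." / "Online : Y|N" layout shown above,
which may differ in other gluster versions.

```shell
#!/bin/sh
# offline_bricks: read 'gluster volume status all detail' output on stdin
# and print the name of every brick whose "Online" field starts with N.
# Assumption: the field layout matches the output pasted in this message.
offline_bricks() {
    awk '
        # Remember the most recent brick name, dropping the "Brick : Brick " prefix.
        /^Brick[ \t]*:/ {
            brick = $0
            sub(/^Brick[ \t]*:[ \t]*(Brick[ \t]+)?/, "", brick)
        }
        # Report the remembered brick when its Online value begins with N.
        /^Online[ \t]*:/ {
            state = $0
            sub(/^Online[ \t]*:[ \t]*/, "", state)
            if (substr(state, 1, 1) == "N") print brick
        }
    '
}

# Example use on a gluster server (commented out so the sketch can be sourced):
# gluster volume status all detail | offline_bricks
```

Fed the brick stanzas quoted above, this should print only
pbs3ib:/bducgl; an empty result would mean every listed brick reports
itself online.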

Strangely enough, when I restarted glusterd a second time, the brick
DID come back and rejoin the gluster volume.  The (restarted)
fix-layout job is now proceeding without those "subvolumes down -- not
fixing" errors, just a steady stream of 'found anomalies/fixing the
layout' messages, though at the rate it's going it looks like it will
take several days.

Still, it's better to spend several days fixing the data on disk with
the fs live than to tell users that their data is gone and then
rebuild from zero.  Luckily, it's officially a /scratch filesystem.


Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
Gluster-users mailing list
