elb...@host1:~$ dpkg -l|grep glusterfs
ii  glusterfs-client  1.3.8-0pre2  GlusterFS fuse client
ii  glusterfs-server  1.3.8-0pre2  GlusterFS fuse server
ii  libglusterfs0     1.3.8-0pre2  GlusterFS libraries and translator modules

I have 2 hosts set up to use AFR with the package versions listed above. I have been experiencing an issue where a file copied to glusterfs is readable and writable for a while, then at some point it ceases to be. Any attempt to access it returns only the error message "cannot open `filename' for reading: Input/output error".

Files enter glusterfs either via the "cp" command from a client or via "rsync". In the case of cp, the clients are all local and copy across a very fast connection. In the case of rsync, the single client is itself a gluster client running a later version of gluster that we are testing out, and it rsyncs across a VPN.

elb...@host2:~$ dpkg -l|grep glusterfs
ii  glusterfs-client     2.0.1-1  clustered file-system
ii  glusterfs-server     2.0.1-1  clustered file-system
ii  libglusterfs0        2.0.1-1  GlusterFS libraries and translator modules
ii  libglusterfsclient0  2.0.1-1  GlusterFS client library

=========
What causes files to become inaccessible? I read that fstat() had a bug in version 1.3.x whereas stat() did not, and that it was being worked on. Could this be related?

When a file becomes inaccessible, I have been manually removing the file from the mount point, then copying it back in via scp. Then the file becomes accessible. Below I've pasted a sample of what I'm seeing.

elb...@tool3.:hourlogs$ cd myDir
ls 1244682000.log
elb...@tool3.:myDir$ ls 1244682000.log
1244682000.log
elb...@tool3.:myDir$ stat 1244682000.log
  File: `1244682000.log'
  Size: 40265114        Blocks: 78744      IO Block: 4096   regular file
Device: 15h/21d Inode: 42205749    Links: 1
Access: (0755/-rwxr-xr-x) Uid: ( 1003/ elbert) Gid: ( 6000/ ops)
Access: 2009-06-11 02:25:10.000000000 +0000
Modify: 2009-06-11 02:26:02.000000000 +0000
Change: 2009-06-11 02:26:02.000000000 +0000
elb...@tool3.:myDir$ tail 1244682000.log
tail: cannot open `1244682000.log' for reading: Input/output error

At this point, I am able to rm the file. Then, if I scp it back in, I am able to successfully tail it.
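
To find every file in this state without tailing each one by hand, something like the following could be run against a directory on the mount point. It is only a minimal sketch: the function name find_unreadable is made up for illustration, and it assumes the symptom always surfaces as an EIO error when the file is opened or read, as in the transcript above.

```python
import errno
import os


def find_unreadable(directory):
    """Return names of regular files that stat() fine but fail to read with EIO."""
    broken = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and anything stat() cannot resolve
        try:
            with open(path, "rb") as handle:
                handle.read(1)  # one byte is enough to trigger the error
        except OSError as exc:
            if exc.errno == errno.EIO:
                broken.append(name)  # "Input/output error", the case described here
            else:
                raise  # permission problems etc. are a different issue
    return broken
```

Calling find_unreadable("myDir") from the parent directory would return the list of affected file names, which could then be removed and copied back in as described.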

I have also observed cases where the files had a Size of 0 but were otherwise in the same state. I'm not totally certain, but it looks like a file that gets into this state via rsync is either deposited in this state immediately (before I try to read it) or enters it very soon afterwards. File sizes generally range from several MB up to 150 MB.

Here's my server config:
# Gluster Server configuration /etc/glusterfs/glusterfs-server.vol
# Configured for AFR & Unify features

volume brick
 type storage/posix
 option directory /var/gluster/data/
end-volume

volume brick-ns
 type storage/posix
 option directory /var/gluster/ns/
end-volume

volume server
 type protocol/server
 option transport-type tcp/server
 subvolumes brick brick-ns
 option auth.ip.brick.allow 165.193.245.*,10.11.*
 option auth.ip.brick-ns.allow 165.193.245.*,10.11.*
end-volume

Here's my client config:
# Gluster Client configuration /etc/glusterfs/glusterfs-client.vol
# Configured for AFR & Unify features

volume brick1
 type protocol/client
 option transport-type tcp/client     # for TCP/IP transport
 option remote-host 10.11.16.68    # IP address of the remote brick
 option remote-subvolume brick        # name of the remote volume
end-volume

volume brick2
 type protocol/client
 option transport-type tcp/client
 option remote-host 10.11.16.71
 option remote-subvolume brick
end-volume

volume brick3
 type protocol/client
 option transport-type tcp/client
 option remote-host 10.11.16.69
 option remote-subvolume brick
end-volume

volume brick4
 type protocol/client
 option transport-type tcp/client
 option remote-host 10.11.16.70
 option remote-subvolume brick
end-volume

volume brick5
 type protocol/client
 option transport-type tcp/client
 option remote-host 10.11.16.119
 option remote-subvolume brick
end-volume

volume brick6
 type protocol/client
 option transport-type tcp/client
 option remote-host 10.11.16.120
 option remote-subvolume brick
end-volume

volume brick-ns1
 type protocol/client
 option transport-type tcp/client
 option remote-host 10.11.16.68
 option remote-subvolume brick-ns     # Note the different remote volume name.
end-volume

volume brick-ns2
 type protocol/client
 option transport-type tcp/client
 option remote-host 10.11.16.71
 option remote-subvolume brick-ns     # Note the different remote volume name.
end-volume

volume afr1
 type cluster/afr
 subvolumes brick1 brick2
end-volume

volume afr2
 type cluster/afr
 subvolumes brick3 brick4
end-volume

volume afr3
 type cluster/afr
 subvolumes brick5 brick6
end-volume

volume afr-ns
 type cluster/afr
 subvolumes brick-ns1 brick-ns2
end-volume

volume unify
 type cluster/unify
 subvolumes afr1 afr2 afr3
 option namespace afr-ns

 # use the ALU scheduler
 option scheduler alu

 # This option would make brick5 read-only, so that no new files are created on it.
 ##option alu.read-only-subvolumes brick5##

 # Don't create files on a volume with less than 10% free disk space
 option alu.limits.min-free-disk  10%

 # Don't create files on a volume with more than 10000 files open
 option alu.limits.max-open-files 10000

 # When deciding where to place a file, first look at the disk-usage, then at
 # read-usage, write-usage, open files, and finally the disk-speed-usage.
 option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage

# Kick in if the discrepancy in disk-usage between volumes is more than 2GB
 option alu.disk-usage.entry-threshold 2GB

# Don't stop writing to the least-used volume until the discrepancy is 1988MB
 option alu.disk-usage.exit-threshold  60MB

 # Kick in if the discrepancy in open files is 1024
 option alu.open-files-usage.entry-threshold 1024

 # Don't stop until 992 files have been written to the least-used volume
 option alu.open-files-usage.exit-threshold 32

 # Kick in when the read-usage discrepancy is 20%
 option alu.read-usage.entry-threshold 20%

 # Don't stop until the discrepancy has been reduced to 16% (20% - 4%)
 option alu.read-usage.exit-threshold 4%

 # Kick in when the write-usage discrepancy is 20%
 option alu.write-usage.entry-threshold 20%

## Don't stop until the discrepancy has been reduced to 16%
 option alu.write-usage.exit-threshold 4%

 # Refresh the statistics used for decision-making every 10 seconds
 option alu.stat-refresh.interval 10sec

# Refresh the statistics used for decision-making after creating 10 files
# option alu.stat-refresh.num-file-create 10
end-volume


#writebehind improves write performance a lot
volume writebehind
  type performance/write-behind
  option aggregate-size 131072 # in bytes
  subvolumes unify
end-volume
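
For anyone checking the ALU numbers in the unify volume: the comments on the exit thresholds all appear to follow the same arithmetic, entry threshold minus exit threshold. The sketch below just reproduces that arithmetic; the helper name stop_point is hypothetical, and the subtraction rule is an assumption read off the comments, not something taken from the scheduler source.

```python
def stop_point(entry_threshold, exit_threshold):
    """Discrepancy at which the scheduler stops favouring the least-used volume.

    Assumption: the exit threshold is subtracted from the entry threshold,
    as the comments in the unify volume suggest.
    """
    return entry_threshold - exit_threshold


MB = 1
GB = 1024 * MB

print(stop_point(2 * GB, 60 * MB))  # disk-usage, in MB
print(stop_point(1024, 32))         # open files
print(stop_point(20, 4))            # read-usage and write-usage, in percent
```

This prints 1988, 992 and 16, which matches the 1988MB, 992 files and 16% mentioned in the comments.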

Has anyone seen this issue before? Any suggestions?

Thanks,
-elb-