Hi Nicolas,

On Tue, Feb 3, 2009 at 12:01 AM, nicolas prochazka <prochazka.nico...@gmail.com> wrote:
> I inspected the log and found something interesting. Everything was OK,
> then I stopped 10.98.98.2 and restarted it:
>
> 2009-02-02 15:00:32 D [client-protocol.c:6498:notify] brick_10.98.98.2: got GF_EVENT_CHILD_UP
> 2009-02-02 15:00:32 D [socket.c:924:socket_connect] brick_10.98.98.2: connect () called on transport already connected
> 2009-02-02 15:00:32 N [client-protocol.c:5786:client_setvolume_cbk] brick_10.98.98.2: connection and handshake succeeded
> 2009-02-02 15:00:40 D [fuse-bridge.c:1945:fuse_statfs] glusterfs-fuse: 17399: STATFS
> 2009-02-02 15:00:40 D [fuse-bridge.c:368:fuse_entry_cbk] glusterfs-fuse: 17400: LOOKUP() / => 1 (1)
> 2009-02-02 15:00:42 D [client-protocol.c:5854:client_protocol_reconnect] brick_10.98.98.2: breaking reconnect chain
>
> Everything seems to be OK, but now I get this log (many times):
>
> 2009-02-02 15:07:05 D [client-protocol.c:2799:client_fstat] brick_10.98.98.2: (2148533016): failed to get remote fd. returning EBADFD
>
> Then I stop 10.98.98.1 (I thought 10.98.98.2 was OK, but the EBADFD suggests it is not!)

This is a known issue in afr for files which remain open across the time frame when a server goes down and comes back. Ideally afr should have issued a reopen for those files once the server comes back, but currently it is not doing so.

> 2009-02-02 15:10:30 D [page.c:644:ioc_frame_return] io-cache: locked local(0x6309d0)
> 2009-02-02 15:10:30 D [client-protocol.c:2799:client_fstat] brick_10.98.98.2: (2148533016): failed to get remote fd. returning EBADFD
> 2009-02-02 15:10:30 D [page.c:646:ioc_frame_return] io-cache: unlocked local(0x6309d0)
> 2009-02-02 15:10:30 D [io-cache.c:798:ioc_need_prune] io-cache: locked table(0x614320)
> 2009-02-02 15:10:30 D [io-cache.c:802:ioc_need_prune] io-cache: unlocked table(0x614320)
> 2009-02-02 15:10:30 D [client-protocol.c:2799:client_fstat] brick_10.98.98.1: (2148533016): failed to get remote fd. returning EBADFD
> 2009-02-02 15:10:30 D [io-cache.c:425:ioc_cache_validate_cbk] io-cache: cache for inode(0x7fdce0002780) is invalid. flushing all pages
>
> Now my client has fd problems with both servers.
>
> So there may be a problem: why does the client report EBADFD while 10.98.98.2 is online?
>
> Regards,
> Nicolas
>
> On Mon, Feb 2, 2009 at 3:30 PM, nicolas prochazka <prochazka.nico...@gmail.com> wrote:
>
>> Hi again,
>> one last test and one last log before I stop for today.
>> I made a change: I added "option read-subvolume brick_10.98.98.2" to the client config on 10.98.98.48,
>> and "option read-subvolume brick_10.98.98.1" to the client config on 10.98.98.44.
>>
>> 10.98.98.1 and 10.98.98.2 run as servers;
>> 10.98.98.44 and 10.98.98.48 run as clients.
>>
>> 1 - stop 10.98.98.2:
>> 10.98.98.48 keeps running and switches its reads to 10.98.98.1;
>> 10.98.98.44 keeps running, still on 10.98.98.1.
>>
>> 2 - restart 10.98.98.2, wait 5 minutes.
>>
>> 3 - stop 10.98.98.1:
>> the processes on 10.98.98.44 and 10.98.98.48 hang.
>>
>> I think the client cannot read from 10.98.98.2 again; is that normal?
>> 10.98.98.2 became ready again after its crash.
>>
>> Regards,
>> Nico
>>
>> On Mon, Feb 2, 2009 at 2:25 PM, nicolas prochazka <prochazka.nico...@gmail.com> wrote:
>>
>>> Hello,
>>> I am still trying to debug my strange blocking problem.
>>> I ran the client with logging, but the log is very large (100 MB), so I cannot send it to you; here is a summary:
>>>
>>> Servers: 10.98.98.1 and 10.98.98.2
>>> Clients: 10.98.98.44 and 10.98.98.48
>>>
>>> Test (all tests are performed with a big file, > 10 GB; sometimes the test hangs the process, sometimes the big file becomes corrupted, as if some data were missing):
>>>
>>> start the whole system: OK
>>> stop 10.98.98.2: client seems OK
>>> restart 10.98.98.2: sometimes it blocks
>>> stop 10.98.98.1: client 10.98.98.44 blocks; the last log is:
>>>
>>> 2009-02-02 13:53:59 D [io-cache.c:798:ioc_need_prune] io-cache: locked table(0x614320)
>>> 2009-02-02 13:53:59 D [io-cache.c:802:ioc_need_prune] io-cache: unlocked table(0x614320)
>>> 2009-02-02 13:53:59 D [client-protocol.c:1701:client_readv] brick_10.98.98.2: (2148533016): failed to get remote fd, returning EBADFD
>>>
>>> and if I restart 10.98.98.1, the client runs again (ls works), with log:
>>>
>>> 2009-02-02 14:03:18 D [fuse-bridge.c:1945:fuse_statfs] glusterfs-fuse: 40423: STATFS
>>> 2009-02-02 14:03:18 D [fuse-bridge.c:1945:fuse_statfs] glusterfs-fuse: 40424: STATFS
>>> 2009-02-02 14:03:33 D [fuse-bridge.c:1945:fuse_statfs] glusterfs-fuse: 40425: STATFS
>>>
>>> Client 10.98.98.48 does not block.
>>>
>>> On Fri, Jan 30, 2009 at 10:14 AM, nicolas prochazka <prochazka.nico...@gmail.com> wrote:
>>>
>>>> Hello,
>>>> first, thanks a lot for all your work.
>>>> Second: your tests pass for me too, but when I replace echo or tail with certain types of programs that open a file, qemu for example, there are a lot of problems. The process hangs. I also tried with --disable-direct-io-mode; then the process does not hang, but the file seems to be corrupted.
>>>> It's a very strange problem.
>>>>
>>>> Regards,
>>>> Nicolas Prochazka.
>>>>
>>>> 2009/1/30 Raghavendra G <raghaven...@zresearch.com>
>>>>
>>>>> nicolas,
>>>>>
>>>>> I have two servers, n1 and n2, which are AFRed from the client side. I am using the same configuration you finalized on, for which you are facing the problem. n1 is the first child of afr.
>>>>>
>>>>> on n1:
>>>>> ifconfig eth0 down (eth0 is the interface I am using for communicating with the server on n1)
>>>>>
>>>>> on the glusterfs mount:
>>>>> 1. ls (hangs for transport-timeout seconds but completes successfully after the timeout)
>>>>> 2. I also had a file opened with tail -f /mnt/glusterfs/file before bringing down eth0 on n1.
>>>>> 3. echo "content" >> /mnt/glusterfs/file appends to the file, and I was able to observe the content through tail -f.
>>>>>
>>>>> on n1:
>>>>> bring up eth0
>>>>>
>>>>> on the glusterfs mount:
>>>>> 1. ls (completes successfully without any problem).
>>>>> 2. echo "content-2" >> /mnt/glusterfs/file (also appends content-2 to the file, shown in the output of tail -f)
>>>>>
>>>>> From the above tests, it seems the bug is not reproducible in our setup. Is this similar to the procedure you followed to reproduce the bug? I am using glusterfs--mainline--3.0--patch-883.
>>>>>
>>>>> regards,
>>>>>
>>>>> On Fri, Jan 30, 2009 at 12:05 AM, Anand Avati <av...@zresearch.com> wrote:
>>>>>
>>>>>> Raghu/Krishna,
>>>>>> can you guys look into this? It seems like a serious flaw..
>>>>>>
>>>>>> avati
>>>>>>
>>>>>> On Thu, Jan 29, 2009 at 7:13 PM, nicolas prochazka <prochazka.nico...@gmail.com> wrote:
>>>>>> > hello again,
>>>>>> > to be more precise:
>>>>>> > now I can do 'ls /glustermountpoint' after the timeout in all cases; that's good.
>>>>>> > But for files which were opened before the crash of the first server, it does not work; the process seems to be blocked.
>>>>>> >
>>>>>> > Regards,
>>>>>> > Nicolas.
>>>>>>
>>>>>
>>>>> --
>>>>> Raghavendra G
>>>>>
>>>>
>>>
>>
>

--
Raghavendra G
_______________________________________________
Gluster-devel mailing list
Gluster-devel@nongnu.org
http://lists.nongnu.org/mailman/listinfo/gluster-devel