Are all the volumes being configured with sharding?

On Mon, Jan 7, 2019 at 5:35 PM Matt Waymack <mwaym...@nsgdv.com> wrote:
> I think that I can rule out the network, as I have multiple volumes on the
> same nodes and not all volumes are affected. Additionally, access via SMB
> using samba-vfs-glusterfs is not affected, even on the same volumes. This
> is seemingly only affecting the FUSE clients.
>
> *From:* Davide Obbi <davide.o...@booking.com>
> *Sent:* Sunday, January 6, 2019 12:26 PM
> *To:* Raghavendra Gowdappa <rgowd...@redhat.com>
> *Cc:* Matt Waymack <mwaym...@nsgdv.com>; gluster-users@gluster.org List
> <gluster-users@gluster.org>
> *Subject:* Re: [External] Re: [Gluster-users] Input/output error on FUSE log
>
> Hi,
>
> I would start with some basic checks. "(Input/output error)" appears to be
> returned by the operating system; this happens, for instance, when trying
> to access a file system on a device that is not available. So I would check
> the network connectivity between the clients and the servers, and between
> the servers themselves, during the reported time.
>
> Regards
> Davide
>
> On Sun, Jan 6, 2019 at 3:32 AM Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>
> On Sun, Jan 6, 2019 at 7:58 AM Raghavendra Gowdappa <rgowd...@redhat.com> wrote:
>
> On Sun, Jan 6, 2019 at 4:19 AM Matt Waymack <mwaym...@nsgdv.com> wrote:
>
> Hi all,
>
> I'm having a problem writing to our volume. When writing files larger than
> about 2GB, I get an intermittent issue where the write will fail and return
> Input/output error. This is also shown in the FUSE log of the client (this
> is affecting all clients).
> A snip of a client log is below:
>
> [2019-01-05 22:39:44.581371] W [fuse-bridge.c:2474:fuse_writev_cbk]
> 0-glusterfs-fuse: 51040978: WRITE => -1
> gfid=82a0b5c4-7ef3-43c2-ad86-41e16673d7c2 fd=0x7f949839a368 (Input/output error)
> [2019-01-05 22:39:44.598392] W [fuse-bridge.c:1441:fuse_err_cbk]
> 0-glusterfs-fuse: 51040979: FLUSH() ERR => -1 (Input/output error)
> [2019-01-05 22:39:47.420920] W [fuse-bridge.c:2474:fuse_writev_cbk]
> 0-glusterfs-fuse: 51041266: WRITE => -1
> gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949809b7f8 (Input/output error)
> [2019-01-05 22:39:47.433377] W [fuse-bridge.c:1441:fuse_err_cbk]
> 0-glusterfs-fuse: 51041267: FLUSH() ERR => -1 (Input/output error)
> [2019-01-05 22:39:50.441531] W [fuse-bridge.c:2474:fuse_writev_cbk]
> 0-glusterfs-fuse: 51041548: WRITE => -1
> gfid=0e8e1e13-97a5-478a-bc58-e81ddf3698a3 fd=0x7f949839a368 (Input/output error)
> [2019-01-05 22:39:50.451914] W [fuse-bridge.c:1441:fuse_err_cbk]
> 0-glusterfs-fuse: 51041549: FLUSH() ERR => -1 (Input/output error)
> The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search]
> 0-gv1-dht: no subvolume for hash (value) = 1311504267" repeated 1721 times
> between [2019-01-05 22:39:33.906241] and [2019-01-05 22:39:44.598371]
> The message "E [MSGID: 101046] [dht-common.c:1502:dht_lookup_dir_cbk]
> 0-gv1-dht: dict is null" repeated 1714 times between
> [2019-01-05 22:39:33.925981] and [2019-01-05 22:39:50.451862]
> The message "W [MSGID: 109011] [dht-layout.c:163:dht_layout_search]
> 0-gv1-dht: no subvolume for hash (value) = 1137142622" repeated 1707 times
> between [2019-01-05 22:39:39.636552] and [2019-01-05 22:39:50.451895]
>
> This looks to be a DHT issue. Some questions:
>
> * Are all subvolumes of DHT up, and is the client connected to them?
> Particularly the subvolume which contains the file in question.
>
> * Can you get all extended attributes of the parent directory of the file
> from all bricks?
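The two checks above can be run from the CLI; a sketch, assuming the volume name gv1 from this thread, with the brick path and parent-directory path as placeholders for wherever the affected file actually lives:

```shell
# Check that all brick processes are online (DHT subvolumes up)
gluster volume status gv1

# On each brick server, dump all extended attributes of the parent
# directory of the affected file, hex-encoded; trusted.glusterfs.dht
# holds the hash range the DHT layout assigns to this brick
getfattr -d -m . -e hex /exp/b1/gv1/path/to/parent-dir
```

Comparing the trusted.glusterfs.dht ranges across all bricks would show whether the layout has a gap that could explain the "no subvolume for hash" messages.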
> * Set diagnostics.client-log-level to TRACE, capture these errors again,
> and attach the client log file.
>
> I spoke a bit too early. dht_writev doesn't search the hashed subvolume,
> as it has already been looked up in lookup. So these messages look to be
> from a different issue, not the writev failure.
>
> This is intermittent for most files, but eventually, if a file is large
> enough, it will not write. The workflow is SFTP to the client, which then
> writes to the volume over FUSE. When files get to a certain point, we can
> no longer write to them. The file sizes are different as well, so it's not
> like they all get to the same size and just stop either. I've ruled out a
> free-space issue; our files at their largest are only a few hundred GB,
> and we have tens of terabytes free on each brick. We are also sharding at 1GB.
>
> I'm not sure where to go from here, as the error seems vague and I can
> only see it in the client log. I'm not seeing these errors on the nodes
> themselves. This is also seen if I mount the volume via FUSE on any of
> the nodes, and it is only reflected in the FUSE log.
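Changing the client log level is done with the volume set command; a sketch, using the gv1 volume name from this thread. TRACE is extremely verbose, so it should be reverted once the error has been reproduced:

```shell
# Raise the FUSE client log verbosity to capture the failing write
gluster volume set gv1 diagnostics.client-log-level TRACE

# ...reproduce the Input/output error, collect the client log...

# Restore the default level so the log does not fill the disk
gluster volume set gv1 diagnostics.client-log-level INFO
```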
> Here is the volume info:
>
> Volume Name: gv1
> Type: Distributed-Replicate
> Volume ID: 1472cc78-e2a0-4c3f-9571-dab840239b3c
> Status: Started
> Snapshot Count: 0
> Number of Bricks: 8 x (2 + 1) = 24
> Transport-type: tcp
> Bricks:
> Brick1: tpc-glus4:/exp/b1/gv1
> Brick2: tpc-glus2:/exp/b1/gv1
> Brick3: tpc-arbiter1:/exp/b1/gv1 (arbiter)
> Brick4: tpc-glus2:/exp/b2/gv1
> Brick5: tpc-glus4:/exp/b2/gv1
> Brick6: tpc-arbiter1:/exp/b2/gv1 (arbiter)
> Brick7: tpc-glus4:/exp/b3/gv1
> Brick8: tpc-glus2:/exp/b3/gv1
> Brick9: tpc-arbiter1:/exp/b3/gv1 (arbiter)
> Brick10: tpc-glus4:/exp/b4/gv1
> Brick11: tpc-glus2:/exp/b4/gv1
> Brick12: tpc-arbiter1:/exp/b4/gv1 (arbiter)
> Brick13: tpc-glus1:/exp/b5/gv1
> Brick14: tpc-glus3:/exp/b5/gv1
> Brick15: tpc-arbiter2:/exp/b5/gv1 (arbiter)
> Brick16: tpc-glus1:/exp/b6/gv1
> Brick17: tpc-glus3:/exp/b6/gv1
> Brick18: tpc-arbiter2:/exp/b6/gv1 (arbiter)
> Brick19: tpc-glus1:/exp/b7/gv1
> Brick20: tpc-glus3:/exp/b7/gv1
> Brick21: tpc-arbiter2:/exp/b7/gv1 (arbiter)
> Brick22: tpc-glus1:/exp/b8/gv1
> Brick23: tpc-glus3:/exp/b8/gv1
> Brick24: tpc-arbiter2:/exp/b8/gv1 (arbiter)
> Options Reconfigured:
> performance.cache-samba-metadata: on
> performance.cache-invalidation: off
> features.shard-block-size: 1000MB
> features.shard: on
> transport.address-family: inet
> nfs.disable: on
> cluster.lookup-optimize: on
>
> I'm a bit stumped on this; any help is appreciated. Thank you!
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users@gluster.org
> https://lists.gluster.org/mailman/listinfo/gluster-users
>
> --
> *Davide Obbi*
> Senior System Administrator
> Booking.com B.V.
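Since the failures start at roughly 2GB while features.shard-block-size is 1000MB, a controlled large write over the FUSE mount would cross several shard boundaries and should reproduce the error deterministically. A sketch; the mount point /mnt/gv1 and the test filename are placeholders:

```shell
# Write a 3 GB test file through the FUSE mount; with a 1000MB shard
# block size this spans multiple shards, matching the failing workload
dd if=/dev/zero of=/mnt/gv1/shard-test.bin bs=1M count=3072

# The shards beyond the first block are stored under the hidden .shard
# directory on the bricks; inspecting it shows how far the write got
ls -l /exp/b1/gv1/.shard/ | head
```

If dd fails with Input/output error at a consistent offset, correlating that offset with the shard boundaries would help narrow down whether the shard translator or DHT placement of the shard files is involved.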
> Vijzelstraat 66-80 Amsterdam 1017HL Netherlands
> Direct +31207031558
> *[image: Booking.com] <https://www.booking.com/>*
> Empowering people to experience the world since 1996
> 43 languages, 214+ offices worldwide, 141,000+ global destinations,
> 29 million reported listings
> Subsidiary of Booking Holdings Inc. (NASDAQ: BKNG)
_______________________________________________
Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users