Re: [Gluster-users] Transport Endpoint Not Connected When Writing a Lot of Files

2019-11-08 Thread DUCARROZ Birgit

Hi again,

This time I seem to have more information:
I also get the error when there is not a lot of network traffic.
Actually, the following broadcast message was sent:

root@nas20:/var/log/glusterfs#
Broadcast message from systemd-journald@nas20 (Fri 2019-11-08 12:20:25 CET):

bigdisk-brick1-vol-users[6115]: [2019-11-08 11:20:25.849956] M [MSGID: 113075] [posix-helpers.c:1962:posix_health_check_thread_proc] 0-vol-users-posix: health-check failed, going down



Broadcast message from systemd-journald@dnas20 (Fri 2019-11-08 12:20:25 CET):


bigdisk-brick1-vol-users[6115]: [2019-11-08 11:20:25.850170] M [MSGID: 113075] [posix-helpers.c:1981:posix_health_check_thread_proc] 0-vol-users-posix: still alive! -> SIGTERM


The only thing that helps is to stop and restart the volume, followed by a mount -a.
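
For reference, the recovery cycle looks roughly like this (a sketch only: the volume name "vol-users" is taken from the brick log above, and it assumes the client mount is listed in /etc/fstab so that "mount -a" picks it up again):

# Sketch of the stop/start + remount cycle described above.
gluster volume stop vol-users
gluster volume start vol-users
mount -a

# The brick goes down because the posix health check fails; the configured
# check interval can at least be inspected (or tuned) with:
gluster volume get vol-users storage.health-check-interval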

I already ran an fsck on the underlying volume; it reported no errors.

Any ideas?

Kind regards,
Birgit


On 16/10/19 10:18, DUCARROZ Birgit wrote:

Thank you for your response!

It does not reply, and none of the other servers respond on port 24007
either. I tested this from each server to each server: no reply at all,
even with the firewall disabled.

root@diufnas22:/home/diuf-sysadmin# netstat -tulpe | grep 24007
tcp    0    0 *:24007    *:*    LISTEN    root    47716960    141881/glusterd


Firewall:
8458,24007,24008,49150,49151,49152,49153,49154,49155,49156,49157,49158/tcp (v6) on bond1  ALLOW  Anywhere (v6)
8458,24007,24008,49150,49151,49152,49153,49154,49155,49156,49157,49158/tcp (v6) on bond0  ALLOW  Anywhere (v6)
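
For completeness, the check I ran between the nodes was roughly the following (a sketch; the host names are placeholders, and 24007 is glusterd's port as shown in the netstat output above):

# Test TCP reachability of glusterd (port 24007) from this node to each peer.
# Replace the peer names with the real hostnames or IPs.
for peer in nas20 diufnas22; do
    nc -zv -w 5 "$peer" 24007 || echo "glusterd NOT reachable on $peer"
done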


Kind regards,
Birgit


On 16/10/19 05:59, Amar Tumballi wrote:
Went through the design. In my opinion, this makes sense, i.e., as long as
you can use a better/faster network as an alternative path to reach the
server, it is fine. And considering Gluster servers are stateless, that
shouldn't cause a problem.


Coming to my suspicion of a network issue in this particular case: it comes
from the log entry which mentions 'No route to host'. That particular log
prints the errno from the RPC layer's connect() system call.


Snippet from `man 3 connect`:

       EHOSTUNREACH
              The destination host cannot be reached (probably because the
              host is down or a remote router cannot reach it).


It may be working from a few servers, but on the ones where you are getting
this error, can you check whether you can reach the server (the specific
IP), using 'ping' or 'telnet <server IP> 24007'?


Regards,
Amar




On Tue, Oct 15, 2019 at 5:00 PM DUCARROZ Birgit
<birgit.ducar...@unifr.ch> wrote:


    Hi,

    I am sending you this mail without copying the Gluster mailing list
    because of the PDF attachment (I do not want it to be published).

    Do you think it might be because of the different IP address of
    diufnas22 where the arbiter brick 3 is installed?

    Two hosts communicate over an internal 192.168.x.x network, whereas the
    third host, diufnas22, is connected to the other two hosts via a switch,
    using IP address 134.21.x.x. Both networks have a speed of 20 Gb
    (2 x 10 Gb using an LACP bond). (The attached scheme shows this.)

    I would be able to remove the 192.168.x.x network, but my aim was to
    speed up the setup by using internal communication between bricks 1 and 2.

    If this really is the problem, why does the installation mostly work,
    and only break down when there is heavy network usage with a lot of
    small files?


    Kind regards
    --     Birgit Ducarroz
    Unix Systems Administration
    Department of Informatics
    University of Fribourg Switzerland
    birgit.ducar...@unifr.ch
    Phone: +41 (26) 300 8342
    https://diuf.unifr.ch/people/ducarroz/
    INTRANET / SECURITY NEWS: https://diuf-file.unifr.ch

    On 14/10/19 12:47, Amar Tumballi wrote:
 > One of the hosts (134.21.57.122) is not reachable from your network.
 > Also, looking at the IP, it may have been resolved to something other
 > than what was expected. Can you check if 'diufnas22' is properly resolved?
 >
 > -Amar
 >
 > On Mon, Oct 14, 2019 at 3:44 PM DUCARROZ Birgit
 > <birgit.ducar...@unifr.ch> wrote:
 >
 >     Thank you.
 >     I checked the logs but the information was not clear to me.
 >
 >     I am attaching the logs of two different crashes. I will do an
 >     upgrade to GlusterFS 6 in a few weeks. At the moment I cannot
 >     interrupt user activity on these servers, since we are in the
 >     middle of the university semester.
 >
 >     If these logfiles reveal something interesting to you, it would be
 >     nice to get a hint.
 >
 >
 >     ol-data-client-2. Client process will keep trying to connect to
 >     glusterd until brick's port is available
 >     [2019-09-16 19:05:34.028164] E

Re: [Gluster-users] Sudden, dramatic performance drops with Glusterfs

2019-11-08 Thread Michael Rightmire

Hi Strahil,

Thanks for the reply. See below.

Also, as an aside, I tested by installing a single CentOS 7 machine with
the JBOD, installed Gluster and ZFS on Linux as recommended at

https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Administrator%20Guide/Gluster%20On%20ZFS/

then created a Gluster volume consisting of one brick made up of a local
ZFS raidz2, copied about 4 TB of data to it, and am having the same issue.


The biggest part of the issue is with things like "ls" and "find". If I
read a single file, or write a single file, it works great. But if I run
rsync (which does a lot of listing, writing, renaming, etc.) it is slow as
garbage, i.e. a find command that finishes in 30 seconds when run directly
on the underlying ZFS directory takes about an hour.
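
To make the comparison concrete, this is roughly what I am measuring (a sketch; the paths are the brick's backing directory and the Gluster FUSE mount from my setup):

# Same metadata-heavy operation, once on the brick's backing ZFS directory
# and once through the Gluster mount. Durations are the approximate ones
# mentioned above.
time find /zpool-homes/homes -type f > /dev/null   # directly on ZFS: ~30 seconds
time find /glusterfs/homes -type f > /dev/null     # through glusterfs: ~1 hour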



Strahil wrote on 08-Nov-19 05:39:


Hi Michael,

What is your 'gluster volume info <volname>' showing?

I've been playing with the install (since it's a fresh machine) so I 
can't give you verbatim output. However, it was showing two bricks, one 
on each server, started, and apparently healthy.


How full is your zpool? Usually when it gets too full, ZFS performance
drops seriously.



The zpool is only at about 30% usage. It's a new server setup.
We have about 10 TB of data on a 30 TB volume (made up of two 30 TB ZFS
raidz2 bricks, each residing on a different server, connected via a
dedicated 10 Gb Ethernet link).
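
(For what it's worth, this is how I check it; "zpool-homes" stands in for the actual pool name:)

zpool list zpool-homes                   # CAP / FRAG columns show usage and fragmentation
zfs list -o name,used,avail zpool-homes  # per-dataset usage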


Try to rsync a file directly to one of the bricks, then to the other 
brick (don't forget to remove the files after that, as gluster will 
not know about them).


If I rsync manually, or scp a file directly to the zpool bricks (outside
of Gluster), I get 30-100 MBytes/s (depending on what I'm copying).

If I rsync THROUGH Gluster (via the glusterfs mounts), I get 1-5 MBytes/s.


What are your mount options? Usually 'noatime,nodiratime' is a good start.



I'll try these. Currently using (mounting TO serverA):

serverA:/homes  /glusterfs/homes  glusterfs  defaults,_netdev  0 0
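
Something like the following is what I would try (untested here, so treat it as a sketch: the client-side options assume mount.glusterfs passes the generic noatime/nodiratime VFS options through, and on the brick side atime is a ZFS dataset property rather than an fstab option):

# Possible fstab variant for the Gluster client mount:
#   serverA:/homes  /glusterfs/homes  glusterfs  defaults,_netdev,noatime,nodiratime  0 0

# ZFS equivalent on the brick side ("zpool-homes" is the pool/dataset name
# from the volume create command quoted below):
zfs set atime=off zpool-homes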


Are you using ZFS provided by Ubuntu packages or directly from the ZOL
project?



ZFS provided by Ubuntu 18 repo...
  libzfs2linux/bionic-updates,now 0.7.5-1ubuntu16.6 amd64 [installed,automatic]
  zfs-dkms/bionic-updates,bionic-updates,now 0.7.5-1ubuntu16.6 all [installed]
  zfs-zed/bionic-updates,now 0.7.5-1ubuntu16.6 amd64 [installed,automatic]
  zfsutils-linux/bionic-updates,now 0.7.5-1ubuntu16.6 amd64 [installed]

Gluster provided by "add-apt-repository ppa:gluster/glusterfs-5" ...
  glusterfs 5.10
  Repository revision: git://git.gluster.org/glusterfs.git



Best Regards,
Strahil Nikolov

On Nov 6, 2019 12:50, Michael Rightmire  wrote:

Hello list!

I'm new to Glusterfs in general. We have chosen to use it as our
distributed file system on a new set of HA file servers.

The setup is:
2 x SUPERMICRO SuperStorage Server 6049PE1CR36L with 24 x 4 TB spinning
disks and NVMe for cache and slog
HBA, not a RAID card
Ubuntu 18.04 server (on both systems)
ZFS filestorage
Glusterfs 5.10

Step one was to install Ubuntu, ZFS, and gluster. This all went
without issue.
We have 3 identical ZFS raidz2 pools on both servers.
We have three mirrored (replica 2) GlusterFS volumes, one attached to each
raidz on each server, i.e.

We mounted the Gluster volumes as (for example) "/glusterfs/homes
-> /zpool/homes", i.e.
gluster volume create homes replica 2 transport tcp
server1:/zpool-homes/homes server2:/zpool-homes/homes force
(on server1) server1:/homes  44729413504  16032705152  28696708352  36%  /glusterfs/homes
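
The corresponding client mount is done roughly like this (a sketch; same mapping as the fstab entry quoted earlier in the thread):

mount -t glusterfs server1:/homes /glusterfs/homes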

The problem is, the performance has deteriorated terribly.
We needed to copy all of our data from the old server to the new
glusterfs volumes (approx. 60 TB).
We decided to do this with multiple rsync commands (around 400
simultaneous rsyncs).
The copy went well for the first 4 days, with an average across
all rsyncs of 150-200 MBytes per second.
Then, suddenly, on the fourth day, it dropped to about 50 MBytes/s.
Then, by the end of the day, down to ~5MBytes/s (five).
I've stopped the rsyncs, and I can still copy an individual file
across to the glusterfs shared directory at 100 MB/s.
But actions such as "ls -la" or "find" take forever!

Are there obvious flaws in my setup to correct?
How can I better troubleshoot this?
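
(One thing I could try, as a sketch, is Gluster's built-in profiling on the "homes" volume from the create command above:)

gluster volume profile homes start
# ...reproduce the slow "ls -la" / "find" on the mount...
gluster volume profile homes info    # per-brick FOP latency statistics
gluster volume profile homes stop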

Thanks!
-- 


Mike



--

Mike

Karlsruher Institut für Technologie (KIT)

Institut für Anthropomatik und Robotik (IAR)

Hochperformante Humanoide Technologien (H2T)

Michael Rightmire

B.Sci, HPUXCA, MCSE, MCP, VDB, ISCB

Systems IT/Development

Adenauerring 2, Gebäude 50.20, Raum 022

76131 Karlsruhe

Phone: +49 721 608-45032

Fax: +49 721 608-44077

E-Mail: michael.rightm...@kit.edu

http://www.humanoids.kit.edu/

http://h2t.anthropomatik.kit.edu 

KIT – The Research University in the Helmholtz Association

Since 2010, KIT has been certified as a family-friendly university