[Gluster-users] glfs_fallocate() looks completely broken on disperse volumes with sharding enabled

2020-04-09 Thread Dmitry Antipov

(closely related to https://github.com/gluster/glusterfs/issues/1148)

1) The unmodified test, which creates a replica 2 volume, passes:

# prove -vf ./tests/bugs/shard/zero-flag.t
./tests/bugs/shard/zero-flag.t ..
1..34
ok   1 [190/   1454] <  13> 'glusterd'
ok   2 [  9/  5] <  14> 'pidof glusterd'
ok   3 [  9/201] <  15> 'gluster --mode=script --wignore volume create patchy replica 2 localhost.localdomain:/d/backends/patchy0 localhost.localdomain:/d/backends/patchy1 localhost.localdomain:/d/backends/patchy2 localhost.localdomain:/d/backends/patchy3'
ok   4 [ 13/155] <  16> 'gluster --mode=script --wignore volume set patchy features.shard on'
ok   5 [ 30/198] <  17> 'gluster --mode=script --wignore volume set patchy features.shard-block-size 4MB'
ok   6 [ 30/   1463] <  18> 'gluster --mode=script --wignore volume start patchy'
ok   7 [ 11/ 57] <  20> '_GFS --attribute-timeout=0 --entry-timeout=0 --volfile-id=patchy --volfile-server=localhost.localdomain /mnt/glusterfs/0'
ok   8 [ 10/ 90] <  21> 'build_tester ./tests/bugs/shard/shard-fallocate.c -lgfapi -Wall -O2'
ok   9 [ 10/  7] <  25> 'touch /mnt/glusterfs/0/tmp'
ok  10 [ 13/  1] <  26> ''
ok  11 [ 10/  7] <  27> 'touch /mnt/glusterfs/0/file1'
ok  12 [ 20/  11049] <  31> './tests/bugs/shard/shard-fallocate localhost.localdomain patchy 0 0 6291456 /file1 /opt/glusterfs/var/log/glusterfs/glfs-patchy.log'
ok  13 [ 17/  4] <  33> '6291456 stat -c %s /mnt/glusterfs/0/file1'
ok  14 [ 10/  2] <  36> 'stat /d/backends/patchy0/.shard'
ok  15 [ 10/  2] <  37> 'stat /d/backends/patchy1/.shard'
ok  16 [  9/  2] <  38> 'stat /d/backends/patchy2/.shard'
ok  17 [  9/  2] <  39> 'stat /d/backends/patchy3/.shard'
ok  18 [ 25/  2] <  41> '2097152 echo 2097152 2097152'
ok  19 [  9/ 17] <  42> '1 file_all_zeroes /mnt/glusterfs/0/file1'
ok  20 [  8/  7] <  47> 'truncate -s 6M /mnt/glusterfs/0/file2'
ok  21 [  9/  6] <  48> 'dd if=/mnt/glusterfs/0/tmp of=/mnt/glusterfs/0/file2 bs=1 seek=3145728 count=26 conv=notrunc'
ok  22 [ 32/  11045] <  51> './tests/bugs/shard/shard-fallocate localhost.localdomain patchy 0 3145728 26 /file2 /opt/glusterfs/var/log/glusterfs/glfs-patchy.log'
ok  23 [ 17/  4] <  53> '6291456 stat -c %s /mnt/glusterfs/0/file2'
ok  24 [ 27/  2] <  54> '007d0186a1231a3a874a6aa09a1b7dcf echo 007d0186a1231a3a874a6aa09a1b7dcf'
ok  25 [  9/  7] <  59> 'touch /mnt/glusterfs/0/file3'
ok  26 [ 13/  7] <  63> 'dd if=/mnt/glusterfs/0/tmp of=/mnt/glusterfs/0/file3 bs=1 seek=9437184 count=26 conv=notrunc'
ok  27 [ 10/  2] <  64> '! stat /d/backends/patchy*/.shard/cfa95fe6-8367-4478-957c-edf8561dab21.1'
ok  28 [  9/  2] <  65> 'stat /d/backends/patchy0/.shard/cfa95fe6-8367-4478-957c-edf8561dab21.2 /d/backends/patchy1/.shard/cfa95fe6-8367-4478-957c-edf8561dab21.2'
ok  29 [ 52/  2] <  67> '1048602 echo 1048602 1048602'
ok  30 [ 15/  11046] <  69> './tests/bugs/shard/shard-fallocate localhost.localdomain patchy 0 5242880 1048576 /file3 /opt/glusterfs/var/log/glusterfs/glfs-patchy.log'
ok  31 [ 41/  2] <  70> 'db677843bb19004c18b71597026b2181 echo db677843bb19004c18b71597026b2181'
ok  32 [  9/  7] <  72> 'Y force_umount /mnt/glusterfs/0'
ok  33 [  9/   5152] <  73> 'gluster --mode=script --wignore volume stop patchy'
ok  34 [ 17/580] <  74> 'gluster --mode=script --wignore volume delete patchy'
ok
All tests successful.
Files=1, Tests=34, 43 wallclock secs ( 0.03 usr  0.00 sys +  1.14 cusr  0.67 csys =  1.84 CPU)
Result: PASS

2) Change the test volume to disperse 3 redundancy 1:

diff --git a/tests/bugs/shard/zero-flag.t b/tests/bugs/shard/zero-flag.t
index 1f39787ab..9332a7fc7 100644
--- a/tests/bugs/shard/zero-flag.t
+++ b/tests/bugs/shard/zero-flag.t
@@ -12,7 +12,7 @@ require_fallocate -z -l 512k $M0/file && rm -f $M0/file

 TEST glusterd
 TEST pidof glusterd
-TEST $CLI volume create $V0 replica 2 $H0:$B0/${V0}{0,1,2,3}
+TEST $CLI volume create $V0 disperse 3 redundancy 1 $H0:$B0/${V0}{0,1,2}
 TEST $CLI volume set $V0 features.shard on
 TEST $CLI volume set $V0 features.shard-block-size 4MB
 TEST $CLI volume start $V0
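
For reference, a minimal manual reproduction sketch of the same setup
outside the test harness (the host name "server1", the /bricks paths,
the volume name "testvol" and the log path are placeholders, not taken
from the report; the helper is the one built by the test from
tests/bugs/shard/shard-fallocate.c):

  # disperse volume with sharding, mirroring the patched zero-flag.t
  gluster volume create testvol disperse 3 redundancy 1 \
      server1:/bricks/b0 server1:/bricks/b1 server1:/bricks/b2 force
  gluster volume set testvol features.shard on
  gluster volume set testvol features.shard-block-size 4MB
  gluster volume start testvol

  mkdir -p /mnt/testvol
  mount -t glusterfs server1:/testvol /mnt/testvol
  touch /mnt/testvol/file1

  # same helper invocation as test 12 in section 1 (it calls
  # glfs_fallocate() through gfapi), only the host/volume/log differ
  ./tests/bugs/shard/shard-fallocate server1 testvol 0 0 6291456 /file1 \
      /var/log/glusterfs/glfs-testvol.log
  stat -c %s /mnt/testvol/file1   # the test expects 6291456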

3) The same test re-run on the disperse volume:

# prove -vf ./tests/bugs/shard/zero-flag.t
./tests/bugs/shard/zero-flag.t ..
1..34
ok   1 [191/   1435] <  13> 'glusterd'
ok   2 [  9/  5] <  14> 'pidof glusterd'
ok   3 [  9/159] <  15> 'gluster --mode=script --wignore volume create patchy disperse 3 redundancy 1 localhost.localdomain:/d/backends/patchy0 localhost.localdomain:/d/backends/patchy1 localhost.localdomain:/d/backends/patchy2'
ok   4 [  9/138] <  16> 'gluster --mode=script --wignore volume set patchy features.shard on'
ok   5 [ 10/135] <  17> 'gluster --mode=script --wignore volume set patchy features.shard-block-size 4MB'
ok   6 [ 12/   1397] <  18> 'gluster --mode=script --wignore volume start patchy'
ok   7 [ 12/ 43] <  20> '_GFS --attribute-timeout=0 --entry-timeout=0 --volfile-id

Re: [Gluster-users] Impressive boot times for big clusters: NFS, Image Objects, and Sharding

2020-04-09 Thread Hari Gowtham
Hi Erik,

It's great to hear positive feedback! Thanks for taking the time to send
this email. It means a lot to us :)

On Thu, Apr 9, 2020 at 10:55 AM Strahil Nikolov 
wrote:

> On April 8, 2020 10:15:27 PM GMT+03:00, Erik Jacobson <
> erik.jacob...@hpe.com> wrote:
> >I wanted to share some positive news with the group here.
> >
> >Summary: Using sharding and squashfs image files instead of expanded
> >directory trees for RO NFS OS images has led to impressive boot times
> >for 2,000-node diskless clusters using 12 servers for gluster+tftp+etc.
> >
> >Details:
> >
> >As you may have seen in some of my other posts, we have been using
> >gluster to boot giant clusters, some of which are in the top500 list of
> >HPC resources. The compute nodes are diskless.
> >
> >Up until now, we have done this by pushing an operating system from our
> >head node to the storage cluster, which is made up of one or more
> >3-server (3-brick) subvolumes in a distributed/replicate configuration.
> >The servers are also PXE-boot and tftpboot servers, and they serve the
> >"miniroot" (basically a fat initrd with a cluster manager toolchain).
> >We also locate other management functions there unrelated to boot and
> >root.
> >
> >This copy of the operating system is simply a directory tree
> >representing the whole operating system image. You could 'chroot' into
> >it, for example.
> >
> >So this operating system is a read-only NFS mount point used as a base
> >by all compute nodes to use as their root filesystem.
> >
> >This has been working well, getting us boot times (not including BIOS
> >startup) of between 10 and 15 minutes for a 2,000 node cluster.
> >Typically a
> >cluster like this would have 12 gluster/nfs servers in 3 subvolumes. On
> >simple
> >RHEL8 images without much customization, I tend to get 10 minutes.
> >
> >We have observed some slow-downs with certain job-launch workloads for
> >customers whose job launch is very metadata-intensive; such operations
> >put giant loads on the gluster servers.
> >
> >We recently started supporting RW NFS as opposed to TMPFS for the
> >writable components of root, since our customers tend to prefer to keep
> >every byte of memory for jobs. The solution we came up with hosts
> >per-node sparse image files, with XFS filesystems on top, in a writable
> >area of the gluster volume exported over NFS. This makes the RW NFS
> >solution very fast because it reduces per-node RW NFS metadata. Boot
> >times didn't go up significantly (our first attempt, which just used a
> >directory tree, was a slow disaster, hitting the worst case of lots of
> >small-file writes plus lots of metadata work). So we solved that
> >problem with XFS filesystem images on RW NFS, roughly as sketched
> >below.
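
A rough sketch of that writable-area idea, with purely illustrative
host names, paths and sizes (not taken from Erik's setup):

  # head node / leader: one sparse XFS image file per compute node,
  # stored in a writable area of the gluster volume exported over NFS
  truncate -s 10G /gluster-rw-export/nodes/n0001.img
  mkfs.xfs /gluster-rw-export/nodes/n0001.img

  # compute node: mount the RW NFS export, loop-mount its own image,
  # and use it for the writable parts of root (/etc, /var, ...)
  mount -o rw,nolock nfs-server:/nodes /mnt/rwnfs
  mount -o loop /mnt/rwnfs/n0001.img /mnt/rwroot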
> >
> >Building on that idea, we have, in our development branch, a version of
> >the solution that changes the RO NFS image to a squashfs file on a
> >sharded volume. That is, instead of each operating system being many
> >thousands of files that are (slowly) synced to the gluster servers, the
> >head node makes a squashfs file out of the image and pushes that. Then
> >all the compute nodes mount the squashfs image from the NFS mount
> >(mount the RO NFS export, then loop-mount the squashfs image), as in
> >the sketch below.
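
A similarly rough sketch of the squashfs flow, again with illustrative
names only:

  # head node: pack the expanded OS image tree into one squashfs file
  # and place it on the sharded gluster volume that is exported read-only
  mksquashfs /images/rhel8-tree /gluster-ro-export/images/rhel8.squashfs

  # compute node: mount the RO NFS export, then loop-mount the image
  # read-only and use it as the base root filesystem
  mount -o ro,nolock nfs-server:/images /mnt/images
  mount -o ro,loop /mnt/images/rhel8.squashfs /mnt/rootfs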
> >
> >On a 2,000-node cluster I had access to for a time, our prototype got
> >us boot times of 5 minutes -- including RO NFS with squashfs and the RW
> >NFS for writable areas like /etc and /var (on an XFS image file).
> >  * We also tried RW NFS with OVERLAY and had no problems there.
> >
> >I expect that, for people who prefer the non-expanded squashfs format,
> >we can reduce the leader-per-compute density, i.e. fewer leader servers
> >for the same number of compute nodes.
> >
> >Now, not all customers will want squashfs. Some want to be able to edit
> >a file and see the change instantly on all nodes. However, customers
> >looking for fast boot times, or who are suffering slowness on
> >metadata-intensive job-launch workloads, will have a new fast option.
> >
> >Therefore, it's very important we still solve the bug we're working on
> >in another thread. But I wanted to share something positive.
> >
> >So now I've said something positive instead of only asking for help :)
> >:)
> >
> >Erik
>
> Good Job Erik!
>
> Best Regards,
> Strahil Nikolov
> 

-- 
Regards,
Hari Gowtham.





Re: [Gluster-users] Replica 2 to replica 3

2020-04-09 Thread Valerio Luccio

Hi,

I'm afraid I still need some help.

When I originally set up my gluster (about 3 years ago), I set it up as 
Distributed-Replicated without specifying a replica count, and I believe 
that defaulted to replica 2. I have 4 servers with 3 RAIDs attached to 
each server. This was the result:


   Number of Bricks: 6 x 2 = 12
   Transport-type: tcp
   Bricks:
   Brick1: hydra1:/gluster1/data
   Brick2: hydra1:/gluster2/data
   Brick3: hydra1:/gluster3/data
   Brick4: hydra2:/gluster1/data
   Brick5: hydra2:/gluster2/data
   Brick6: hydra2:/gluster3/data
   Brick7: hydra3:/gluster1/data
   Brick8: hydra3:/gluster2/data
   Brick9: hydra3:/gluster3/data
   Brick10: hydra4:/gluster1/data
   Brick11: hydra4:/gluster2/data
   Brick12: hydra4:/gluster3/data

If I understand this correctly, I have 6 sub-volumes, with Brick2 a 
replica of Brick1, Brick4 of Brick3, etc. Correct?


I realize now that it would probably have been better to specify a 
different order, but now I cannot change it.


Now I want to store oVirt images on the Gluster volume, and oVirt requires 
either replica 1 or replica 3. I need to be able to reuse the bricks I 
have, and was planning to remove some bricks, initialize them, and add 
them back as a third replica.


Am I supposed to remove 6 bricks, one from each sub-volume? Will that 
work? Will I lose storage space? Can I just remove a brick from each 
server and use those for the third replica?


Thanks for all the help.

--
As a result of Coronavirus-related precautions, NYU and the Center for 
Brain Imaging operations will be managed remotely until further notice.
All telephone calls and e-mail correspondence are being monitored 
remotely during our normal business hours of 9am-5pm, Monday through 
Friday.
For MRI scanner-related emergency, please contact: Keith Sanzenbach at 
keith.sanzenb...@nyu.edu and/or Pablo Velasco at pablo.vela...@nyu.edu
For computer/hardware/software emergency, please contact: Valerio Luccio 
at valerio.luc...@nyu.edu
For TMS/EEG-related emergency, please contact: Chrysa Papadaniil at 
chr...@nyu.edu
For CBI-related administrative emergency, please contact: Jennifer 
Mangan at jennifer.man...@nyu.edu


Valerio Luccio  (212) 998-8736
Center for Brain Imaging    4 Washington Place, Room 158
New York University         New York, NY 10003

   "In an open world, who needs windows or gates ?"







Re: [Gluster-users] Replica 2 to replica 3

2020-04-09 Thread Strahil Nikolov
On April 9, 2020 5:16:47 PM GMT+03:00, Valerio Luccio  
wrote:
>[...]
>
>Am I supposed to remove 6 bricks, one from each sub-volume ? Will that 
>work ? Will I lose storage space ? Can I just remove a brick from each 
>server and use those for the replica 3 ?
>
>Thanks for all the help.

Hi Valerio,

You can find a small system and make it an arbiter, so you will end up with 
'replica 3 arbiter 1'.

The arbiter storage calculation is that you need about 4 KB for each file 
in the volume. If you don't know the file count, the general rule of thumb 
is to make the arbiter brick 1/1024 the size of the data brick (for 
example, a 10 TB data brick calls for roughly a 10 GB arbiter brick).

Keep in mind that it will be better if your arbiter has an SSD (for 
example mounted on /gluster) on which you can create 6 directories. The 
arbiter stores only metadata, so the SSD's random-access performance is 
the optimal approach.

Something like:
arbiter:/gluster/data1
arbiter:/gluster/data2
arbiter:/gluster/data3
arbiter:/gluster/data4
arbiter:/gluster/data5
arbiter:/gluster/data6
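
A sketch of that add-brick step (the volume name "myvol" and the host
name "arbiter" are placeholders; the six arbiter bricks go in one
command, one per existing replica pair, in the same order as the data
bricks):

  gluster volume add-brick myvol replica 3 arbiter 1 \
      arbiter:/gluster/data1 arbiter:/gluster/data2 \
      arbiter:/gluster/data3 arbiter:/gluster/data4 \
      arbiter:/gluster/data5 arbiter:/gluster/data6

  # afterwards, watch self-heal populate the new arbiter bricks
  gluster volume heal myvol info summary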


Of course, proper testing on a test volume is a good approach before 
implementing this in production.

Best Regards,
Strahil Nikolov







Re: [Gluster-users] Replica 2 to replica 3

2020-04-09 Thread Valerio Luccio

On 4/9/20 12:47 PM, Strahil Nikolov wrote:


> You can find a small system and make it an arbiter, so you will end up
> with 'replica 3 arbiter 1'.
> [...]


Thanks, that sounds like a great idea.

--

Valerio Luccio  (212) 998-8736
Center for Brain Imaging    4 Washington Place, Room 158
New York University         New York, NY 10003

   "In an open world, who needs windows or gates ?"







Re: [Gluster-users] gnfs split brain when 1 server in 3x1 down (high load) - help request

2020-04-09 Thread Erik Jacobson
Once again thanks for sticking with us. Here is a reply from Scott
Titus. If you have something for us to try, we'd love it. The code had
your patch applied when gdb was run:


Here is the addr2line output for those addresses.  Very interesting command, of
which I was not aware.

[root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x6f735
afr_lookup_metadata_heal_check
afr-common.c:2803
[root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x6f0b9
afr_lookup_done
afr-common.c:2455
[root@leader3 ~]# addr2line -f -e /usr/lib64/glusterfs/7.2/xlator/cluster/afr.so 0x5c701
afr_inode_event_gen_reset
afr-common.c:755

Thanks
-Scott


On Thu, Apr 09, 2020 at 11:38:04AM +0530, Ravishankar N wrote:
> 
> On 08/04/20 9:55 pm, Erik Jacobson wrote:
> > 9439138:[2020-04-08 15:48:44.737590] E 
> > [afr-common.c:754:afr_inode_event_gen_reset]
> > (-->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f735) 
> > [0x7fa4fb1cb735]
> > -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x6f0b9) 
> > [0x7fa4fb1cb0b9]
> > -->/usr/lib64/glusterfs/7.2/xlator/cluster/replicate.so(+0x5c701) 
> > [0x7fa4fb1b8701] )
> > 0-cm_shared-replicate-0: Resetting event gen for 
> > f2d7abf0-5444-48d6-863d-4b128502daf9
> > 
> Could you print the function/line no. of each of these 3 functions in the
> backtrace and see who calls afr_inode_event_gen_reset? `addr2line` should
> give you that info:
>  addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so 0x6f735
>  addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so 0x6f0b9
>  addr2line -f -e /your/path/to/lib/glusterfs/7.2/xlator/cluster/afr.so 0x5c701
> 
> 
> I think it is likely called from afr_lookup_done, which I don't think is
> necessary. I will send a patch for review. Once reviews are over, I will
> share it with you and if it fixes the issue in your testing, we can merge it
> with confidence.
> 
> Thanks,
> Ravi





Community Meeting Calendar:

Schedule -
Every 2nd and 4th Tuesday at 14:30 IST / 09:00 UTC
Bridge: https://bluejeans.com/441850968

Gluster-users mailing list
Gluster-users@gluster.org
https://lists.gluster.org/mailman/listinfo/gluster-users