[Cluster-devel] How can metadata (e.g., inode) in the GFS2 file system be shared between client nodes?

2019-08-09 Thread 한대규
Hi, I'm Daegyu from Sungkyunkwan University.

I'm curious how GFS2's filesystem metadata is shared between nodes. In detail, I wonder how the metadata in the memory of the node mounting GFS2 appears as a consistent filesystem to the other nodes. In addition, what role does corosync play in gfs2?

Thank you,
Daegyu

Re: [Cluster-devel] How can metadata (e.g., inode) in the GFS2 file system be shared between client nodes?

2019-08-09 Thread Andrew Price

Hi Daegyu,

On 09/08/2019 09:10, 한대규 wrote:

Hi, I'm Daegyu from Sungkyunkwan University.

I'm curious how GFS2's filesystem metadata is shared between nodes.


The key thing to know about gfs2 is that it is a shared storage 
filesystem where each node mounts the same storage device. It is 
different from a distributed filesystem where each node has storage 
devices that only it accesses.



In detail, I wonder how the metadata in the memory of the node mounting GFS2
appears as a consistent filesystem to the other nodes.


gfs2 uses dlm for locking of filesystem metadata among the nodes. The 
transfer of locks between nodes allows gfs2 to decide when its in-memory 
caches are invalid and require re-reading from the storage.
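
To make that a bit more concrete, below is a minimal, self-contained C sketch of the general idea. It is not gfs2 or dlm code, and all of the names in it are invented for illustration. The point it models is simply that a node may only trust its cached copy of some metadata while it last held the corresponding lock; once the lock has moved to another node, the cache is treated as stale and re-read from storage the next time the lock is acquired:

#include <stdio.h>
#include <string.h>

/* Illustrative only: one metadata block shared via a single lock. */

struct disk { char inode_data[64]; };		/* shared storage */

struct node {
	const char *name;
	char cache[64];				/* in-memory copy */
	int  cache_valid;			/* trusted only while we held the lock last */
};

static struct disk storage = { "inode: size=0" };
static struct node *last_holder = NULL;		/* stands in for the dlm's view */

static void acquire_lock(struct node *n)
{
	if (last_holder != n) {
		/* The lock was last held elsewhere: our copy may be stale. */
		n->cache_valid = 0;
		last_holder = n;
	}
	if (!n->cache_valid) {
		memcpy(n->cache, storage.inode_data, sizeof(n->cache));	/* re-read */
		n->cache_valid = 1;
		printf("%s: re-read metadata from storage\n", n->name);
	}
}

static void release_lock(struct node *n, int dirty)
{
	if (dirty)	/* write dirty metadata back before the lock can move */
		memcpy(storage.inode_data, n->cache, sizeof(storage.inode_data));
}

int main(void)
{
	struct node a = { "node A" }, b = { "node B" };

	acquire_lock(&a);
	strcpy(a.cache, "inode: size=4096");	/* A updates the inode */
	release_lock(&a, 1);

	acquire_lock(&b);			/* B must re-read and sees A's update */
	printf("node B sees: %s\n", b.cache);
	release_lock(&b, 0);
	return 0;
}

In real gfs2 the glock layer adds shared/exclusive modes, demote callbacks and write-back of dirty data before a lock is handed over, but the cache-invalidate-on-transfer idea is the same.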



In addition, what role does corosync play in gfs2?


gfs2 doesn't communicate with corosync directly but it operates on top 
of a high-availability cluster. corosync provides synchronization and 
coherency for the cluster. If a node stops responding, corosync will 
notice and trigger actions (fencing) to make sure that node is put back 
into a safe and consistent state. This is important in gfs2 to prevent 
"misbehaving" nodes from corrupting the filesystem.


Hope this helps.

Cheers,
Andy



Re: [Cluster-devel] How can metadata (e.g., inode) in the GFS2 file system be shared between client nodes?

2019-08-09 Thread Andrew Price

On 09/08/2019 12:01, Daegyu Han wrote:

Thank you for your reply.

If what I understand is correct,
In a gfs2 file system shared by clients A and B, if A creates /foo/a.txt,
does B re-read the filesystem metadata area on storage to keep the data
consistent?


Yes, that's correct, although 'clients' is inaccurate as there is no 
'server'. Through the locking mechanism, B would know to re-read block 
allocation states and the contents of the /foo directory, so a path 
lookup on B would then find a.txt.
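
From an application's point of view nothing special is needed for this; the coherence is handled underneath the normal system calls. A trivial, hypothetical illustration (assuming both nodes have the filesystem mounted at /mnt/gfs2 and that /mnt/gfs2/foo already exists): run the program below with argument A on one node and with no argument on the other.

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	if (argc > 1 && argv[1][0] == 'A') {	/* step A: create the file on node A */
		int fd = open("/mnt/gfs2/foo/a.txt", O_CREAT | O_WRONLY, 0644);
		if (fd < 0) { perror("open"); return 1; }
		close(fd);
	} else {				/* step B: ordinary lookup on node B */
		struct stat st;
		if (stat("/mnt/gfs2/foo/a.txt", &st) == 0)
			printf("a.txt visible, inode %llu\n",
			       (unsigned long long)st.st_ino);
		else
			perror("stat");
	}
	return 0;
}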



So, in the end, is it lock_dlm that makes gfs2 different from local
filesystems like ext4?


Exactly.


In general, if we mount an ext4 file system on two different clients and
update the file system on each client, we know that the changes made on one
client are not reflected on the other.


Yes.

Cheers,
Andy


Thank you,
Daegyu

On Fri, 9 Aug 2019 at 19:50, Andrew Price wrote:


Hi Daegyu,

On 09/08/2019 09:10, 한대규 wrote:

Hi, I'm Daegyu from Sungkyunkwan University.

I'm curious how GFS2's filesystem metadata is shared between nodes.


The key thing to know about gfs2 is that it is a shared storage
filesystem where each node mounts the same storage device. It is
different from a distributed filesystem where each node has storage
devices that only it accesses.


In detail, I wonder how the metadata in the memory of the node mounting GFS2
appears as a consistent filesystem to the other nodes.


gfs2 uses dlm for locking of filesystem metadata among the nodes. The
transfer of locks between nodes allows gfs2 to decide when its in-memory
caches are invalid and require re-reading from the storage.


In addition, what role does corosync play in gfs2?


gfs2 doesn't communicate with corosync directly but it operates on top
of a high-availability cluster. corosync provides synchronization and
coherency for the cluster. If a node stops responding, corosync will
notice and trigger actions (fencing) to make sure that node is put back
into a safe and consistent state. This is important in gfs2 to prevent
"misbehaving" nodes from corrupting the filesystem.

Hope this helps.

Cheers,
Andy









Re: [Cluster-devel] How can metadata (e.g., inode) in the GFS2 file system be shared between client nodes?

2019-08-09 Thread Daegyu Han
Thank you for your reply.

If what I understand is correct,
In a gfs2 file system shared by clients A and B, if A creates /foo/a.txt,
does B re-read the filesystem metadata area on storage to keep the data
consistent?

So, in the end, is it lock_dlm that makes gfs2 different from local
filesystems like ext4?

In general, if we mount an ext4 file system on two different clients and
update the file system on each client, we know that the changes made on one
client are not reflected on the other.

Thank you,
Daegyu

On Fri, 9 Aug 2019 at 19:50, Andrew Price wrote:

> Hi Daegyu,
>
> On 09/08/2019 09:10, 한대규 wrote:
> > Hi, I'm Daegyu from Sungkyunkwan University.
> >
> > I'm curious how GFS2's filesystem metadata is shared between nodes.
>
> The key thing to know about gfs2 is that it is a shared storage
> filesystem where each node mounts the same storage device. It is
> different from a distributed filesystem where each node has storage
> devices that only it accesses.
>
> > In detail, I wonder how the metadata in the memory of the node mounting
> GFS2
> > looks the consistent filesystem to other nodes.
>
> gfs2 uses dlm for locking of filesystem metadata among the nodes. The
> transfer of locks between nodes allows gfs2 to decide when its in-memory
> caches are invalid and require re-reading from the storage.
>
> > In addition, what role does corosync play in gfs2?
>
> gfs2 doesn't communicate with corosync directly but it operates on top
> of a high-availability cluster. corosync provides synchronization and
> coherency for the cluster. If a node stops responding, corosync will
> notice and trigger actions (fencing) to make sure that node is put back
> into a safe and consistent state. This is important in gfs2 to prevent
> "misbehaving" nodes from corrupting the filesystem.
>
> Hope this helps.
>
> Cheers,
> Andy
>
>
>


Re: [Cluster-devel] How can metadata (e.g., inode) in the GFS2 file system be shared between client nodes?

2019-08-09 Thread Daegyu Han
Thank you for the clarification.

I have one more question.

I've seen a web page by Red Hat which says that gfs2 has poor
filesystem performance (i.e. throughput) compared to xfs or ext4.
[image: image.png]

In a high performance hardware environment (nvme over fabric, infiniband
(56G)), I ran a FIO benchmark, expecting GFS2 to be comparable to local
filesystems (ext4, xfs).

Unexpectedly, however, GFS2 showed 25% lower IOPS or throughput than ext4,
in line with the results on that web page.

Does GFS2 perform worse than EXT4 or XFS even on high-performance network +
storage?

Thank you,
Daegyu

On Fri, 9 Aug 2019 at 20:26, Andrew Price wrote:

> On 09/08/2019 12:01, Daegyu Han wrote:
> > Thank you for your reply.
> >
> > If what I understand is correct,
> > In a gfs2 file system shared by clients A and B, if A creates /foo/a.txt,
> > does B re-read the filesystem metadata area on storage to keep the data
> > consistent?
>
> Yes, that's correct, although 'clients' is inaccurate as there is no
> 'server'. Through the locking mechanism, B would know to re-read block
> allocation states and the contents of the /foo directory, so a path
> lookup on B would then find a.txt.
>
> > After all, what makes gfs2 different from local filesystems like ext4,
> > because of lock_dlm?
>
> Exactly.
>
> > In general, if we mount an ext4 file system on two different clients and
> > update the file system on each client, we know that the file system state
> > is not reflected in each other.
>
> Yes.
>
> Cheers,
> Andy
>
> > Thank you,
> > Daegyu
> >
> > On Fri, 9 Aug 2019 at 19:50, Andrew Price wrote:
> >
> >> Hi Daegyu,
> >>
> >> On 09/08/2019 09:10, 한대규 wrote:
> >>> Hi, I'm Daegyu from Sungkyunkwan University.
> >>>
> >>> I'm curious how GFS2's filesystem metadata is shared between nodes.
> >>
> >> The key thing to know about gfs2 is that it is a shared storage
> >> filesystem where each node mounts the same storage device. It is
> >> different from a distributed filesystem where each node has storage
> >> devices that only it accesses.
> >>
> >>> In detail, I wonder how the metadata in the memory of the node mounting
> >> GFS2
> >>> looks the consistent filesystem to other nodes.
> >>
> >> gfs2 uses dlm for locking of filesystem metadata among the nodes. The
> >> transfer of locks between nodes allows gfs2 to decide when its in-memory
> >> caches are invalid and require re-reading from the storage.
> >>
> >>> In addition, what role does corosync play in gfs2?
> >>
> >> gfs2 doesn't communicate with corosync directly but it operates on top
> >> of a high-availability cluster. corosync provides synchronization and
> >> coherency for the cluster. If a node stops responding, corosync will
> >> notice and trigger actions (fencing) to make sure that node is put back
> >> into a safe and consistent state. This is important in gfs2 to prevent
> >> "misbehaving" nodes from corrupting the filesystem.
> >>
> >> Hope this helps.
> >>
> >> Cheers,
> >> Andy
> >>
> >>
> >>
> >
>


Re: [Cluster-devel] How can metadata (e.g., inode) in the GFS2 file system be shared between client nodes?

2019-08-09 Thread Andrew Price

On 09/08/2019 12:46, Daegyu Han wrote:

Thank you for the clarification.

I have one more question.

I've seen a web page by Red Hat which says that gfs2 has poor
filesystem performance (i.e. throughput) compared to xfs or ext4.
[image: image.png]

In a high performance hardware environment (nvme over fabric, infiniband
(56G)), I ran a FIO benchmark, expecting GFS2 to be comparable to local
filesystems (ext4, xfs).

Unexpectedly, however, GFS2 showed 25% lower IOPS or throughput than ext4,
in line with the results on that web page.

Does GFS2 perform worse than EXT4 or XFS even on high-performance network +
storage?


gfs2 has performance overheads that ext4 and xfs don't encounter due to 
the extra work it has to do to keep the fs consistent across the 
cluster, such as the extra cache invalidation we've discussed, journal 
flushing and updates to structures relating to quotas and statfs. Even 
in a single-node configuration, extra codepaths are still active (but 
gfs2 isn't meant to be used as a single-node fs, so fio is not a good 
demonstration of its strengths). It's also worth noting that gfs2 is not 
extent-based so you may see performance differences relating to that. We 
are continually working to minimise the overheads, of course.


The size of the performance difference is highly dependent on the 
workload and access pattern. (Clustered) applications looking to get the 
best performance out of gfs2 will have each node processing its own 
working set - preferably in its own subdirectory - which will minimise 
the overheads.
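
For what it's worth, here is a small sketch of what that advice can look like in application code; the mount point /mnt/gfs2 and the naming scheme are made up for the example. Each node derives a private subdirectory from its hostname and keeps its working files there, so directory and allocation locks are rarely bounced between nodes.

#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	char host[HOST_NAME_MAX + 1], dir[PATH_MAX], path[PATH_MAX];

	/* Derive a per-node working directory from the hostname so that
	 * different nodes rarely contend for the same directory glock. */
	if (gethostname(host, sizeof(host)) != 0) { perror("gethostname"); return 1; }
	snprintf(dir, sizeof(dir), "/mnt/gfs2/work-%s", host);
	if (mkdir(dir, 0755) != 0 && errno != EEXIST) { perror("mkdir"); return 1; }

	/* All of this node's output files live under its own subdirectory. */
	for (int i = 0; i < 10; i++) {
		snprintf(path, sizeof(path), "%s/result-%d.dat", dir, i);
		int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
		if (fd < 0) { perror("open"); return 1; }
		if (write(fd, "data\n", 5) != 5) { perror("write"); close(fd); return 1; }
		close(fd);
	}
	return 0;
}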


Cheers,
Andy


Thank you,
Daegyu

On Fri, 9 Aug 2019 at 20:26, Andrew Price wrote:


On 09/08/2019 12:01, Daegyu Han wrote:

Thank you for your reply.

If what I understand is correct,
In a gfs2 file system shared by clients A and B, if A creates /foo/a.txt,
does B re-read the filesystem metadata area on storage to keep the data
consistent?


Yes, that's correct, although 'clients' is inaccurate as there is no
'server'. Through the locking mechanism, B would know to re-read block
allocation states and the contents of the /foo directory, so a path
lookup on B would then find a.txt.


So, in the end, is it lock_dlm that makes gfs2 different from local
filesystems like ext4?


Exactly.


In general, if we mount an ext4 file system on two different clients and
update the file system on each client, we know that the changes made on one
client are not reflected on the other.


Yes.

Cheers,
Andy


Thank you,
Daegyu

On Fri, 9 Aug 2019 at 19:50, Andrew Price wrote:


Hi Daegyu,

On 09/08/2019 09:10, 한대규 wrote:

Hi, I'm Daegyu from Sungkyunkwan University.

I'm curious how GFS2's filesystem metadata is shared between nodes.


The key thing to know about gfs2 is that it is a shared storage
filesystem where each node mounts the same storage device. It is
different from a distributed filesystem where each node has storage
devices that only it accesses.


In detail, I wonder how the metadata in the memory of the node mounting GFS2
appears as a consistent filesystem to the other nodes.


gfs2 uses dlm for locking of filesystem metadata among the nodes. The
transfer of locks between nodes allows gfs2 to decide when its in-memory
caches are invalid and require re-reading from the storage.


In addition, what role does corosync play in gfs2?


gfs2 doesn't communicate with corosync directly but it operates on top
of a high-availability cluster. corosync provides synchronization and
coherency for the cluster. If a node stops responding, corosync will
notice and trigger actions (fencing) to make sure that node is put back
into a safe and consistent state. This is important in gfs2 to prevent
"misbehaving" nodes from corrupting the filesystem.

Hope this helps.

Cheers,
Andy













[Cluster-devel] [bug report] gfs2: dump fsid when dumping glock problems

2019-08-09 Thread Dan Carpenter
Hello Bob Peterson,

The patch 3792ce973f07: "gfs2: dump fsid when dumping glock problems"
from May 9, 2019, leads to the following static checker warning:

fs/gfs2/glock.c:1796 gfs2_dump_glock()
error: format string overflow. buf_size: 270 length: 277

fs/gfs2/glock.c
  1785  void gfs2_dump_glock(struct seq_file *seq, struct gfs2_glock *gl, bool 
fsid)
  1786  {
  1787  const struct gfs2_glock_operations *glops = gl->gl_ops;
  1788  unsigned long long dtime;
  1789  const struct gfs2_holder *gh;
  1790  char gflags_buf[32];
  1791  char fs_id_buf[GFS2_FSNAME_LEN + 3 * sizeof(int) + 2];
   ^
This is the same as sizeof(sdp->sd_fsname);

  1792  struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
  1793  
  1794  memset(fs_id_buf, 0, sizeof(fs_id_buf));
  1795  if (fsid && sdp) /* safety precaution */
  1796  sprintf(fs_id_buf, "fsid=%s: ", sdp->sd_fsname);
^  ^^
So if sd_fsname is as large as "possible" we could be 7 characters over
the limit.

  1797  dtime = jiffies - gl->gl_demote_time;
  1798  dtime *= 1000000/HZ; /* demote time in uSec */
  1799  if (!test_bit(GLF_DEMOTE, &gl->gl_flags))
  1800  dtime = 0;
  1801  gfs2_print_dbg(seq, "%sG:  s:%s n:%u/%llx f:%s t:%s d:%s/%llu 
a:%d "

See also:
fs/gfs2/util.c:184 gfs2_consist_rgrpd_i() error: format string overflow. 
buf_size: 270 length: 277
fs/gfs2/rgrp.c:2293 gfs2_rgrp_error() error: format string overflow. buf_size: 
270 length: 277
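
To make the arithmetic concrete, here is a small userspace re-creation of it. The sizes are taken from the report above (GFS2_FSNAME_LEN assumed to be 256 with 4-byte int, giving the 270 in the warning); this is not the kernel code, and the snprintf at the end is only one possible way to bound the write, not necessarily the fix that went upstream.

#include <stdio.h>
#include <string.h>

#define GFS2_FSNAME_LEN 256

int main(void)
{
	char sd_fsname[GFS2_FSNAME_LEN + 3 * sizeof(int) + 2];	/* 270 bytes */
	char fs_id_buf[GFS2_FSNAME_LEN + 3 * sizeof(int) + 2];	/* also 270  */

	/* Worst case: sd_fsname completely filled (269 chars + NUL). */
	memset(sd_fsname, 'x', sizeof(sd_fsname) - 1);
	sd_fsname[sizeof(sd_fsname) - 1] = '\0';

	/* "fsid=" (5) + fsname (269) + ": " (2) + NUL (1) = 277 bytes needed. */
	size_t needed = strlen("fsid=") + strlen(sd_fsname) + strlen(": ") + 1;
	printf("buffer %zu, needed %zu, overflow by %zu\n",
	       sizeof(fs_id_buf), needed, needed - sizeof(fs_id_buf));

	/* snprintf at least bounds the write to the destination size: */
	snprintf(fs_id_buf, sizeof(fs_id_buf), "fsid=%s: ", sd_fsname);
	return 0;
}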

regards,
dan carpenter



Re: [Cluster-devel] How can metadata (e.g., inode) in the GFS2 file system be shared between client nodes?

2019-08-09 Thread Daegyu Han
Thank you for your explanation.


On Fri, 9 Aug 2019 at 21:26, Andrew Price wrote:

> On 09/08/2019 12:46, Daegyu Han wrote:
> > Thank you for the clarification.
> >
> > I have one more question.
> >
> > I've seen some web page by redhat and it says that gfs2 has a poor
> > filesystem performance (i.e. throughput) compared to xfs or ext4.
> > [image: image.png]
> >
> > In a high performance hardware environment (nvme over fabric, infiniband
> > (56G)), I ran a FIO benchmark, expecting GFS2 to be comparable to local
> > filesystems (ext4, xfs).
> >
> > Unexpectedly, however, GFS2 showed 25% lower IOPS or throughput than
> ext4,
> > as the web page results.
> >
> > Does GFS2 perform worse than EXT4 or XFS even on high-performance
> network +
> > storage?
>
> gfs2 has performance overheads that ext4 and xfs don't encounter due to
> the extra work it has to do to keep the fs consistent across the
> cluster, such as the extra cache invalidation we've discussed, journal
> flushing and updates to structures relating to quotas and statfs. Even
> in a single-node configuration, extra codepaths are still active (but
> gfs2 isn't meant to be used as a single-node fs, so fio is not a good
> demonstration of its strengths). It's also worth noting that gfs2 is not
> extent-based so you may see performance differences relating to that. We
> are continually working to minimise the overheads, of course.
>
> The size of the performance difference is highly dependent on the
> workload and access pattern. (Clustered) applications looking to get the
> best performance out of gfs2 will have each node processing its own
> working set - preferably in its own subdirectory - which will minimise
> the overheads.
>
> Cheers,
> Andy
>
> > Thank you,
> > Daegyu
> >
> >> On Fri, 9 Aug 2019 at 20:26, Andrew Price wrote:
> >
> >> On 09/08/2019 12:01, Daegyu Han wrote:
> >>> Thank you for your reply.
> >>>
> >>> If what I understand is correct,
> >>> In a gfs2 file system shared by clients A and B, if A creates
> /foo/a.txt,
> >>> does B re-read the filesystem metadata area on storage to keep the data
> >>> consistent?
> >>
> >> Yes, that's correct, although 'clients' is inaccurate as there is no
> >> 'server'. Through the locking mechanism, B would know to re-read block
> >> allocation states and the contents of the /foo directory, so a path
> >> lookup on B would then find a.txt.
> >>
> >>> After all, what makes gfs2 different from local filesystems like ext4,
> >>> because of lock_dlm?
> >>
> >> Exactly.
> >>
> >>> In general, if we mount an ext4 file system on two different clients
> and
> >>> update the file system on each client, we know that the file system
> state
> >>> is not reflected in each other.
> >>
> >> Yes.
> >>
> >> Cheers,
> >> Andy
> >>
> >>> Thank you,
> >>> Daegyu
> >>>
> >>> On Fri, 9 Aug 2019 at 19:50, Andrew Price wrote:
> >>>
>  Hi Daegyu,
> 
>  On 09/08/2019 09:10, 한대규 wrote:
> > Hi, I'm Daegyu from Sungkyunkwan University.
> >
> > I'm curious how GFS2's filesystem metadata is shared between nodes.
> 
>  The key thing to know about gfs2 is that it is a shared storage
>  filesystem where each node mounts the same storage device. It is
>  different from a distributed filesystem where each node has storage
>  devices that only it accesses.
> 
> > In detail, I wonder how the metadata in the memory of the node
> mounting
>  GFS2
> > looks the consistent filesystem to other nodes.
> 
>  gfs2 uses dlm for locking of filesystem metadata among the nodes. The
>  transfer of locks between nodes allows gfs2 to decide when its
> in-memory
>  caches are invalid and require re-reading from the storage.
> 
> > In addition, what role does corosync play in gfs2?
> 
>  gfs2 doesn't communicate with corosync directly but it operates on top
>  of a high-availability cluster. corosync provides synchronization and
>  coherency for the cluster. If a node stops responding, corosync will
>  notice and trigger actions (fencing) to make sure that node is put
> back
>  into a safe and consistent state. This is important in gfs2 to prevent
>  "misbehaving" nodes from corrupting the filesystem.
> 
>  Hope this helps.
> 
>  Cheers,
>  Andy
> 
> 
> 
> >>>
> >>
> >
>


[Cluster-devel] [GFS2 PATCH] gfs2: eliminate circular lock dependency in inode.c

2019-08-09 Thread Bob Peterson
Hi,

This patch fixes problems caused by regressions from patch
"GFS2: rm on multiple nodes causes panic" from 2008,
72dbf4790fc6736f9cb54424245114acf0b0038c, which was an earlier
attempt to fix very similar problems.

The original problem for which it was written had to do with
simultaneous link, unlink, rmdir and rename operations on
multiple nodes that interfered with one another, due to the
lock ordering. The problem was that the lock ordering was
not consistent between the operations.
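
As a generic illustration of why consistent ordering matters (plain pthreads below, nothing gfs2-specific): if two code paths take the same pair of locks in opposite orders they can deadlock, whereas agreeing on one global order, which is essentially what sorting a holder array gives you, makes that interleaving impossible. Build with 'cc -pthread'.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Two workers each need the same pair of locks but "think" of them in
 * opposite parent/child order.  Taking them in one global order (here:
 * by address) prevents the classic A-B / B-A deadlock. */

static pthread_mutex_t lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t lock_b = PTHREAD_MUTEX_INITIALIZER;

static void lock_pair_in_order(pthread_mutex_t *x, pthread_mutex_t *y)
{
	if ((uintptr_t)x > (uintptr_t)y) {	/* enforce one global order */
		pthread_mutex_t *t = x; x = y; y = t;
	}
	pthread_mutex_lock(x);
	pthread_mutex_lock(y);
}

static void unlock_pair(pthread_mutex_t *x, pthread_mutex_t *y)
{
	pthread_mutex_unlock(x);
	pthread_mutex_unlock(y);
}

static void *op1(void *arg)	/* sees the order as (a, b) */
{
	(void)arg;
	for (int i = 0; i < 100000; i++) {
		lock_pair_in_order(&lock_a, &lock_b);
		unlock_pair(&lock_a, &lock_b);
	}
	return NULL;
}

static void *op2(void *arg)	/* sees the order as (b, a) */
{
	(void)arg;
	for (int i = 0; i < 100000; i++) {
		lock_pair_in_order(&lock_b, &lock_a);
		unlock_pair(&lock_b, &lock_a);
	}
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, op1, NULL);
	pthread_create(&t2, NULL, op2, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	printf("completed without deadlock\n");
	return 0;
}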

The defective patch put in place to solve it (and hey, it
worked for more than 10 years) changed the lock ordering so
that the parent directory glock was always locked before the
child. This almost always worked. Almost. The rmdir version
was still wrong because the rgrp glock was added to the holder
array, which was sorted, and the locks were acquired in sorted
order. That is counter to the locking requirements documented
in Documentation/filesystems/gfs2-glocks.txt, which states that the
rgrp glock must always be locked after the inode glocks.

The real problem came with renames, though. Function
gfs2_rename(), which locked a series of inode glocks, did so
in parent-child order due to that patch. But it was still
possible to create circular lock dependencies just by doing the
wrong combination of renames on different nodes. For example:

Node a: mv /mnt/gfs2/sub /mnt/gfs2/tmp_name (rename sub to tmp_name)

a1. Same directory, so rename glock is NOT held
a2. /mnt/gfs2 is locked
a3. Tries to lock sub for rename, but it is locked on node b

Node b: mv /mnt/gfs2/sub /mnt/gfs2/dir1/ (move sub to dir1...
mv /mnt/gfs2/dir1/sub /mnt/gfs2/  ...then move it back)

b1. Different directory, so rename glock IS held
b2. /mnt/gfs2 is locked
b3. dir1 is locked
b4. sub is moved to dir1 and everything is unlocked
b5. Different directory, so rename glock IS held again
b6. dir1 is locked
b7. Lock for /mnt/gfs2 is requested, but cannot be granted because
node a locked it in step a2.

(Note that the nodes must be different, otherwise the vfs inode
level locking prevents the problem on a single node).

Thus, we get into a glock deadlock that looks like this:

host-018:
G:  s:EX n:2/3347 f:DyIqob t:EX d:UN/2368172000 a:0 v:0 r:3 m:150
[remainder of the glock dump and the start of the patch hunk are truncated in the archive]
...i_gl, LM_ST_EXCLUSIVE, 0, ghs);
gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, ghs + 1);
 
-   error = gfs2_glock_nq(ghs); /* parent */
+   error = gfs2_glock_nq_m(2, ghs); /* inodes */
if (error)
-   goto out_parent;
-
-   error = gfs2_glock_nq(ghs + 1); /* child */
-   if (error)
-   goto out_child;
+   goto out_uninit;
 
error = -ENOENT;
if (inode->i_nlink == 0)
@@ -1004,10 +1000,8 @@ static int gfs2_link(struct dentr