On Wed, 22 May 2019 at 21:15, <[1][email protected]> wrote:
> Kirill Kolyshkin:
> > Can you tell me what is your reason to not retry reading by
> > default? The code has just checked that this is an aufs mount so it
> > should definitely be present in /proc/mounts. Unless, of course,
> > this mount was unmounted by someone in between statfs() and
> > reading. If you have this exact case in mind (I can't think of
> > anything else) and don't want to retry because of efficiency, you
> > can add another statfs() after reading /proc/mounts and not finding
> > the mount -- that way you can be sure that the mount is still there
> > but it eluded /proc/mounts.
> Yes, such a race was in my mind.
>
> In other words, it is hard to identify the reason why /proc/mounts
> doesn't show the entry. Is it a problem of /proc/mounts, or did
> someone else unmount? The only way to make sure is to retry reading
> /proc/mounts. Or, do stat+statfs (exactly as you suggested below) to
> check that this is indeed the aufs root directory (and then you
> still need to retry reading /proc/mounts).
>
> Additionally, I guess (hope) such parallel mounts/unmounts are rare.
> And I wonder: is "2" the absolutely correct solution? Can "3" never
> happen? Never?
If you're asking about the number of retries, I chose 3 in my patch
just to be on the safe side. Let's assume the probability of not
finding a specific mount is 0.01, or 1%; then with a second retry it
becomes 0.01^2 = 0.0001, or 0.01%, and with three retries 0.01^3, or
0.0001%. In practice, I have never seen a case in my tests where the
second retry didn't work, but maybe we can do more testing to figure
out what a good number of retries should be.
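The retry logic I mean is roughly this (a simplified sketch, not the
actual patch: the naive substring scan stands in for proper
/proc/mounts parsing, and the package/function names are made up):

    // Simplified sketch of the retry loop discussed above.
    package aufs

    import (
        "io/ioutil"
        "strings"
    )

    const maxRetries = 3

    // mounted reports whether target shows up as a mount point in
    // /proc/mounts, re-reading the file up to maxRetries times.
    func mounted(target string) (bool, error) {
        for i := 0; i < maxRetries; i++ {
            data, err := ioutil.ReadFile("/proc/mounts")
            if err != nil {
                return false, err
            }
            // The mount point is the second space-separated field
            // of every /proc/mounts line.
            if strings.Contains(string(data), " "+target+" ") {
                return true, nil
            }
        }
        return false, nil
    }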
>
> Statfs(), you say, won't help, I am afraid. Even if it tells us that
> the dir is aufs, it is not proof of the aufs mountpoint. It can be a
> subdir of another aufs mount.
>
> An extra stat(2) call may help at this point. It will tell us the
> inode number, and if it is AUFS_ROOT_INO, then that path is the aufs
> mountpoint.
>
> But I wonder, do we really have to issue stat(2) and statfs(2) just
> to make sure the aufs mount is still there? Isn't it rather heavy
> and racy?
It is racy, though not very heavy, and I suggest doing it only in the
case where we tried to find the mount in /proc/mounts and failed,
before a retry. Reading /proc/mounts is super heavy compared to a
couple of syscalls, so it looks like a decent price to pay to avoid
re-reading it.
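Concretely, the pre-retry check could look like this (an untested
sketch; the two constant values are my reading of the aufs sources --
f_type "aufs" and root inode number 2 -- so treat them as
assumptions):

    // Untested sketch of the stat+statfs check, to be run after a
    // failed /proc/mounts scan and before the next retry. The constant
    // values are assumptions based on my reading of the aufs sources.
    package aufs

    import "syscall"

    const (
        aufsSuperMagic = 0x61756673 // "aufs" in ASCII
        aufsRootIno    = 2          // AUFS_ROOT_INO
    )

    // stillAufsRoot reports whether target still looks like the root
    // of an aufs mount, i.e. whether re-reading /proc/mounts is worth
    // the trouble.
    func stillAufsRoot(target string) bool {
        var fs syscall.Statfs_t
        if err := syscall.Statfs(target, &fs); err != nil || fs.Type != aufsSuperMagic {
            return false // unmounted, or not aufs at all
        }
        // statfs alone cannot tell the mountpoint apart from a subdir
        // of another aufs mount; the root inode number can.
        var st syscall.Stat_t
        if err := syscall.Stat(target, &st); err != nil {
            return false
        }
        return st.Ino == aufsRootIno
    }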
>
> > I have also taken a deeper look at that other error I mentioned
> > earlier. It turns out it's a race in au_xino_create(). In case the
> > xino mount option is provided (we use xino=/dev/shm/aufs.xino in
> > Docker), and multiple mounts (note: for different mount points) are
> > performed in parallel, one mount can reach this line of code:
> >
> >     file = vfsub_filp_open(fpath, O_RDWR | O_CREAT | O_EXCL | O_LARGEFILE, 0666);
> >
> > while another mount has already created that file, but hasn't
> > unlinked it yet.
> >
> > As a result, we have errors like these in the kernel log:
> >
> > [2233986.956753] aufs au_xino_create:767:dockerd[17144]: open /dev/shm/aufs.xino(-17)
> > [2233988.732636] aufs au_xino_create:767:dockerd[17518]: open /dev/shm/aufs.xino(-17)
> Thank you very much for the report.
> Here -17 means the EEXIST "File exists" error.
> It is expected behaviour (and I am glad to know it is working as
> expected). As you might know, the default path for XINO files is the
> top dir of the first writable branch, and a writable branch is not
> sharable between multiple aufs mounts. So by default XINO files are
> dedicated to a single aufs mount. Not shared, no conflict happens.
This is correct for read-write mounts, but what about read-only ones?
The doc says that if there is no writable branch, the path used will
be /tmp/.aufs.xino. Sure, the aufs kernel module code creates and
then removes this file, so another mount will create another file
(with the same name), but I see a clear possibility of a race in this
case (such as multiple read-only mounts happening at the same time).
>
> > Currently, I am working around this unfortunate issue by calling
> > mount(2) under an exclusive lock, to make sure no two aufs mounts
> > (again, for different mount points) are performed in parallel, but
> > perhaps there is a better way?
> >
> > I am going to mitigate this race by adding a random suffix to the
> > xino file name; do you think it is a decent workaround?
> If your first writable branch is somewhere on /dev/shm, then you can
> remove the "xino=" option. In this case, the XINO files will be
> created under /dev/shm and not shared. Moreover, the "xino=" option
> is generally something of a last resort. You may want "xino=" if the
> filesystem of your first writable branch doesn't support XINO, or if
> you want a little gain around the aufs internal XINO handling.
> Otherwise you can omit it.
My guess is, whoever wrote docker aufs support thought that using
tmpfs for the xino file is faster. Or maybe they saw advice (from
some old documentation, such as
[2]http://aufs.sourceforge.net/aufs2/brsync/README.txt) to never put
the xino file on an SSD (I'm sure many docker users use SSD drives).
>
> Of course, adding a random/unique suffix is a good idea. If I were
> you, I'd use $$ in the shell-script manner, like
>
>     mount -t aufs -obr=...,xino=/dev/shm/aufs.xino.$$ ...
My code is in Go, so I'm patching it this way:

    -       opts := "dio,xino=/dev/shm/aufs.xino"
    +       rnd := strconv.FormatInt(int64(1e9+rand.Uint32()%1e9), 36)[1:5]
    +       opts := "dio,xino=/dev/shm/aufs." + rnd

(rnd is four random characters in the 0-9a-z range)
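For reference, a self-contained version of that snippet (one caveat:
math/rand's global source must be seeded, otherwise rand.Uint32()
returns the same sequence on every run and the suffix repeats across
daemon restarts):

    package main

    import (
        "fmt"
        "math/rand"
        "strconv"
        "time"
    )

    func main() {
        // Seed the global source; without this the suffix would be
        // identical on every run.
        rand.Seed(time.Now().UnixNano())
        // 1e9+rand.Uint32()%1e9 always has 6 digits in base 36;
        // [1:5] keeps 4 of them, all in the 0-9a-z range.
        rnd := strconv.FormatInt(int64(1e9+rand.Uint32()%1e9), 36)[1:5]
        opts := "dio,xino=/dev/shm/aufs." + rnd
        fmt.Println(opts)
    }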
References
1. mailto:[email protected]
2. http://aufs.sourceforge.net/aufs2/brsync/README.txt