On Tue, Feb 9, 2021 at 10:39 AM Yedidyah Bar David <d...@redhat.com> wrote:
>
> On Tue, Feb 9, 2021 at 8:55 AM Yedidyah Bar David <d...@redhat.com> wrote:
> >
> > On Mon, Feb 8, 2021 at 7:05 PM Nir Soffer <nsof...@redhat.com> wrote:
> > >
> > > On Mon, Feb 8, 2021 at 6:47 PM Asaf Rachmani <arach...@redhat.com> wrote:
> > > >
> > > >
> > > > I haven't checked it yet, but maybe this customer bug is related to
> > > > this discussion:
> > > > https://bugzilla.redhat.com/show_bug.cgi?id=1666553
> > >
> > > This bug is about the broker opening too many connections
> > > to vdsm:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1666553#c1
> > >
> > > If the broker keeps thousands of connections open, it is likely that
> > > vdsm will run out of fds.
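
(If it helps to verify that on an affected host: a rough, untested sketch
of counting a process' open connections from /proc - nothing
broker-specific, you just pass it the broker's pid, and it has to run as
root:)

    import os

    def count_socket_fds(pid):
        """Count the fds of a process that point at sockets."""
        fd_dir = "/proc/{}/fd".format(pid)
        count = 0
        for fd in os.listdir(fd_dir):
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                # The fd was closed between listdir() and readlink().
                continue
            if target.startswith("socket:"):
                count += 1
        return count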
> > >
> > > > From vdsm.log:
> > > > 2019-01-15 13:41:11,162+0000 INFO  (periodic/2) [vdsm.api] FINISH 
> > > > multipath_health return={} from=internal, 
> > > > task_id=97c359aa-002e-46d8-9fc5-2477db0909b4 (api:52)
> > > > 2019-01-15 13:41:12,210+0000 WARN  (vdsm.Scheduler) [Executor] Worker 
> > > > blocked: <Worker name=jsonrpc/0 running <Task <JsonRpcTask {'params': 
> > > > {}, 'jsonrpc': '2.0', 'method': u'Host.getCapabilities', 'id': 
> > > > u'74b9dc62-22b2-4698-9d84-6a71c4f29763'} at 0x7f71dc31b0d0> timeout=60, 
> > > > duration=60 at 0x7f71dc31b110> task#=33 at 0x7f722003c890>, traceback:
> > > > File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
> > > >   self.__bootstrap_inner()
> > > > File: "/usr/lib64/python2.7/threading.py", line 812, in 
> > > > __bootstrap_inner
> > > >   self.run()
> > > > File: "/usr/lib64/python2.7/threading.py", line 765, in run
> > > >   self.__target(*self.__args, **self.__kwargs)
> > > > File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", 
> > > > line 194, in run
> > > >   ret = func(*args, **kwargs)
> > > > File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in 
> > > > _run
> > > >   self._execute_task()
> > > > File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in 
> > > > _execute_task
> > > >   task()
> > > > File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in 
> > > > __call__
> > > >   self._callable()
> > > > File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 
> > > > 523, in __call__
> > > >   self._handler(self._ctx, self._req)
> > > > File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 
> > > > 566, in _serveRequest
> > > >   response = self._handle_request(req, ctx)
> > > > File: "/usr/lib/python2.7/site-packages/yajsonrpc/__init__.py", line 
> > > > 606, in _handle_request
> > > >   res = method(**params)
> > > > File: "/usr/lib/python2.7/site-packages/vdsm/rpc/Bridge.py", line 197, 
> > > > in _dynamicMethod
> > > >   result = fn(*methodArgs)
> > > > File: "<string>", line 2, in getCapabilities
> > > > File: "/usr/lib/python2.7/site-packages/vdsm/common/api.py", line 48, 
> > > > in method
> > > >   ret = func(*args, **kwargs)
> > > > File: "/usr/lib/python2.7/site-packages/vdsm/API.py", line 1337, in 
> > > > getCapabilities
> > > >   c = caps.get()
> > > > File: "/usr/lib/python2.7/site-packages/vdsm/host/caps.py", line 168, 
> > > > in get
> > > >   net_caps = supervdsm.getProxy().network_caps()
> > > > File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 
> > > > 55, in __call__
> > > >   return callMethod()
> > > > File: "/usr/lib/python2.7/site-packages/vdsm/common/supervdsm.py", line 
> > > > 53, in <lambda>
> > > >   **kwargs)
> > > > File: "<string>", line 2, in network_caps
> > > > File: "/usr/lib64/python2.7/multiprocessing/b", line 759, in _callmethod
> > > >   kind, result = conn.recv() (executor:363)
> > > >
> > > >
> > > >
> > > >
> > > > On Mon, Feb 8, 2021 at 5:52 PM Nir Soffer <nsof...@redhat.com> wrote:
> > > >>
> > > >> On Mon, Feb 8, 2021 at 1:22 PM Yedidyah Bar David <d...@redhat.com> 
> > > >> wrote:
> > > >> >
> > > >> > On Mon, Feb 8, 2021 at 9:05 AM Yedidyah Bar David <d...@redhat.com> 
> > > >> > wrote:
> > > >> > >
> > > >> > > Hi all,
> > > >> > >
> > > >> > > I ran a loop of [1] (from [2]). The loop succeeded for ~ 380
> > > >> > > iterations, then failed with 'Too many open files'. First failure 
> > > >> > > was:
> > > >> > >
> > > >> > > 2021-02-08 02:21:15,702+0100 ERROR (jsonrpc/4) [storage.HSM] Could 
> > > >> > > not
> > > >> > > connect to storageServer (hsm:2446)
> > > >> > > Traceback (most recent call last):
> > > >> > >   File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line
> > > >> > > 2443, in connectStorageServer
> > > >> > >     conObj.connect()
> > > >> > >   File 
> > > >> > > "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py",
> > > >> > > line 449, in connect
> > > >> > >     return self._mountCon.connect()
> > > >> > >   File 
> > > >> > > "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py",
> > > >> > > line 171, in connect
> > > >> > >     self._mount.mount(self.options, self._vfsType, 
> > > >> > > cgroup=self.CGROUP)
> > > >> > >   File "/usr/lib/python3.6/site-packages/vdsm/storage/mount.py", 
> > > >> > > line
> > > >> > > 210, in mount
> > > >> > >     cgroup=cgroup)
> > > >> > >   File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py",
> > > >> > > line 56, in __call__
> > > >> > >     return callMethod()
> > > >> > >   File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py",
> > > >> > > line 54, in <lambda>
> > > >> > >     **kwargs)
> > > >> > >   File "<string>", line 2, in mount
> > > >> > >   File "/usr/lib64/python3.6/multiprocessing/managers.py", line 
> > > >> > > 772,
> > > >> > > in _callmethod
> > > >> > >     raise convert_to_error(kind, result)
> > > >> > > OSError: [Errno 24] Too many open files
> > > >>
> > > >> Maybe we have an fd leak in supervdsmd?
> > > >>
> > > >> We know that there is a small memory leak in multiprocessing, but
> > > >> we are not aware of any fd leak.
> > > >>
> > > >> > > But obviously, once it did, it continued failing for this reason on
> > > >> > > many later operations.
> > > >>
> > > >> Smells like an fd leak.
> > > >>
> > > >> > > Is this considered a bug?
> > > >>
> > > >> Generally yes, but the question is whether this happens in
> > > >> real-world scenarios.
> > > >>
> > > >> > Do we actively try to prevent such cases?
> > > >>
> > > >> No, we don't have any code monitoring the number of open fds
> > > >> at runtime, nor system tests checking this.
> > > >>
> > > >> We do have a health monitor in vdsm:
> > > >> https://github.com/oVirt/vdsm/blob/master/lib/vdsm/health.py
> > > >>
> > > >> It could be useful for this monitor to also log the number of fds
> > > >> (e.g. ls -lh /proc/pid/fd).
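
(Something along these lines might be enough - just a rough sketch, not
the actual health.py API; the "pids" mapping is a placeholder for however
the monitor would find the daemons it watches, and it needs enough
privileges to read /proc/<pid>/fd:)

    import logging
    import os

    log = logging.getLogger("health")

    def count_open_fds(pid):
        # Every entry under /proc/<pid>/fd is one open file descriptor.
        return len(os.listdir("/proc/{}/fd".format(pid)))

    def log_fd_usage(pids):
        # pids: placeholder mapping of daemon name -> pid,
        # e.g. {"vdsm": 1234, "supervdsm": 5678}.
        for name, pid in sorted(pids.items()):
            log.info("%s (pid %d): %d open fds",
                     name, pid, count_open_fds(pid))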
> > > >>
> > > >> We don't have any monitor in supervdsm; it could be useful to add
> > > >> one. supervdsm itself is relatively simple, but the problem is that
> > > >> it runs possibly complex code from vdsm, so "safe" changes in vdsm
> > > >> can cause trouble when that code is run by supervdsm.
> > > >>
> > > >> > So should I open one and attach logs? Or can it be considered a
> > > >> > "corner case"?
> > > >>
> > > >> Yes, please open a bug, and include the info you have.
> >
> > Done, https://bugzilla.redhat.com/show_bug.cgi?id=1926589 .
> >
> > > >>
> > > >> Please include output of "ls -lh /proc/pid/fd" for both vdsm
> > > >> and supervdsm when you reproduce the issue, or during the
> > > >> long test if you cannot reproduce.
> > > >>
> > > >> > > Using vdsm-4.40.50.3-37.git7883b3b43.el8.x86_64 from
> > > >> > > ost-images-el8-he-installed-1-202102021144.x86_64 .
> > > >> > >
> > > >> > > I can also provide access to the machine(s) if needed, for now.
> > > >> >
> > > >> > Sorry, I have now cleaned this env. I can try to reproduce it if
> > > >> > there is interest.
> > > >>
> > > >> It will help if you can reproduce it.
> >
> > Trying now.
>
> It didn't fail yet (45 iterations finished), but it already seems like
> supervdsm is leaking. When I started, it had 30+ open files (32 on
> host-0, 35 on host-1); now it has 160+ (169, 166). I'm not sure how to
> see what's leaking. Almost all of them are pipes. I'll run strace on it
> for some time; hopefully this can help.
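
(To get an idea of what the extra fds are, something like this might
help - just a sketch; readlink only shows "pipe:[inode]" for pipes, so
matching the other end of each pipe would still need lsof, but it does
break the fds down by kind:)

    import collections
    import os

    def fd_breakdown(pid):
        """Group a process' open fds by what they point at."""
        fd_dir = "/proc/{}/fd".format(pid)
        kinds = collections.Counter()
        for fd in os.listdir(fd_dir):
            try:
                target = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue  # the fd went away while we were iterating
            if ":[" in target:
                # Anonymous objects look like "pipe:[12345]",
                # "socket:[67890]", "anon_inode:[eventpoll]" -
                # keep only the kind.
                target = target.split(":", 1)[0]
            kinds[target] += 1
        return kinds

Comparing the output for supervdsm right after a restart and again after
a few iterations should at least show which kind of fd is growing.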

I looked a bit more and added an attachment and some guesses to the bug.

Best regards,
-- 
Didi