On Fri, 10 Dec 2010 16:28:22 +0900
Carsten Haitzler (The Rasterman) <ras...@rasterman.com> wrote:

> On Fri, 10 Dec 2010 01:30:33 -0500 Mike Blumenkrantz <m...@zentific.com> said:
> 
> > On Fri, 10 Dec 2010 14:56:34 +0900
> > Carsten Haitzler (The Rasterman) <ras...@rasterman.com> wrote:
> > 
> > > On Thu, 09 Dec 2010 13:36:14 +0900 Mike McCormack
> > > <mj.mccorm...@samsung.com> said:
> > > 
> > > > On 12/09/2010 11:57 AM, Mike Blumenkrantz wrote:
> > > > 
> > > > >>                _ecore_main_loop_shutdown();
> > > > >>                _ecore_main_loop_init();
> > > > >>             }
> > > > 
> > > > > Okay, but the only place that _ecore_main_fdh_poll_del can be called
> > > > > from is ecore_main_fd_handler_del.  If a bad fd gets into epoll and
> > > > > then is selected on without being removed, it will just loop
> > > > > infinitely without ever being able to be removed, just like we're
> > > > > seeing in this bug.  Might be good to add shutdown+init inside
> > > > > bads_rem if no bads are found, yes?
> > > > 
> > > > Getting a stale fd in epoll is a bug.  It means somebody closed the fd
> > > > without removing it from ecore_main_loop, and that should be fixed.
> > > > 
> > > > Rather than adding more bandaids, it would be better to fix the root
> > > > cause of the problem.
> > > 
> > > indeed you are right. what would be RIGHT to do here is like the rest of
> > > EFL and where it has magic number checks. this is similar. when it
> > > encounters such a bad fd ecore could/should blurt out an ugly warning and
> > > if ECORE_ERROR_ABORT is set in the env call abort instantly there - thus
> > > providing easy debugging. it CAN now clean up and march on after a
> > > rebuild. it gave the developer a good chance to fix things. at least it
> > > is robust enough to march on with no side effects from that point.
> > I agree that this would be a nice feature.
> > > zmike: you may want to change your proposal to be like the above. i'd be
> > > more comfortable with it that way. it solves the "bad userspace code"
> > > problem as well as providing good debuggability (mike's point).
> > Again, the problem here is that making ecore just die when this happens IN
> > NO WAY HELPS A DEVELOPER FIX THE BUG.  You have no way of knowing which
> 
> you can abort with a print of the fd # and look in /proc/PID/fd/ for it :)
> maybe match it up to debugging you have for things that add fd handlers...
> etc. - the abort is only done IF u set the env var - if u dont it does what u
> suggest. an abort means app stops and developer really does need to fix it.
> thats the point of them :)
> 
> > app/module/etc has caused the behavior to occur (in the case of e17, for
> > example), and so your only recourse is to search through literally anything
> > that uses fds with ecore.  This is not the correct way to fix it, as it only
> > creates more headaches.
> > 
> > If there were some way to make it abort AND log which fd has been improperly
> > closed, that would be the only way of doing it so that a developer can debug
> > it. This could be done by making ecore_main_fd_handler_add into a macro
> > which uses __PRETTY_FUNCTION__ (or similar based on arch/compiler) to
> > supply the actual function name along with the fd to avoid needing to
> > rewrite any code, but this would likely introduce other bugs?  Best way I
> > can think of off the top of my head though.
> > 
> > Either way, this is a bug which MUST be fixed before ecore 1.0 since it can
> > break everything running on the machine when it occurs due to the insane
> > rate at which log files grow.
> > -- 
> > Mike Blumenkrantz
> > Zentific: We run the three-legged race individually.
> > 
> 
> 
The problem, again, is that we're dealing with a very rare bug.  Unless you're
suggesting that we run all apps in gdb all the time to catch this until it gets
fixed, making apps SIGABRT will not be helpful to devs in the slightest.

As an example, I've already verified that this occurs in e itself on a number of
computers with totally different setups.  At best, we catch SIGABRT from your
proposed "fix".  Then what?  We've got a SIGABRT from a function which gives us
nothing but the epoll fd.  We don't know which fd in epoll is bad, we have no
way of getting it; we just know that there's a bad fd.  Looking
through /proc/$PID/fd just tells us what the fd links to (unless it's a
socket), and our best option if it isn't an obvious file is to try reading
it.  Assuming that the fd contains nothing but binary data, which is likely
given that almost every fd e opens is pipe/socket on the systems I've seen,
we've gained nothing.  The only thing that this has helped is the aborting of
the program, which is stupid because what we really need is just to restart it.

I'm all for fixing this in the "correct" way, but I don't think your idea is
the best one, and I think it's probably the hardest possible way to try and
debug with (assuming that it's even possible to debug it this way).
-- 
Mike Blumenkrantz
Zentific: We run the three-legged race individually.

------------------------------------------------------------------------------
_______________________________________________
enlightenment-devel mailing list
enlightenment-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/enlightenment-devel

Reply via email to