Re: [e-users] E crash with Nvidia

Florian Schaefer Thu, 09 Sep 2021 16:29:58 -0700

On Thu, Sep 09, 2021 at 08:32:47AM +0100, Carsten Haitzler wrote:
> On Thu, 9 Sep 2021 09:20:28 +0900 Florian Schaefer <list...@netego.de> said:
> 
> > On Wed, Sep 08, 2021 at 11:08:00AM +0100, Carsten Haitzler wrote:
> > > On Wed, 8 Sep 2021 17:35:12 +0900 Florian Schaefer <list...@netego.de> 
> > > said:
> > > 
> > > > Seems to me to have been good last words this time. ;) So I am running
> > > > this all day now and I think I did not have a segfault due to procstat
> > > > so far. Thanks for the fixes and I like the new indicator icon. :)
> > > > 
> > > > That being said, I still had some crashes today and I am thinking that
> > > > perhaps finally I might have something true to the topic of this thread.
> > > > At least it crashes within libnvidia and I do not get an ASAN trace.
> > > > 
> > > > > For what it's worth, I tried to record a trace as well as I can.
> > > > 
> > > > https://pastebin.com/p41b7GKW
> > > > 
> > > > This happens reproducibly when I change from X running E to the text
> > > > console and then back to the graphics screen. (I did quite a lot of
> > > > > these switches lately for running gdb while E is still crashed.) When I
> > > > have an "empty" E running it is fine. However, as soon as some window is
> > > > open it reliably segfaults upon returning to X. Any ideas?
> > > 
> > > > time to stop asan and use valgrind. that can at least say if the memory
> > > > nvidia is accessing is beyond some array e provided - the shader flush
> > > > basically has e provide a block of mem containing vertexes etc. for the
> > > > gpu to draw. this array is expanded as new triangles are added then
> > > > flushed to the gpu at some point during rendering. that might be the
> > > > only thing i can think of that might be an efl bug - we use a dud
> > > > pointer? but then you could figure this out from valgrind + gdb...
> > > > maybe. valgrind would see the errant pointer and perhaps if it's just
> > > > beyond some other block of mem or if that block was freed recently etc.
> > 
> > > So there are things that valgrind can do that asan cannot. More stuff to
> > > learn. :)
> 
> > Yeah. Valgrind is actually a cpu interpreter. it literally interprets every
> > instruction and while doing that tracks memory state. it also traps
> > malloc/free and so on too and tracks what memory has been allocated, freed
> > down to the byte, if it has been written to or not etc. - doing all of this
> > it can see every issue. it may have no DEBUG to tell you more than "code in
> > this library causes problem X", or with full gdb debug it can use that
> > memory address to tell you the file, line number, function name and so on
> > too. This is why valgrind is slow. it's literally interpreting everything a
> > process under valgrind does.
> 
> > Asan has the compiler do the above instead. So when the compiler generates
> > the binary code for an application or library, it ADDS code that runs
> > natively that does tracking. This means that simple instructions that just
> > do add/sub/compare etc. just get generated as normal. instructions that
> > access memory get tracking code added like valgrind. this means only the
> > code that the compiler generates will get tracked (e.g. efl and
> > enlightenment), and other code that efl calls (stuff in libc, libjpeg,
> > opengl libs etc.) will not be. this is a major difference in design and
> > makes asan massively faster. it's actually usable day to day on a decently
> > fast machine. it does mean e uses a lot more memory as asan needs extra
> > memory in the process to do the tracking of every byte and its history and
> > it does need to execute more instructions whenever reading/writing to some
> > memory etc. ... but not all the code your cpu runs will have this extra
> > work because it's only these actions and any libraries called that are not
> > built with asan will also not do this extra work. thus - asan can't find
> > anything in a library you did not build with asan support. thus sometimes
> > you still have to pull out ye-olde valgrind. valgrind is an amazing tool.
> > it's just slow. if you seem to have issues in e/efl the first port of call
> > is to try asan. it's fast enough to run day to day and not very intrusive
> > in that you can rebuild efl+e and then just ctrl+alt+end to restart e and
> > presto - asan is on. as long as you have pre set-up a proper ASAN_OPTIONS
> > env var ... also i suggest you:
> 
> export EINA_FREEQ_TOTAL_MAX=0
> export EINA_FREEQ_MEM_MAX=0
> export EINA_FREEQ_FILL_MAX=0
> 
> > as well. this may make e/efl a little more crashy and will also remove a
> > minor optimization (freeq is a ... free queue - it takes things that need to
> > be freed and adds them to a queue to free some time later - freeq will
> > collect things to free up until some limit. it will, when items are added to
> > the queue, fill their memory with some pattern like 0x555555 or 0x777777
> > etc. - or well up to the first N bytes of that memory object, and then when
> > it actually does the free later will check that that pattern still is
> > there. if it's not, something wrote to that memory that SHOULD have been
> > left alone as the object was queued to be freed - it can give you an
> > indication that something is wrong but not exactly where). as freeq waits
> > until the app is idle (has nothing to do but wait for input or things to
> > happen) it runs through the queue then freeing objects so avoiding the work
> > of the free until then. it's an efl self-check mechanism put in to hunt down
> > bugs and get a little optimization in return for the extra work it has to
> > do. by setting the above to zero you basically disable freeq and force it to
> > free immediately which is what you want for both valgrind and asan so they
> > detect the problems right. note efl knows when it runs under valgrind and
> > auto disables freeq on its own. but with asan, it does not.
> 
> i hope that helps explain the above (roughly - i glossed over a lot of details
> to make it easier to explain in a short amount of time)
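
A minimal sketch of the environment setup described above, for a session
where efl and e were built with asan. The three EINA_FREEQ_* variables are
taken from the message; the ASAN_OPTIONS values and the log path are
assumptions picked for illustration, not settings recommended in this thread.

```shell
# Disable the eina free queue so frees happen immediately
# (values taken from the message above).
export EINA_FREEQ_TOTAL_MAX=0
export EINA_FREEQ_MEM_MAX=0
export EINA_FREEQ_FILL_MAX=0

# Assumed ASan runtime options, adjust as needed:
#   detect_leaks=0    skip the leak report at exit
#   abort_on_error=1  call abort() on an error so gdb or a core dump can catch it
#   log_path=...      write reports to <log_path>.<pid> instead of stderr
export ASAN_OPTIONS="detect_leaks=0:abort_on_error=1:log_path=/tmp/e-asan.log"
```

These lines would typically go into ~/.xinitrc (or wherever the session
environment is set up) before enlightenment is started.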


Ahm, yeah, thanks for the explanations. I wasn't expecting such a ...
verbose ... reply. But it is appreciated. Even though I probably did not
fully understand everything I now see that valgrind is more than meets
the eye and that the same is true for eina. ;)

> > > Anyway, I tried to follow the debugging instructions on E.org as well as
> > I can (after having finally recompiled everything without asan, but
> > leaving the debugging symbols in place).
> > 
> > Three observations:
> > 
> > 1. The valgrind option --db-attach seems to be deprecated since 2015 and
> > > is not available any more. So I just omitted this. I hope that's fine.
> 
> i know. :( you now need a separate shell running gdb to attach gdb to the
> process then tell it to run. painful. :(
> 
> > 2. Then I tried to use the ".xinitrc-debug" method. Upon starting E the
> > startup apparently went into an infinite loop, generating pages and
> > pages of valgrind and E startup messages (a few valgrind messages with
> > something-something exiting 0) and generating many 120MB core dumps. So
> > I never got to the point where I would actually get anything but a black
> > screen from X.
> 
> > aaah with valgrind you want to probably bypass enlightenment_start - this
> > means any issue will drop you out of your login session but you will have a
> > chance to debug it. to avoid enlightenment_start do:
> 
> export E_START=1
> valgrind --tool=memcheck ... enlightenment
> 
> 
> FYI when i valgrind i do:
> 
> > valgrind --suppressions=$HOME/.zsh/vgd.supp --tool=memcheck --num-callers=64 \
> >   --show-reachable=no --read-var-info=yes --leak-check=yes \
> >   --leak-resolution=high --undef-value-errors=yes --track-origins=yes \
> >   --vgdb-error=0 --vgdb=full --redzone-size=512 --freelist-vol=100000000
> 
> > :) the suppressions file is a file i keep to tell valgrind to ignore a given
> > issue - e.g. it's a common optimization in libc or freetype or something
> > that it should just pretend is not an issue. you can drop that option
> > because you won't maintain that file and that file is highly system
> > specific.
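
The two-shell workflow described above (valgrind in the X session, gdb
attached from a separate shell) could look roughly like the sketch below. The
function names and the log file location are hypothetical; the valgrind flags
are a subset of the ones quoted above, plus --log-file. Note that with
--vgdb-error=0 valgrind pauses before the program executes and waits for gdb
to connect, so nothing is drawn until gdb attaches and continues.

```shell
# Hypothetical helper names; only E_START and the valgrind flags
# (except --log-file) come from the thread.

# Run inside the X session, e.g. as the last command of ~/.xinitrc.
run_e_under_valgrind() {
    # Bypass the enlightenment_start wrapper, as suggested above.
    export E_START=1
    # --vgdb-error=0: stop before the program starts and wait for gdb.
    # The --log-file path is an assumption.
    valgrind --tool=memcheck --num-callers=64 \
        --leak-check=yes --track-origins=yes \
        --vgdb=full --vgdb-error=0 \
        --log-file="$HOME/e-valgrind.log" \
        enlightenment
}

# Run from a second shell (text console or ssh) once valgrind is waiting.
attach_gdb_to_valgrind() {
    # "target remote | vgdb" connects gdb to valgrind's embedded gdbserver;
    # type "continue" at the gdb prompt to let enlightenment start.
    gdb -ex 'target remote | vgdb' "$(command -v enlightenment)"
}
```

If more than one process is running under valgrind, vgdb has to be told which
one to talk to with its --pid option.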

Hmm, this valgrind stuff is more difficult than I expected. First I was
struggling to get the X server and enlightenment to start properly. I
finally settled on just creating the .xinitrc and letting the rest be
sorted out with startx.

But then, again, if I just start enlightenment without valgrind it
works. With valgrind enabled everything stops at a black screen and the
only way to get a responsive interface again is to reboot the machine.

So here's what I do: https://pastebin.com/yzhy4gj1

The first part shows my .xinitrc. At the end you see two alternative
exec commands. The one with valgrind causes everything to hang. The one
without works just fine.

Even though with valgrind enabled I cannot really do anything, at least
there are still heaps of stuff in the logfile, so that output is also
included. Many "lost bytes" (not really dangerous, right?) and an
unhandled instruction in e_comp_x_randr.c. Hmmm.

Cheers
Florian

> > 3. Then I tried it again, removing from the .xinitrc-debug script all
> > options from valgrind but the --tool=memcheck one, thus being closer to
> > the first example of using valgrind. This caused a complete lockup of my
> > computer and my only rescue was a reboot via SysRq.
> > 
> > I guess I will have to try this again with a somewhat different
> > approach...
> > 
> > Cheers,
> > Florian
> > 
> > PS: Can I hijack this thread to quickly paste an eina trace I get all
> > > the time when opening everything? ;) https://pastebin.com/rvupgMcx


_______________________________________________
enlightenment-users mailing list
enlightenment-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/enlightenment-users
