Re: [e-users] E crash with Nvidia

Carsten Haitzler Thu, 09 Sep 2021 16:38:14 -0700

On Fri, 10 Sep 2021 08:28:30 +0900 Florian Schaefer <list...@netego.de> said:


> On Thu, Sep 09, 2021 at 08:32:47AM +0100, Carsten Haitzler wrote:
> > On Thu, 9 Sep 2021 09:20:28 +0900 Florian Schaefer <list...@netego.de> said:
> > 
> > > On Wed, Sep 08, 2021 at 11:08:00AM +0100, Carsten Haitzler wrote:
> > > > On Wed, 8 Sep 2021 17:35:12 +0900 Florian Schaefer <list...@netego.de>
> > > > said:
> > > > 
> > > > > Seems to me to have been good last words this time. ;) So I am running
> > > > > this all day now and I think I did not have a segfault due to procstat
> > > > > so far. Thanks for the fixes and I like the new indicator icon. :)
> > > > > 
> > > > > That being said, I still had some crashes today and I am thinking that
> > > > > perhaps finally I might have something true to the topic of this
> > > > > thread. At least it crashes within libnvidia and I do not get an ASAN
> > > > > trace.
> > > > > 
> > > > > For what it's worth, I tried to record a trace as well as I can.
> > > > > 
> > > > > https://pastebin.com/p41b7GKW
> > > > > 
> > > > > This happens reproducibly when I change from X running E to the text
> > > > > console and then back to the graphics screen. (I did quite a lot of
> > > > > these switches lately for running gdb while E is still crashed.) When I
> > > > > have an "empty" E running it is fine. However, as soon as some window
> > > > > is open it reliably segfaults upon returning to X. Any ideas?
> > > > 
> > > > time to stop asan and use valgrind. that can at least say if the memory
> > > > nvidia is accessing is beyond some array e provided - the shader flush
> > > > basically has e provide a block of mem containing vertexes etc. for the
> > > > gpu to draw. this array is expanded as new triangles are added then
> > > > flushed to the gpu at some point during rendering. that might be the
> > > > only thing i can think of that might be an efl bug - we use a dud
> > > > pointer? but then you could figure this out from valgrind + gdb...
> > > > maybe. valgrind would see the errant pointer and perhaps tell you if it's
> > > > just beyond some other block of mem or if that block was freed recently etc.
> > > 
> > > So there are things that valgrind can do that asan cannot. More stuff to
> > > learn. :)
> > 
> > Yeah. Valgrind is effectively a cpu emulator. it runs every instruction of
> > the process on a synthetic cpu and while doing that tracks memory state. it
> > also traps malloc/free and so on and tracks what memory has been allocated
> > or freed down to the byte, if it has been written to or not etc. - doing all
> > of this it can see almost every memory issue. without debug symbols it can't
> > tell you more than "code in this library causes problem X", but with full
> > gdb debug info it can use the memory address to tell you the file, line
> > number, function name and so on too. This is why valgrind is slow. it's
> > emulating everything a process under valgrind does.
> > 
> > Asan has the compiler do the above instead. So when the compiler generates
> > the binary code for an application or library, it ADDS code that runs
> > natively that does tracking. This means that simple instructions that just
> > do add/sub/compare etc. just get generated as normal. instructions that
> > access memory get tracking code added like valgrind. this means only the
> > code that the compiler generates will get tracked (e.g. efl and
> > enlightenment), and other code that efl calls (stuff in libc, libjpeg,
> > opengl libs etc.) will not be. this is a major difference in design and
> > makes asan massively faster. it's actually usable day to day on a decently
> > fast machine. it does mean e uses a lot more memory as asan needs extra
> > memory in the process to do the tracking of every byte and its history and
> > it does need to execute more instructions whenever reading/writing to some
> > memory etc. ... but not all the code your cpu runs will have this extra
> > work: only memory accesses get it, and any libraries called that were not
> > built with asan will not do this extra work at all. thus - asan can't find
> > anything in a library you did not build with asan support. thus sometimes
> > you still have to pull out ye-olde valgrind. valgrind is an amazing tool.
> > it's just slow. if you seem to have issues in e/efl the first port of call
> > is to try asan. it's fast enough to run day to day and not very intrusive
> > in that you can rebuild efl+e and then just ctrl+alt+end to restart e and
> > presto - asan is on. as long as you have pre set-up a proper ASAN_OPTIONS
> > env var ... also i suggest you:
> > 
> > export EINA_FREEQ_TOTAL_MAX=0
> > export EINA_FREEQ_MEM_MAX=0
> > export EINA_FREEQ_FILL_MAX=0
> > 
> > as well. this may make e/efl a little more crashy and will also remove a
> > minor optimization (freeq is a ... free queue - it takes things that need
> > to be freed and adds them to a queue to free some time later. freeq will
> > collect things to free up until some limit. it will, when items are added
> > to the queue, fill their memory with some pattern like 0x555555 or 0x777777
> > etc. - or well up to the first N bytes of that memory object, and then when
> > it actually does the free later will check that that pattern still is
> > there. if it's not, something wrote to that memory that SHOULD have been
> > left alone as the object was queued to be freed - it can give you an
> > indication that something is wrong but not exactly where). as freeq waits
> > until the app is idle (has nothing to do but wait for input or things to
> > happen) it runs through the queue then freeing objects so avoiding the work
> > of the free until then. it's an efl self-check mechanism put in to hunt
> > down bugs and get a little optimization in return for the extra work it has
> > to do. by setting the above to zero you basically disable freeq and force
> > it to free immediately which is what you want for both valgrind and asan so
> > they detect the problems right. note efl knows when it runs under valgrind
> > and auto disables freeq on its own. but with asan, it does not.
> > 
> > i hope that helps explain the above (roughly - i glossed over a lot of
> > details to make it easier to explain in a short amount of time)
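
A rough sketch of the environment described above, for whatever shell or script ends up starting e. The EINA_FREEQ_* lines are the ones from the mail. The ASAN_OPTIONS values are only an assumption (commonly used AddressSanitizer runtime flags, not taken from this thread) and the log path is a placeholder.

```shell
# disable eina's freeq so frees happen immediately (values from the mail above)
export EINA_FREEQ_TOTAL_MAX=0
export EINA_FREEQ_MEM_MAX=0
export EINA_FREEQ_FILL_MAX=0

# ASSUMPTION: example AddressSanitizer runtime options, adjust to taste.
#   abort_on_error=1 -> abort() on the first report so gdb / core dumps catch it
#   detect_leaks=0   -> skip the leak report at exit
#   log_path=...     -> write reports to files instead of stderr (placeholder)
export ASAN_OPTIONS="abort_on_error=1:detect_leaks=0:log_path=${HOME:-/tmp}/e-asan.log"

# show what a freshly started enlightenment would inherit
env | grep -E '^(EINA_FREEQ_|ASAN_OPTIONS=)' | sort
```

These need to be in the environment before the session starts (for example near the top of .xinitrc), since a restarted e only inherits what its parent already had.
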
> 
> Ahm, yeah, thanks for the explanations. I wasn't expecting such a ...
> verbose ... reply. But it is appreciated. Even though I did probably not
> fully understand everything I now see that valgrind is more than meets
> the eye and that the same is true for eina. ;)
> 
> > > Anyway, I tried to follow the debugging instructions on E.org as well as
> > > I can (after having finally recompiled everything without asan, but
> > > leaving the debugging symbols in place).
> > > 
> > > Three observations:
> > > 
> > > 1. The valgrind option --db-attach seems to be deprecated since 2015 and
> > > is not available any more. So I just omitted this. I hope that's fine.
> > 
> > i know. :( you now need a separate shell running gdb to attach gdb to the
> > process then tell it to run. painful. :(
> > 
> > > 2. Then I tried to use the ".xinitrc-debug" method. Upon starting E the
> > > startup apparently went into an infinite loop, generating pages and
> > > pages of valgrind and E startup messages (a few valgrind messages with
> > > something-something exiting 0) and generating many 120MB core dumps. So
> > > I never got to the point where I would actually get anything but a black
> > > screen from X.
> > 
> > aaah with valgrind you probably want to bypass enlightenment_start - this
> > means any issue will drop you out of your login session but you will have a
> > chance to debug it. to avoid enlightenment_start do:
> > 
> > export E_START=1
> > valgrind --tool=memcheck ... enlightenment
> > 
> > 
> > FYI when i valgrind i do:
> > 
> > valgrind --suppressions=$HOME/.zsh/vgd.supp --tool=memcheck \
> >   --num-callers=64 --show-reachable=no --read-var-info=yes \
> >   --leak-check=yes --leak-resolution=high --undef-value-errors=yes \
> >   --track-origins=yes --vgdb-error=0 --vgdb=full --redzone-size=512 \
> >   --freelist-vol=100000000
> > 
> > :) the suppressions file is a file i keep to tell valgrind to ignore
> > specific known issues - e.g. a common optimization in libc or freetype or
> > something that it should just pretend is not an issue. you can drop that
> > option because you won't maintain that file and that file is highly system
> > specific.
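
To tie the two pieces above together (bypassing enlightenment_start, and attaching gdb from a separate shell now that --db-attach is gone), something like this could serve as a starting point. It is only a sketch: it assumes a valgrind that ships vgdb, and the function names, the E_VG_LOG variable and the log path are made up for illustration. With --vgdb-error=0 valgrind stops at startup and waits until a gdb connects.

```shell
# hypothetical helper: run enlightenment directly under memcheck.
# E_START=1 makes e skip the enlightenment_start wrapper (as described above).
run_e_under_valgrind() {
  E_START=1 valgrind --tool=memcheck --num-callers=64 \
    --track-origins=yes --vgdb=full --vgdb-error=0 \
    --log-file="${E_VG_LOG:-$HOME/e-valgrind.log}" \
    enlightenment
}

# hypothetical helper: from a second shell (text console or ssh), attach gdb
# to the waiting valgrind process, then type "continue" at the gdb prompt.
attach_gdb_to_valgrind() {
  gdb -ex 'target remote | vgdb' "$(command -v enlightenment)"
}
```

In an .xinitrc the first function would be called as the last command, and the second one run by hand from another terminal. If more than one valgrind process is alive, vgdb needs to be told which one via --pid=<pid>.
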
> 
> Hmm, this valgrind stuff is more difficult than I expected. First I was
> struggling to get the X server and enlightenment to start properly. I
> finally settled on just creating the .xinitrc and let the rest be sorted
> out with startx.
> 
> But then, again, if I just start enlightenment without valgrind it
> works. With valgrind enabled everything stops at a black screen and the
> only way to get a responsive interface again is to reboot the machine.
> 
> So here's what I do: https://pastebin.com/yzhy4gj1
> 
> The first part shows my .xinitrc. At the end you see two alternative
> exec commands. The one with valgrind causes everything to hang. The one
> without works just fine.
> 
> Even though with valgrind enabled I cannot really do anything at least
> there is still heaps of stuff in the logfile, so that output is also
> included. Many "lost bytes" (not really dangerous, right?) and an
> unhandled instruction in e_comp_x_randr.c. Hmmm.

unhandled instruction. that means your compiler is outputting instructions
valgrind does not know how to interpret. e.g. it is optimizing for a newer x86
instruction set. you might want to compile with -march=x86-64 in CFLAGS or
something similarly conservative. you also might want to avoid
--trace-children=yes if you are running enlightenment directly (avoiding
enlightenment_start).
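
As a sketch of what "very conservative" could look like in practice: on a 64-bit build, -march=x86-64 limits gcc/clang to the baseline instruction set. That exact flag, the optimization level and the build directory name are assumptions here, so pick whatever matches the machine. The build command is only echoed, not run.

```shell
# ASSUMPTION: 64-bit x86 build. -march=x86-64 restricts the compiler to the
# baseline instruction set; -O1 -g keeps valgrind/gdb traces readable.
export CFLAGS="-O1 -g -march=x86-64"
export CXXFLAGS="$CFLAGS"

# hypothetical build dir name. meson reads CFLAGS from the environment only
# when a build tree is first set up, so use a fresh directory.
BUILD_DIR="build-valgrind"
echo "meson setup $BUILD_DIR && ninja -C $BUILD_DIR"
```

Both efl and enlightenment would need rebuilding this way, since the unhandled instruction can come from either.
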


> Cheers
> Florian
> 
> > > 3. Then I tried it again, removing from the .xinitrc-debug script all
> > > options from valgrind but the --tool=memcheck one, thus being closer to
> > > the first example of using valgrind. This caused a complete lockup of my
> > > computer and my only rescue was a reboot via SysRq.
> > > 
> > > I guess I will have to try this again with a somewhat different
> > > approach...
> > > 
> > > Cheers,
> > > Florian
> > > 
> > > PS: Can I hijack this thread to quickly paste an eina trace I get all
> > > the time when opening everything? ;) https://pastebin.com/rvupgMcx
> 
> 
> _______________________________________________
> enlightenment-users mailing list
> enlightenment-users@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/enlightenment-users
> 


-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
Carsten Haitzler - ras...@rasterman.com



