Hi Kirill,

Thanks for your hints and your time! Unfortunately, I don't think lrandom is the cause of the crash. We have been using lrandom with threads for a couple of months on our other servers without any crash. I think Lua in HAProxy is executed in a single thread, so your analysis is correct, but this assumption never holds in the HAProxy environment: "Let assume that we called genrand_int32 at the same moment from two threads."
I suspect something is going on in SPOE or in the Lua scripts from an external vendor. I'll share more details as soon as I can confirm whether it is in SPOE or Lua.

On Fri, 25 Sep 2020 at 12:34, Kirill A. Korinsky <kir...@korins.ky> wrote:

> Good day,
>
> I'd like to share my two cents regarding this topic:
>
> > lrandom (PRNG for lua, we're using it for 2 or 3 years without any
> > problems, and soon we will drop it from our build)
>
> > Never heard of this last one, not that it would make it suspicious at
> > all, just that it might indicate you're having a slightly different
> > workload than most common ones and can help spotting directions where
> > to look for the problem.
>
> As far as I know, HAProxy has been using threads by default for some
> time, and I assume that Maciej's setup doesn't change anything and has
> threads enabled.
>
> If so, I believe that lrandom is the root cause of this issue.
>
> I've extracted a piece of code from lrandom and put it here:
> https://gist.github.com/catap/bf862cc0d289083fc1ccd38c905e2416
>
> You can see that the generator object contains N words (here N is 624),
> and I assume that Maciej's code doesn't create a new generator for each
> request but shares a single lrandom generator.
>
> The idea of this RNG is to initialize all N words via init_genrand, to
> check whether all of them have been used, and once they have, to
> generate new ones.
>
> Let assume that we called genrand_int32 at the same moment from two
> threads. If the conditions at lines 39 and 43 are true, both threads
> start to initialize the next words.
>
> You can see that we can easily move outside of the v array at line 21,
> because two threads are increasing the i field, and put some random
> number into the i field.
>
> And when the second thread gets to line 27, nobody knows where it puts
> 0xffffffff.
>
> Let me quote Willy Tarreau:
>
> > In the trace it's said that sw = 0xffffffff. Looking at all places where
> > h2s->recv_wait() is modified, it's either NULL or a valid pointer to some
> > structure. We could have imagined that for whatever reason h2s is wrong
> > here, but this call only happens when its state is still valid, and it
> > experiences double dereferences before landing here, which tends to
> > indicate that the h2s pointer is OK. Thus the only hypothesis I can have
> > for now is memory corruption :-/ That field would get overwritten with
> > (int)-1 for whatever reason, maybe a wrong cast somewhere, but it's not
> > as if we had many of these.
>
> Based on this, I believe that this is the case.
>
> How can it be proved / solved?
>
> I see a few possible options:
> 1. Switch off threads inside HAProxy
> 2. Use a dedicated lrandom generator per thread
> 3. Move away from lrandom
>
> As I understand it, lrandom is used here because it is very fast and
> secure, and reading from /dev/urandom isn't an option.
>
> Here I can suggest implementing the Yarrow PRNG (which is very simple to
> implement) with some pure-Lua cryptographic hash function.
>
> --
> wbr, Kirill
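
For reference, option 2 from the quoted message (a dedicated generator per thread) could look roughly like the C sketch below. This is not lrandom's code: the names mt_state, mt_seed, mt_next and thread_rand are hypothetical, the layout (624 words plus an index i) only mirrors what the message describes, and the seed passed in by the caller is an assumption. The assert spells out the invariant that two threads advancing a shared i without synchronization could break.

```c
#include <assert.h>
#include <stdint.h>

#define MT_N 624   /* number of state words, as in the quoted description */
#define MT_M 397   /* standard MT19937 offset */

/* Hypothetical generator object: N state words plus the index of the
 * next unused word. */
typedef struct {
    uint32_t v[MT_N];
    int i;
} mt_state;

/* Fill all N words from a seed and mark them as "all used", so the
 * first call to mt_next() regenerates the block. */
static void mt_seed(mt_state *s, uint32_t seed)
{
    s->v[0] = seed;
    for (int k = 1; k < MT_N; k++)
        s->v[k] = 1812433253u * (s->v[k - 1] ^ (s->v[k - 1] >> 30)) + (uint32_t)k;
    s->i = MT_N;
}

/* Return the next 32-bit output. Not safe to call on the same mt_state
 * from two threads: both would read and advance s->i unsynchronized. */
static uint32_t mt_next(mt_state *s)
{
    /* Invariant that a data race on s->i could violate. */
    assert(s->i >= 0 && s->i <= MT_N);

    if (s->i >= MT_N) {
        /* All words consumed: regenerate the whole block in place. */
        for (int k = 0; k < MT_N; k++) {
            uint32_t y = (s->v[k] & 0x80000000u) | (s->v[(k + 1) % MT_N] & 0x7fffffffu);
            s->v[k] = s->v[(k + MT_M) % MT_N] ^ (y >> 1) ^ ((y & 1u) ? 0x9908b0dfu : 0u);
        }
        s->i = 0;
    }

    uint32_t y = s->v[s->i++];
    /* Tempering step of MT19937. */
    y ^= y >> 11;
    y ^= (y << 7) & 0x9d2c5680u;
    y ^= (y << 15) & 0xefc60000u;
    y ^= y >> 18;
    return y;
}

/* One generator per thread (C11 thread-local storage), so no state is
 * shared between threads and no locking is needed. */
static _Thread_local mt_state tls_state;
static _Thread_local int tls_seeded;

/* The seed is only used on the first call made by each thread. */
uint32_t thread_rand(uint32_t seed)
{
    if (!tls_seeded) {
        mt_seed(&tls_state, seed);
        tls_seeded = 1;
    }
    return mt_next(&tls_state);
}
```

Each thread would call thread_rand() with its own seed, so no two threads ever touch the same v and i. Whether this changes anything for HAProxy depends on whether Lua code can really run concurrently there, which is the open question in this thread.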