Re: [2.0.17] crash with coredump

Maciej Zdeb Fri, 25 Sep 2020 05:40:42 -0700

Hi Kirill,

Thanks for your hints and time! Unfortunately, I don't think lrandom is the
cause of the crash. We've been using lrandom with threads for a couple of
months on our other servers without any crash. I think Lua in HAProxy is
executed in a single thread, so your analysis is correct, but in the HAProxy
environment this assumption is never true: "Let's assume that we called
genrand_int32 at the same moment from two threads."


I suspect something is going on in SPOE or in the Lua scripts from an external
vendor. I'll share more details as soon as I confirm whether it is in SPOE or
Lua.


On Fri, 25 Sep 2020 at 12:34, Kirill A. Korinsky <kir...@korins.ky> wrote:

> Good day,
>
> I'd like to share my two cents regarding this topic:
>
> lrandom (PRNG for lua, we're using it for 2 or 3 years without any
> problems, and soon we will drop it from our build)
>
>
> Never heard of this last one, not that it would make it suspicious at
> all, just that it might indicate you're having a slightly different
> workload than most common ones and can help spotting directions where
> to look for the problem.
>
>
>
> As far as I know, HAProxy has been using threads by default for some time,
> and I assume that Maciej's setup doesn't change that, so it has threads
> enabled.
>
> If so I believe that lrandom is the root cause of this issue.
>
> I've extracted a piece of code from lrandom and put it here:
> https://gist.github.com/catap/bf862cc0d289083fc1ccd38c905e2416
>
> You can see that the generator object contains N words (here N is 624),
> and I'm assuming that Maciej's code doesn't create a new generator
> for each request but shares a single lrandom generator.
>
> The idea of this RNG is to initialize all N words via init_genrand, then
> check whether all of them have been used, and once they have, generate new
> ones.
>
> Let's assume that we called genrand_int32 at the same moment from two
> threads. If the conditions at lines 39 and 43 are true, we start to
> initialize the next words in both threads.
>
> You can see that we can easily move outside of the v array at line 21
> because two threads are increasing the i field, and put some random number
> into the i field.
>
> And when the second thread gets to line 27, nobody knows where it
> puts 0xffffffff.
>
> Let me quote Willy Tarreau:
>
> In the trace it's said that sw = 0xffffffff. Looking at all places where
> h2s->recv_wait() is modified, it's either NULL or a valid pointer to some
> structure. We could have imagined that for whatever reason h2s is wrong
> here, but this call only happens when its state is still valid, and it
> experiences double dereferences before landing here, which tends to
> indicate that the h2s pointer is OK. Thus the only hypothesis I can have
> for now is memory corruption :-/ That field would get overwritten with
> (int)-1 for whatever reason, maybe a wrong cast somewhere, but it's not
> as if we had many of these.
>
>
> and based on this I believe that this is the case.
>
> How can it be proved / solved?
>
> I see a few possible options:
> 1. Switch off threads inside haproxy
> 2. Use dedicated lrandom per thread
> 3. Move away from lrandom
>
> As I understand it, lrandom is used here because it is very fast and secure,
> and reading from /dev/urandom isn't an option.
>
> Here I can suggest implementing the Yarrow PRNG (which is very simple to
> implement) with some pure-Lua cryptographic hash function.
>
> --
> wbr, Kirill
>
>
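
For anyone following the thread, here is a minimal sketch of option 2 from the
quoted list (a dedicated generator per thread). It is not the lrandom source
and not the code from the gist: the names mt_state, mt_seed, mt_refill,
mt_next and per_thread_rand are made up for illustration, and the layout (N
state words plus a cursor i) is only assumed to resemble what the quoted
analysis describes. The generator itself is a plain MT19937. If one mt_state
were shared between threads, the check of i and the later use of i in mt_next
would not be atomic, so a second thread could advance i in between; keeping
the state in C11 thread-local storage sidesteps that without a lock.

```c
#include <assert.h>
#include <stdint.h>

#define N 624
#define M 397

/* Hypothetical stand-in for the generator object: N state words and a
 * cursor i. Illustrative names only, not taken from lrandom. */
typedef struct {
    uint32_t v[N];
    int i;
} mt_state;

/* Standard MT19937 seeding of all N words. */
static void mt_seed(mt_state *s, uint32_t seed)
{
    s->v[0] = seed;
    for (int k = 1; k < N; k++)
        s->v[k] = 1812433253u * (s->v[k - 1] ^ (s->v[k - 1] >> 30))
                  + (uint32_t)k;
    s->i = N; /* all words "used": force a refill on the first draw */
}

/* Regenerate all N words in place once they have been consumed. */
static void mt_refill(mt_state *s)
{
    for (int k = 0; k < N; k++) {
        uint32_t y = (s->v[k] & 0x80000000u)
                   | (s->v[(k + 1) % N] & 0x7fffffffu);
        s->v[k] = s->v[(k + M) % N] ^ (y >> 1)
                ^ ((y & 1u) ? 0x9908b0dfu : 0u);
    }
    s->i = 0;
}

/* Draw one 32-bit value. With a shared state, the check of i and the
 * indexed read below are not atomic, which is the kind of window the
 * quoted analysis is about. */
static uint32_t mt_next(mt_state *s)
{
    if (s->i >= N)
        mt_refill(s);
    uint32_t y = s->v[s->i++];
    y ^= y >> 11;
    y ^= (y << 7) & 0x9d2c5680u;
    y ^= (y << 15) & 0xefc60000u;
    y ^= y >> 18;
    return y;
}

/* One private generator per thread (C11 _Thread_local): no two threads
 * ever touch the same v[] or i. The seed is only used on the first call
 * made by each thread. */
static _Thread_local mt_state tls_state;
static _Thread_local int tls_seeded;

uint32_t per_thread_rand(uint32_t seed)
{
    if (!tls_seeded) {
        mt_seed(&tls_state, seed);
        tls_seeded = 1;
    }
    return mt_next(&tls_state);
}
```

Each thread would call per_thread_rand() with its own seed. Whether this
matters at all depends on whether Lua code can really enter the generator
from two threads at once, which is exactly the point disputed above.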
