Good day,

I'd like to share my two cents on this topic:

>> lrandom (PRNG for lua, we're using it for 2 or 3 years without any
>> problems, and soon we will drop it from our build)
> 
> Never heard of this last one, not that it would make it suspicious at
> all, just that it might indicate you're having a slightly different
> workload than most common ones and can help spotting directions where
> to look for the problem.


As far as I know, HAProxy has enabled threads by default for some time, and I 
assume that Maciej's setup doesn't change that, so it runs with threads enabled.

If so, I believe that lrandom is the root cause of this issue.

I've extracted a piece of code from lrandom and put it here: 
https://gist.github.com/catap/bf862cc0d289083fc1ccd38c905e2416

You can see that the generator object contains N words (here N is 624). I 
assume that Maciej's code doesn't create a new generator for each request and 
instead shares a single lrandom generator.

The idea of this RNG is to initialize all N words via init_genrand, track how 
many of them have been used, and generate new ones once all of them are used up.

Let's assume that genrand_int32 is called at the same moment from two threads. 
If the conditions at lines 39 and 43 are true, both threads start to initialize 
the next words.

You can see that we can easily run past the end of the v array at line 21, 
because both threads are incrementing the i field, which can leave an 
arbitrary value in the i field.

And when the second thread reaches line 27, nobody knows where it writes 
0xffffffff.
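
To make the interleaving concrete, here is a minimal single-threaded sketch 
that replays it step by step. It is not the code from the gist: the names 
generator, loop_check, loop_body and simulate_race are hypothetical, and it 
only assumes what is described above, namely that the refill loop uses the 
shared i field as its index. The out-of-bounds write is recorded rather than 
performed, so the sketch itself stays well defined.

```c
#include <assert.h>

#define N 624

/* Assumed layout, following the description above: N words plus an index. */
typedef struct {
    unsigned long v[N];
    int i;                      /* shared by every caller of the generator */
} generator;

/* First half of a loop iteration: the bounds check. */
static int loop_check(const generator *g)
{
    return g->i < N;
}

/* Second half: the write and the increment. Returns the index it used.
 * The store is only performed while in bounds; a real race would not
 * have this guard and would write past the end of v. */
static int loop_body(generator *g)
{
    int idx = g->i;
    if (idx < N)
        g->v[idx] = (unsigned long)idx;
    g->i = idx + 1;
    return idx;
}

/* Replays one unlucky schedule of two threads, A and B. */
static int simulate_race(void)
{
    generator g = { .i = N - 1 };
    int a_ok = loop_check(&g);  /* A sees i < N */
    int b_ok = loop_check(&g);  /* B sees i < N too, before A has written */
    int last = -1;

    assert(a_ok && b_ok);       /* both threads passed the check */
    if (a_ok)
        last = loop_body(&g);   /* A uses index N-1, i becomes N */
    if (b_ok)
        last = loop_body(&g);   /* B uses index N: one past the end of v */
    return last;
}
```

In this schedule the second thread ends up using index N, which is outside 
the array. With real threads the target of that write depends on whatever 
happens to be stored next to the generator in memory.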

Let me quote Willy Tarreau:

> In the trace it's said that sw = 0xffffffff. Looking at all places where
> h2s->recv_wait() is modified, it's either NULL or a valid pointer to some
> structure. We could have imagined that for whatever reason h2s is wrong
> here, but this call only happens when its state is still valid, and it
> experiences double dereferences before landing here, which tends to
> indicate that the h2s pointer is OK. Thus the only hypothesis I can have
> for now is memory corruption :-/ That field would get overwritten with
> (int)-1 for whatever reason, maybe a wrong cast somewhere, but it's not
> as if we had many of these.

Based on this, I believe that this is the case here.

How can it be proved / solved?

I see a few possible options:
1. Switch off threads inside haproxy
2. Use dedicated lrandom per thread
3. Move away from lrandom
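
Option 2 could look roughly like the sketch below. It is a plain C 
illustration, not lrandom's code and not a HAProxy API: it keeps one MT19937 
state per thread via C11 _Thread_local, so no two threads ever touch the same 
v array or i field. The names mt_state, mt_seed, mt_next and thread_random 
are hypothetical.

```c
#include <stdint.h>

#define MT_N 624
#define MT_M 397

typedef struct {
    uint32_t v[MT_N];           /* state vector */
    int i;                      /* next unused word */
    int seeded;                 /* 0 until this thread has seeded its state */
} mt_state;

/* One private generator per thread, zero-initialized at thread start. */
static _Thread_local mt_state tls_gen;

static void mt_seed(mt_state *s, uint32_t seed)
{
    s->v[0] = seed;
    for (int k = 1; k < MT_N; k++)
        s->v[k] = 1812433253u * (s->v[k - 1] ^ (s->v[k - 1] >> 30))
                  + (uint32_t)k;
    s->i = MT_N;                /* force a refill on first use */
    s->seeded = 1;
}

static uint32_t mt_next(mt_state *s)
{
    if (s->i >= MT_N) {         /* all words used: regenerate the block */
        for (int k = 0; k < MT_N; k++) {
            uint32_t y = (s->v[k] & 0x80000000u)
                       | (s->v[(k + 1) % MT_N] & 0x7fffffffu);
            s->v[k] = s->v[(k + MT_M) % MT_N] ^ (y >> 1)
                    ^ ((y & 1u) ? 0x9908b0dfu : 0u);
        }
        s->i = 0;
    }
    uint32_t y = s->v[s->i++];
    y ^= y >> 11;               /* standard MT19937 tempering */
    y ^= (y << 7) & 0x9d2c5680u;
    y ^= (y << 15) & 0xefc60000u;
    y ^= y >> 18;
    return y;
}

/* Seeds the calling thread's generator on first use, then draws from it. */
static uint32_t thread_random(uint32_t seed_if_new)
{
    if (!tls_gen.seeded)
        mt_seed(&tls_gen, seed_if_new);
    return mt_next(&tls_gen);
}
```

Each thread should get a different seed, otherwise all threads produce the 
same sequence. The same idea applies on the Lua side: create one generator 
object per thread instead of sharing a global one.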

As I understand it, lrandom is used here because it is very fast and secure, 
and reading from /dev/urandom isn't an option.

Here I can suggest implementing the Yarrow PRNG (which is very simple to 
implement) with some cryptographic hash function written in pure Lua.

--
wbr, Kirill
