Hi Zoltán, it's been a while!

On Fri, 2018 Jun 22 05:55+0200, Zoltán Herczeg wrote:
> Hi,
>
> to tell the truth, when the serialization was created the use case we
> were discussing was different from the use case below.
>
> I consider serialized forms inherently unsecure. I would never
> recommend to accept any regexes in binary forms for any application.
> Instead, I would recommend to distribute patterns in text form, then
> the application pre-compiles them and store them in a secure way. The
> application can also store both the text and binary forms, and after
> any regex engine changes, pre-compile the patterns again.

I can understand the security implications of loading serialized
regexes, but beyond validation of the input, and recommendations on how
to use this feature, there's not much more we (PCRE) can do about that.
For my part, all I can say is... I'm a big boy, I can handle it :-)

I see the approach you are suggesting here; e.g. an application compiles
a regex on the first run, and caches the serialized form in
/var/cache/foo/ for later use. Anytime the format changes, it
re-compiles and re-caches same.

In my use case, however, the application has binary data files
[containing serialized regexes] under /usr/share/foo/, and no provision
is available to cache under /var/, nor any other writable disk location.
PCRE2 can be updated at any time due to security vulnerabilities, but
the application's data files are tied to release cycles that take the
better part of a year to complete.

> While this requires more disk space, it is usually less of an
> issue than the security implications of distributing regexes in
> binary forms.

Disk space is not the concern here, but the non-trivial amount of time
it can take to (re-)compile a large regex.

> One option could be versioning serialized regexes. In another project
> (JerryScript) we use versioning for snapshots (serialized form of
> JavaScript code), and the version number grows after any change that
> affects snapshots. It is not a high burden, but it is easy to forget
> in my experiences, especially for people newly joined to the project.
> We have never went beyond that, supporting two snapshot formats in one
> engine sounds like too much burden. Writing conversion tools also.

When you say "two snapshot formats," do you mean two formats that are
completely different, or two formats that are identical but for one or
two newly-added features? Straight versioning doesn't exactly
distinguish between these two scenarios, which is why I'd imagine you'd
want more of a modular PNG-like chunked format for this.

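To make the chunked idea a bit more concrete, here is a minimal sketch
in C of how a reader could tell "a new, incompatible format" apart from
"the same format plus an optional extra." Everything in it is an
assumption for illustration only: the chunk layout (4-byte tag, 4-byte
big-endian length, payload), the tags "CODE" and "name", and the
function and enum names are hypothetical, and none of this reflects
PCRE2's actual serialization format or API. It borrows PNG's convention
that an uppercase first letter marks a chunk the reader must understand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result codes for this sketch; not part of PCRE2. */
enum scan_result {
    SCAN_OK = 0,
    SCAN_TRUNCATED = -1,
    SCAN_UNKNOWN_CRITICAL = -2
};

/* Read a 32-bit big-endian value byte by byte, so the host's
 * endianness does not affect the on-disk format. */
static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Tags this (hypothetical) reader understands. */
static int tag_is_known(const uint8_t *tag)
{
    return memcmp(tag, "CODE", 4) == 0 || memcmp(tag, "name", 4) == 0;
}

/* PNG-style convention: uppercase first letter means "critical". */
static int tag_is_critical(const uint8_t *tag)
{
    return tag[0] >= 'A' && tag[0] <= 'Z';
}

/* Walk a buffer of chunks. Known chunks are counted in *understood,
 * unknown optional chunks are skipped and counted in *skipped, and an
 * unknown critical chunk aborts the scan. */
static int scan_chunks(const uint8_t *buf, size_t len,
                       int *understood, int *skipped)
{
    size_t pos = 0;

    assert(buf != NULL && understood != NULL && skipped != NULL);
    *understood = 0;
    *skipped = 0;

    while (pos < len) {
        const uint8_t *tag;
        uint32_t payload_len;

        if (len - pos < 8)
            return SCAN_TRUNCATED;          /* no room for a header */
        tag = buf + pos;
        payload_len = read_be32(buf + pos + 4);
        pos += 8;
        if (payload_len > len - pos)
            return SCAN_TRUNCATED;          /* payload runs off the end */

        if (tag_is_known(tag))
            ++*understood;
        else if (tag_is_critical(tag))
            return SCAN_UNKNOWN_CRITICAL;   /* genuinely different format */
        else
            ++*skipped;                     /* newer optional feature */

        pos += payload_len;
    }
    return SCAN_OK;
}
```

With this layout, an older reader handed a file containing an extra
lowercase-tagged chunk still loads it, while a file containing an
unfamiliar uppercase-tagged chunk is rejected outright. A single
version number cannot express that distinction.
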
In any event, as I wrote to Philip, the format used for serialization
should be independent of the in-memory representation, so that it is
minimally affected by the vagaries of ongoing engine development. That
way, it is less likely to need to change over time, which eases the
maintenance burden and improves the prospects for future compatibility.

--Daniel

P.S.: Please Cc: me on any replies, as I am not subscribed to this list.

--
Daniel Richard G. || [email protected]
My ASCII-art .sig got a bad case of Times New Roman.

--
## List details at https://lists.exim.org/mailman/listinfo/pcre-dev
