reductionista opened a new pull request #560: URL: https://github.com/apache/madlib/pull/560
Here are a few improvements to how we process compile params, so that the code is more robust from a security perspective. I haven't found a specific security exploit for these issues, but I wouldn't be surprised if someone with enough time and dedication could find a way to gain admin access via them... especially if there were a minor update to either MADlib's deep learning library or keras which doesn't take into account these potential issues. (And at the very least, this fixes some odd error messages and crashes that can happen when strange parameters are passed.) Changes: 1. Exclude `deserialize`, `serialize`, `get`, and anything starting with `_` from the list of builtin losses and metrics we support. We were unintentionally allowing users to pass `keras.losses.deserialize`, `keras.losses._sys`, `keras.losses.get`, `keras.losses.__builtins__`, etc. as loss functions to keras, and the same for `keras.metrics.*` for metric functions. Most of what's in `keras.losses.*` are loss functions or classes deriving from `keras.losses.Loss`, and most of what's in `keras.metrics` are metric functions or classes deriving from `keras.metrics.Metric`, but there are also these extra classes intended for other purposes (not intended for use directly as metrics or losses). The `deserialize()` function is particularly dangerous, in part because they accept exactly 2 arguments which means, even if you just pass something as simple as `loss=deserialize`, it will be called by keras itself during evaluation as `deserialize(y_true, y_pred)`. And in part because the purpose of the `deserialize()` utility function is to take a string from the user and turn it into an instance of a callable class, which will then be called by keras. In other words, it's a way of allowing users to pass in their own custom functions, but in a way such that MADlib only sees it as a `str` to pass along (rather than an arbitrary callable python object, which we do not allow ordinary users to pass through to keras). The danger is significantly mitigated by the fact that we don't allow any characters after the function name to be passed, even as a string. For example, if the user were allowed to pass `loss=deserialize(..., ...)` then it would be fairly easy to gain admin access. But we only allow `loss=deserialize` which makes it far more difficult if not impossible to exploit. Nevertheless, this PR completely disables the use of such functions for losses or metrics, in case there is some way to trick keras into passing malicious arguments for y_true and y_pred. 2. Another potentially risky behavior was the input validator allowing the user to pass as a metric any string which contains the substring `top_k_categorical_accuracy`. This provides a loophole that circumvents the intention to only allow metric names which are either a builtin keras metric or something in the custom functions object table. For example, if the user passes `metrics=[system("echo allow all >> ${MASTER_DATA_DIRECTORY}/pg_hba.conf")] /* top_k_categorical_accuracy */, losses=...` for compile params, that will be considered a valid input by the input validator and passed along. Once again, this doesn't appear to be an immediate security issue, as later we run a `literal_eval()` on it, which will error out. But it's safest if we catch things like this up front during input validation. From what I can tell, explicitly whitelisting *top_k_categorical_acurracy* for a metric is unnecessary, since it is already a builtin keras function, and we don't allow any arguments to be passed to it anyway, it must be an exact match to a builtin or a custom function. 3. We were comparing the metrics parameter value substring `metrics[2:-2]` to the builtin and custom function names, which implicitly assumes that the first 2 characters are [' or [" and the last two are '] or "]. Handling this more carefully using a literal eval to extract the string inside the first element of the list prevents sneaky inputs like `metrics=[*__builtins__ ]`. (We already validate elsewhere that the full metrics parameter value is of type `list`.) ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org