reductionista opened a new pull request #560:
URL: https://github.com/apache/madlib/pull/560


   Here are a few improvements to how we process compile params, so that the 
code is more robust from a security perspective. 
   
      I haven't found a specific security exploit for these issues, but I 
wouldn't be surprised if someone with enough time and dedication could find a 
way to gain admin access via them... especially if there were a minor update to 
either MADlib's deep learning library or keras which doesn't take into account 
these potential issues.  (And at the very least, this fixes some odd error 
messages and crashes that can happen when strange parameters are passed.)
   
   Changes:
   
   1.  Exclude `deserialize`, `serialize`, `get`, and anything starting with 
`_` from the list of builtin losses and metrics we support.
   
   We were unintentionally allowing users to pass `keras.losses.deserialize`, 
`keras.losses._sys`, `keras.losses.get`, `keras.losses.__builtins__`, etc. as 
loss functions to keras, and the same for `keras.metrics.*` for metric 
functions.  Most of what's in `keras.losses.*` are loss functions or classes 
deriving from `keras.losses.Loss`, and most of what's in `keras.metrics` are 
metric functions or classes deriving from `keras.metrics.Metric`, but there are 
also these extra classes intended for other purposes (not intended for use 
directly as metrics or losses).
   
   The `deserialize()` function is particularly dangerous, in part because they 
accept exactly 2 arguments which means, even if you just pass something as 
simple as `loss=deserialize`, it will be called by keras itself during 
evaluation as `deserialize(y_true, y_pred)`.  And in part because the purpose 
of the `deserialize()` utility function is to take a string from the user and 
turn it into an instance of a callable class, which will then be called by 
keras.  In other words, it's a way of allowing users to pass in their own 
custom functions, but in a way such that MADlib only sees it as a `str` to pass 
along (rather than an arbitrary callable python object, which we do not allow 
ordinary users to pass through to keras).
   
   The danger is significantly mitigated by the fact that we don't allow any 
characters after the function name to be passed, even as a string.  For 
example, if the user were allowed to pass `loss=deserialize(..., ...)` then it 
would be fairly easy to gain admin access.  But we only allow 
`loss=deserialize` which makes it far more difficult if not impossible to 
exploit.  Nevertheless, this PR completely disables the use of such functions 
for losses or metrics, in case there is some way to trick keras into passing 
malicious arguments for y_true and y_pred.
   
   2.  Another potentially risky behavior was the input validator allowing the 
user to pass as a metric any string which contains the substring 
`top_k_categorical_accuracy`.  This provides a loophole that circumvents the 
intention to only allow metric names which are either a builtin keras metric or 
something in the custom functions object table.  For example, if the user 
passes `metrics=[system("echo allow all >> 
${MASTER_DATA_DIRECTORY}/pg_hba.conf")] /* top_k_categorical_accuracy */, 
losses=...` for compile params, that will be considered a valid input by the 
input validator and passed along.  Once again, this doesn't appear to be an 
immediate security issue, as later we run a `literal_eval()` on it, which will 
error out.  But it's safest if we catch things like this up front during input 
validation.
   
   From what I can tell, explicitly whitelisting *top_k_categorical_acurracy* 
for a metric is unnecessary, since it is already a builtin keras function, and 
we don't allow any arguments to be passed to it anyway, it must be an exact 
match to a builtin or a custom function.
   
   3.  We were comparing the metrics parameter value substring `metrics[2:-2]` 
to the builtin and custom function names, which implicitly assumes that the 
first 2 characters are [' or [" and the last two are '] or "].
   
   Handling this more carefully using a literal eval to extract the string 
inside the first element of the list prevents sneaky inputs like 
`metrics=[*__builtins__ ]`.  (We already validate elsewhere that the full 
metrics parameter value is of type `list`.)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to