On Tue, Oct 2, 2018 at 5:07 PM Gael Varoquaux <gael.varoqu...@normalesup.org>
wrote:

> The reason that pickles are brittle and that sharing pickles is a bad
> practice is that pickle uses an implicitly defined data model, which is
> defined via the internals of objects.
>

Plus the fact that loading a pickle can execute arbitrary code, and there is
no way to know in advance whether any malicious code is in there, because the
contents of the pickle cannot be easily inspected without loading/executing it.
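
To make the arbitrary-code point concrete, here is a minimal and deliberately
harmless sketch using only the standard library. The `Payload` class is a
hypothetical stand-in; a real attack would put something far less benign than
an arithmetic expression in its place.

```python
import pickle


class Payload:
    """Harmless, hypothetical stand-in for a malicious object."""

    def __reduce__(self):
        # pickle records this callable and its arguments in the byte stream;
        # pickle.loads() then calls it while loading.
        return (eval, ("6 * 7",))


blob = pickle.dumps(Payload())

# Loading does not rebuild a Payload: it runs eval("6 * 7") and returns
# whatever that call returns.
result = pickle.loads(blob)
print(result)  # 42
```

Nothing in the calling code asks for `eval` to run; the call happens purely as
a side effect of loading the bytes.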


> So, the problems of pickle are not specific to pickle, but rather
> intrinsic to any generic persistence code [*]. Writing persistence code
> that does not fall into these problems is very costly in terms of developer
> time and makes it harder to add new methods or improve existing ones. I am
> not excited about it.
>

My "text-based serialization" suggestion was nowhere near as ambitious as
that, as I have already explained, and wasn't aiming at solving the versioning
issues, but rather at having something which is "about as good" as pickle but
in a human-readable format. I am not asking for a Turing-complete language to
reproduce the prediction function, but rather something simple in the spirit
of the output produced by the gist code I linked above, just for the model
families where it is reasonable:

https://gist.github.com/jlopezpena/2cdd09c56afda5964990d5cf278bfd31

The code I posted mostly works (specific cases of nested models need to be
addressed separately, as well as pipelines), and we have been using (a version
of) it in production for quite some time. But there are hackish aspects to it
that we are not happy with, such as the manual separation of init and fitted
parameters by checking if the name ends with "_", having to infer class name
and location using "model.__class__.__name__" and "model.__module__", and the
wacky use of "__import__".
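
For readers who have not opened the gist, the following is a rough sketch of
the kind of dump step described above. It is not the gist's code. The
`TinyEstimator` class and the `describe` helper are hypothetical names, and
the stand-in class only mimics the scikit-learn convention that fitted
attributes end with "_", so the example runs without scikit-learn installed.

```python
import json


class TinyEstimator:
    """Hypothetical stand-in that follows the scikit-learn naming convention."""

    def __init__(self, C=1.0, fit_intercept=True):
        self.C = C
        self.fit_intercept = fit_intercept

    def get_params(self):
        # Simplified version of the estimator method of the same name.
        return {"C": self.C, "fit_intercept": self.fit_intercept}

    def fit(self):
        # Pretend training happened; fitted attributes end with "_".
        self.coef_ = [0.5, -1.25]
        self.intercept_ = 0.1
        return self


def describe(model):
    """Return a JSON-friendly description of a fitted model."""
    # The manual separation: anything ending in "_" is treated as fitted state.
    fitted = {
        name: value
        for name, value in vars(model).items()
        if name.endswith("_") and not name.startswith("_")
    }
    return {
        "class": model.__class__.__name__,
        "module": model.__module__,
        "init_params": model.get_params(),
        "fitted_params": fitted,
    }


description = describe(TinyEstimator(C=10.0).fit())
text = json.dumps(description, indent=2, sort_keys=True)
print(text)
```

With real estimators the fitted attributes are mostly NumPy arrays, so a step
such as calling `.tolist()` on them would be needed before `json.dumps`.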

My suggestion was more along the lines of adding some metadata to sklearn
estimators so that code in a similar style would be nicer to write; little
things like having `init_parameters` and `fit_parameters` properties that
would return the lists of named parameters, or a `model_info` method that would
return data like sklearn version, class name and location, or a package-level
dictionary pointing at the estimator classes by a string name, like

from sklearn.linear_model import LogisticRegression
estimator_classes = {"LogisticRegression": LogisticRegression, ...}

so that one can load the appropriate class from the string description
without calling __import__ or eval; that sort of stuff.
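
As a sketch of how such a dictionary could be used on the loading side,
assuming a JSON description with "class", "init_params" and "fitted_params"
keys: the names `ESTIMATOR_CLASSES`, `load_model` and `TinyLogisticRegression`
are hypothetical, and none of this is an existing scikit-learn API.

```python
import json


class TinyLogisticRegression:
    """Hypothetical stand-in for an estimator class."""

    def __init__(self, C=1.0):
        self.C = C


# Hypothetical package-level registry mapping string names to classes.
ESTIMATOR_CLASSES = {"TinyLogisticRegression": TinyLogisticRegression}


def load_model(text, registry=ESTIMATOR_CLASSES):
    """Rebuild a model from its JSON description without __import__ or eval."""
    description = json.loads(text)
    name = description["class"]
    if name not in registry:
        # Only classes listed in the registry can ever be instantiated.
        raise ValueError(f"unknown estimator class: {name!r}")
    model = registry[name](**description["init_params"])
    for attr, value in description["fitted_params"].items():
        if not attr.endswith("_") or attr.startswith("_"):
            raise ValueError(f"not a fitted attribute name: {attr!r}")
        setattr(model, attr, value)
    return model


text = json.dumps(
    {
        "class": "TinyLogisticRegression",
        "init_params": {"C": 0.5},
        "fitted_params": {"coef_": [[1.0, -2.0]], "intercept_": [0.25]},
    }
)
model = load_model(text)
print(type(model).__name__, model.C, model.coef_, model.intercept_)
```

Because the class is looked up in a fixed dictionary, a description naming
anything outside the registry is rejected instead of being imported.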

I am aware this would not address the common complaint about "perfect
prediction reproducibility" across versions, but I think we can all agree that
this utopia of perfect reproducibility is not feasible.

And in the long, long run, I agree that PFA/ONNX, or whichever similar format
emerges, is the way to go.

J