On Tue, Oct 2, 2018 at 5:07 PM Gael Varoquaux <gael.varoqu...@normalesup.org> wrote:
> The reason that pickles are brittle and that sharing pickles is a bad
> practice is that pickle uses an implicitly defined data model, which is
> defined via the internals of objects.

Plus the fact that loading a pickle can execute arbitrary code, and there is
no way to know in advance whether any malicious code is in there, because the
contents of a pickle cannot easily be inspected without loading/executing it.

> So, the problems of pickle are not specific to pickle, but rather
> intrinsic to any generic persistence code [*]. Writing persistence code
> that does not fall in these problems is very costly in terms of developer
> time and makes it harder to add new methods or improve existing ones. I am
> not excited about it.

My "text-based serialization" suggestion was nowhere near as ambitious as
that. As I have already explained, it wasn't aiming at solving the versioning
issues, but rather at having something that is "about as good" as pickle, in a
human-readable format. I am not asking for a Turing-complete language to
reproduce the prediction function, just something simple in the spirit of the
output produced by the gist code I linked above, for the model families where
it is reasonable:

https://gist.github.com/jlopezpena/2cdd09c56afda5964990d5cf278bfd31

The code I posted mostly works (specific cases of nested models need to be
addressed separately, as do pipelines), and we have been using (a version of)
it in production for quite some time. But there are hackish aspects of it that
we are not happy with, such as separating init and fitted parameters manually
by checking whether an attribute name ends with "_", having to infer the class
name and location from "model.__class__.__name__" and "model.__module__", and
the wacky use of "__import__".
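For concreteness, the approach described above could be sketched roughly as
follows. This is a simplified stand-in for the gist, not the actual code: the
helper names "to_dict" and "from_dict" are made up, and only plain scalar and
array attributes are handled.

```python
import json

import numpy as np
from sklearn.linear_model import LogisticRegression


def to_dict(model):
    # Fitted attributes are found via the trailing-underscore convention;
    # init parameters come from get_params(). Arrays become lists so the
    # result is plain, JSON-serializable, human-readable text.
    fitted = {
        name: value.tolist() if isinstance(value, np.ndarray) else value
        for name, value in vars(model).items()
        if name.endswith("_") and not name.startswith("_")
    }
    return {
        "class_name": model.__class__.__name__,  # inferred, as in the gist
        "module": model.__module__,
        "init_params": model.get_params(),
        "fit_params": fitted,
    }


def from_dict(data):
    # The "wacky" __import__ step: resolve the class from its string location.
    module = __import__(data["module"], fromlist=[data["class_name"]])
    cls = getattr(module, data["class_name"])
    model = cls(**data["init_params"])
    for name, value in data["fit_params"].items():
        setattr(model, name, np.asarray(value) if isinstance(value, list) else value)
    return model


clf = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])
text = json.dumps(to_dict(clf))      # human-readable serialization
clone = from_dict(json.loads(text))  # reconstructed estimator
```

The round trip only works because the trailing-underscore convention happens
to hold for this estimator; that fragility is exactly the hackish part.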
My suggestion was more along the lines of adding some metadata to sklearn
estimators so that code in a similar style would be nicer to write. Little
things would help: "init_parameters" and "fit_parameters" properties that
would return the lists of named parameters; a "model_info" method that would
return data like the sklearn version, class name, and location; or a
package-level dictionary pointing at the estimator classes by string name,
like

    from sklearn.linear_model import LogisticRegression

    estimator_classes = {"LogisticRegression": LogisticRegression, ...}

so that one can load the appropriate class from the string description without
calling __import__ or eval; that sort of stuff.

I am aware this would not address the common complaint about "perfect
prediction reproducibility" across versions, but I think we can all agree that
this utopia of perfect reproducibility is not feasible. And in the long, long
run, I agree that PFA/ONNX, or whichever similar format emerges, is the way
to go.

J
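To illustrate the registry idea, a sketch might look like the following;
"estimator_classes" and "build_estimator" are made-up names for this example,
not an existing sklearn API.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Ridge

# Hypothetical package-level registry mapping string names to estimator
# classes, so serialized metadata can be resolved without __import__ or eval.
estimator_classes = {
    "LogisticRegression": LogisticRegression,
    "Ridge": Ridge,
    "RandomForestClassifier": RandomForestClassifier,
}


def build_estimator(class_name, init_params):
    # Look the class up by its string name and instantiate it with the
    # deserialized init parameters.
    cls = estimator_classes[class_name]
    return cls(**init_params)


model = build_estimator("LogisticRegression", {"C": 0.5, "max_iter": 200})
```

Something close to this mapping can already be assembled today from
sklearn.utils.all_estimators, which enumerates (name, class) pairs, but
having it blessed at the package level is the point of the suggestion.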
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn