Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-28 Thread Andreas Mueller



On 09/28/2018 04:45 PM, Javier López wrote:



On Fri, Sep 28, 2018 at 8:46 PM Andreas Mueller wrote:


Basically what you're saying is that you're fine with versioning the
models and having the model break loudly if anything changes.
That's not actually what most people want. They want to be able to
make predictions with a given model forever into the future.


Are we talking about "(the new version of) the old model can still 
make predictions" or "the old model makes exactly the same predictions 
as before"? I'd like the first to hold, don't care that much about the 
second.

The second.


We're now storing the version of scikit-learn that was used in the
pickle and warn if you're trying to load with a different version.


This is not the whole truth. Yes, you store the sklearn version on the 
pickle and raise a warning; I am mostly ok with that, but the pickles 
are brittle and oftentimes they stop loading when other versions of 
other stuff change. I am not talking about "Warning: wrong version", 
but rather "Unpickling error: expected bytes, found tuple" that 
prevents the file from loading entirely.

Can you give examples of that? That shouldn't really happen afaik.


That's basically a stricter test than what you wanted. Yes, there are
false positives, but given that this release took a year,
this doesn't seem that big an issue?


1. Things in the current state break when something else changes, not 
only sklearn.

2. Sharing pickles is a bad practice due to a number of reasons.
3. We might want to explore model parameters without having to load
the entire runtime.



I agree, it would be great to have something other than pickle, but as I 
said, the usual request is "I want a way for a model to make the same 
predictions in the future".
If you have a way to do that with a text-based format that doesn't
require writing lots of version converters, I'd be very happy.


Generally, what you want is not to store the model but to store the 
prediction function, and have separate runtimes for training and prediction.
It might not be possible to represent a model from a previous version of 
scikit-learn in a newer version.
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-28 Thread Javier López
On Fri, Sep 28, 2018 at 8:46 PM Andreas Mueller  wrote:

> Basically what you're saying is that you're fine with versioning the
> models and having the model break loudly if anything changes.
> That's not actually what most people want. They want to be able to make
> predictions with a given model forever into the future.
>

Are we talking about "(the new version of) the old model can still make
predictions" or "the old model makes exactly the same predictions as
before"? I'd like the first to hold, don't care that much about the second.


>
> Your use-case is similar, but if retraining the model is not an issue,
> why don't you want to retrain every time scikit-learn releases a new
> version?
>

Thousands of models. I don't want to retrain ALL of them unless needed.


> We're now storing the version of scikit-learn that was used in the
> pickle and warn if you're trying to load with a different version.


This is not the whole truth. Yes, you store the sklearn version on the
pickle and raise a warning; I am mostly ok with that, but the pickles are
brittle and oftentimes they stop loading when other versions of other stuff
change. I am not talking about "Warning: wrong version", but rather
"Unpickling error: expected bytes, found tuple" that prevents the file from
loading entirely.


> That's basically a stricter test than what you wanted. Yes, there are
> false positives, but given that this release took a year,
> this doesn't seem that big an issue?
>

1. Things in the current state break when something else changes, not only
sklearn.
2. Sharing pickles is a bad practice due to a number of reasons.
3. We might want to explore model parameters without having to load the
entire runtime.

Also, in order to retrain the model we need to keep the whole model
description with parameters. This needs to be saved somewhere, which in the
current state would force us to keep two files: one with the parameters (in
a text format to avoid the "non-loading" problems from above) and the pkl
with the fitted model. My proposal would keep both in a single file.
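One way to sketch that single-file idea (purely hypothetical, nothing scikit-learn provides; `bundle`, `read_params`, and `read_model` are made-up names) is a JSON envelope that keeps the human-readable parameters next to a base64-encoded pickle of the fitted model:

```python
# Hypothetical sketch of the "single file" proposal: a JSON envelope holding
# the text parameters alongside the base64-encoded fitted-model pickle.
# None of these names are scikit-learn API.
import base64
import json
import pickle


def bundle(params, fitted_model):
    """Serialize text params and the fitted model (pickled) into one JSON string."""
    return json.dumps({
        "params": params,  # human-readable, inspectable without unpickling
        "pickle": base64.b64encode(pickle.dumps(fitted_model)).decode("ascii"),
    })


def read_params(payload):
    """Inspect the parameters without loading any model runtime."""
    return json.loads(payload)["params"]


def read_model(payload):
    """Restore the fitted object; this part still has pickle's version caveats."""
    return pickle.loads(base64.b64decode(json.loads(payload)["pickle"]))
```

The parameter half stays readable even when the pickle half stops loading, which is the point of keeping them together.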

As mentioned in previous emails, we already have our own solution that
kind-of-works for our needs, but we have to do a few hackish things to keep
things running. If sklearn estimators simply included a text serialization
method (similar in spirit to the one used for __display__ or __repr__) it
would make things easier.

But I understand that not everyone's needs are the same, so if you guys
don't consider this type of thing a priority, we can live with that :) I
mostly mentioned it since "Backwards-compatible de/serialization of some
estimators" is listed in the roadmap as a desirable goal for version 1.0
and feedback on such roadmap was requested.

J


Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-28 Thread Andreas Mueller



On 09/28/2018 03:20 PM, Javier López wrote:
I understand the difficulty of the situation, but an approximate 
solution to that is saving the predictions from a large enough 
validation set. If the prediction for the newly created model are 
"close enough" to the old ones, we deem the unserialized model to be 
the same and move forward, if there are serious discrepancies, then we 
dive deep to see what's going on, and if needed refit the offending 
submodels with the newer version.


Basically what you're saying is that you're fine with versioning the 
models and having the model break loudly if anything changes.
That's not actually what most people want. They want to be able to make 
predictions with a given model forever into the future.


Your use-case is similar, but if retraining the model is not an issue, 
why don't you want to retrain every time scikit-learn releases a new 
version?
We're now storing the version of scikit-learn that was used in the 
pickle and warn if you're trying to load with a different version.
That's basically a stricter test than what you wanted. Yes, there are 
false positives, but given that this release took a year,

this doesn't seem that big an issue?
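The version-stamping behaviour described above can be sketched roughly as follows. This is a simplification for illustration: in scikit-learn the version is stored via the estimator's own pickling hooks rather than a wrapper dict, and "0.20.0" stands in for `sklearn.__version__`.

```python
# Rough sketch: stamp the pickle with the producing library's version and
# warn (rather than fail) on a mismatch at load time. A simplification of
# the real mechanism, which lives in the estimator's __getstate__/__setstate__.
import pickle
import warnings

CURRENT_VERSION = "0.20.0"  # stand-in for sklearn.__version__


def dump_with_version(obj, path):
    with open(path, "wb") as f:
        pickle.dump({"_version": CURRENT_VERSION, "obj": obj}, f)


def load_with_version_check(path):
    with open(path, "rb") as f:
        payload = pickle.load(f)
    if payload["_version"] != CURRENT_VERSION:
        warnings.warn(
            f"Trying to unpickle an object written by version "
            f"{payload['_version']} under version {CURRENT_VERSION}; "
            f"this may lead to breaking code or invalid results."
        )
    return payload["obj"]
```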


Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-28 Thread Javier López
On Fri, Sep 28, 2018 at 6:41 PM Andreas Mueller  wrote:

> Javier:
> The problem is not so much storing the "model" but storing how to make
> predictions. Different versions could act differently
> on the same data structure - and the data structure could change. Both
> happen in scikit-learn.
> So if you want to make sure the right thing happens across versions, you
> either need to provide serialization and deserialization for
> every version and conversion between those or you need to provide a way
> to store the prediction function,
> which basically means you need a Turing-complete language (that's what
> ONNX does).
>

I understand the difficulty of the situation, but an approximate solution
to that is saving the predictions from a large enough validation set. If
the prediction for the newly created model are "close enough" to the old
ones, we deem the unserialized model to be the same and move forward, if
there are serious discrepancies, then we dive deep to see what's going on,
and if needed refit the offending submodels with the newer version.
Since we only want to compare the predictions here, we don't need a ground
truth and thus the validation set doesn't even need to be a real dataset,
it can consist of synthetic datapoints created via SMOTE, Caruana's MUNGE
algorithm, or any other method, and can be made arbitrarily large in
advance.
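The comparison step might look something like this (a sketch; the tolerances are illustrative and would need tuning per problem):

```python
# Sketch of the "close enough predictions" check: a fixed synthetic
# validation set needs no ground truth, because only agreement between
# the old and new model's outputs is measured.
import numpy as np


def models_agree(old_preds, new_preds, rtol=1e-5, atol=1e-8):
    """Deem the reloaded model equivalent if its predictions match the
    predictions recorded under the previous library version."""
    old_preds = np.asarray(old_preds)
    new_preds = np.asarray(new_preds)
    if old_preds.shape != new_preds.shape:
        return False
    return bool(np.allclose(old_preds, new_preds, rtol=rtol, atol=atol))
```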

This method has worked reasonably well for us in practice; we deal with
ensembles containing
hundreds or thousands of models, and this technique saves us from having to
refit many of them that don't change very often, and if something changes a
lot, we want to know in either case to ascertain what was amiss (either
with the old version or with the new one).

The situation I am proposing is not worse than what we have right now,
which is save a pickle and then hope that it can be read later on;
sometimes it can, sometimes it cannot depending on what changed. Stuff
unrelated to the models themselves, such as changes in the joblib dump
method broke several of our pickle files in the past. What I would like to
have is a text-based representation of the fitted model that can always be
read, stored in a database, or sent over the wire through simple methods.

J


Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-28 Thread Manuel CASTEJÓN LIMAS via scikit-learn
How about a Docker-based approach? Just thinking out loud.
Best,
Manuel

On Fri, Sep 28, 2018, 19:43, Andreas Mueller wrote:

>
>
> On 09/28/2018 01:38 PM, Andreas Mueller wrote:
> >
> >
> > On 09/28/2018 12:10 PM, Sebastian Raschka wrote:
>  I think model serialization should be a priority.
> >>> There is also the ONNX specification that is gaining industrial
> >>> adoption and that already includes open source exporters for several
> >>> families of scikit-learn models:
> >>>
> >>> https://github.com/onnx/onnxmltools
> >>
> >> Didn't know about that. This is really nice! What do you think about
> >> referring to it under
> >> http://scikit-learn.org/stable/modules/model_persistence.html to make
> >> people aware that this option exists?
> >> Would be happy to add a PR.
> >>
> >>
> > I don't think an open source runtime has been announced yet (or they
> > didn't email me like they promised lol).
> > I'm quite excited about this as well.
> >
> > Javier:
> > The problem is not so much storing the "model" but storing how to make
> > predictions. Different versions could act differently
> > on the same data structure - and the data structure could change. Both
> > happen in scikit-learn.
> > So if you want to make sure the right thing happens across versions,
> > you either need to provide serialization and deserialization for
> > every version and conversion between those or you need to provide a
> > way to store the prediction function,
> > which basically means you need a Turing-complete language (that's what
> > ONNX does).
> >
> > We basically said doing the first is not feasible within scikit-learn
> > given our current amount of resources, and no-one
> > has even tried doing it outside of scikit-learn (which would be
> > possible).
> > Implementing a complete prediction serialization language (the second
> > option) is definitely outside the scope of sklearn.
> >
> >
> Maybe we should add to the FAQ why serialization is hard?


Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-28 Thread Andreas Mueller




On 09/28/2018 01:38 PM, Andreas Mueller wrote:



On 09/28/2018 12:10 PM, Sebastian Raschka wrote:

I think model serialization should be a priority.
There is also the ONNX specification that is gaining industrial 
adoption and that already includes open source exporters for several 
families of scikit-learn models:


https://github.com/onnx/onnxmltools


Didn't know about that. This is really nice! What do you think about 
referring to it under 
http://scikit-learn.org/stable/modules/model_persistence.html to make 
people aware that this option exists?

Would be happy to add a PR.


I don't think an open source runtime has been announced yet (or they 
didn't email me like they promised lol).

I'm quite excited about this as well.

Javier:
The problem is not so much storing the "model" but storing how to make 
predictions. Different versions could act differently
on the same data structure - and the data structure could change. Both 
happen in scikit-learn.
So if you want to make sure the right thing happens across versions, 
you either need to provide serialization and deserialization for
every version and conversion between those or you need to provide a 
way to store the prediction function,
which basically means you need a Turing-complete language (that's what 
ONNX does).


We basically said doing the first is not feasible within scikit-learn 
given our current amount of resources, and no-one
has even tried doing it outside of scikit-learn (which would be 
possible).
Implementing a complete prediction serialization language (the second 
option) is definitely outside the scope of sklearn.




Maybe we should add to the FAQ why serialization is hard?


Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-28 Thread Andreas Mueller




On 09/28/2018 12:10 PM, Sebastian Raschka wrote:

I think model serialization should be a priority.

There is also the ONNX specification that is gaining industrial adoption and 
that already includes open source exporters for several families of 
scikit-learn models:

https://github.com/onnx/onnxmltools


Didn't know about that. This is really nice! What do you think about referring 
to it under http://scikit-learn.org/stable/modules/model_persistence.html to 
make people aware that this option exists?
Would be happy to add a PR.


I don't think an open source runtime has been announced yet (or they 
didn't email me like they promised lol).

I'm quite excited about this as well.

Javier:
The problem is not so much storing the "model" but storing how to make 
predictions. Different versions could act differently
on the same data structure - and the data structure could change. Both 
happen in scikit-learn.
So if you want to make sure the right thing happens across versions, you 
either need to provide serialization and deserialization for
every version and conversion between those or you need to provide a way 
to store the prediction function,
which basically means you need a Turing-complete language (that's what 
ONNX does).


We basically said doing the first is not feasible within scikit-learn 
given our current amount of resources, and no-one

has even tried doing it outside of scikit-learn (which would be possible).
Implementing a complete prediction serialization language (the second 
option) is definitely outside the scope of sklearn.





Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-28 Thread Sebastian Raschka
> 
> > I think model serialization should be a priority.
> 
> There is also the ONNX specification that is gaining industrial adoption and 
> that already includes open source exporters for several families of 
> scikit-learn models:
> 
> https://github.com/onnx/onnxmltools


Didn't know about that. This is really nice! What do you think about referring 
to it under http://scikit-learn.org/stable/modules/model_persistence.html to 
make people aware that this option exists?
Would be happy to add a PR.

Best,
Sebastian



> On Sep 28, 2018, at 9:30 AM, Olivier Grisel  wrote:
> 
> 
> > I think model serialization should be a priority.
> 
> There is also the ONNX specification that is gaining industrial adoption and 
> that already includes open source exporters for several families of 
> scikit-learn models:
> 
> https://github.com/onnx/onnxmltools
> 
> -- 
> Olivier



Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-28 Thread Manuel CASTEJÓN LIMAS via scikit-learn
Huge, huge thank you, developers!
Keep up the good work!

On Wed, Sep 26, 2018, 20:57, Andreas Mueller wrote:

> Hey everbody!
> I'm happy to (finally) announce scikit-learn 0.20.0.
> This release is dedicated to the memory of Raghav Rajagopalan.
>
> You can upgrade now with pip or conda!
>
> There are many important additions and updates, and you can find the full
> release notes here:
> http://scikit-learn.org/stable/whats_new.html#version-0-20
>
> My personal highlights are the ColumnTransformer and the changes to
> OneHotEncoder,
> but there's so much more!
>
> An important note is that this is the last version to support Python 2.7,
> and the
> next release will require Python 3.5.
>
> A big thank you to everybody who contributed and special thanks to Joel!
>
> All the best,
> Andy


Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-28 Thread Olivier Grisel
>
>
> > I think model serialization should be a priority.
>

There is also the ONNX specification that is gaining industrial adoption
and that already includes open source exporters for several families of
scikit-learn models:

https://github.com/onnx/onnxmltools

-- 
Olivier


Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-28 Thread Javier López
On Fri, Sep 28, 2018 at 1:03 AM Sebastian Raschka wrote:

> Chris Emmery, Chris Wagner and I toyed around with JSON a while back (
> https://cmry.github.io/notes/serialize), and it could be feasible


I came across your notes a while back, they were really useful!
I hacked a variation of it that didn't need to know the model class in
advance:
https://gist.github.com/jlopezpena/2cdd09c56afda5964990d5cf278bfd31
but it is VERY hackish, and it doesn't work with complex models with nested
components. (At work we use a further variation of this that also works on
pipelines and some specific nested stuff, like `mlxtend`'s
`SequentialFeatureSelector`)


> but yeah, it will involve some work, especially with testing things
> thoroughly for all kinds of estimators. Maybe this could somehow be
> automated though in a grid-search kind of way with a build matrix for
> estimators and parameters once a general framework has been developed.
>

I considered making this serialization into an external project, but I
think this would be much easier if estimators provided a dunder method
`__serialize__` (or whatever) that would handle the idiosyncrasies of each
particular family; I don't believe there will be a "one-size-fits-all"
solution for this problem. This approach would also make it possible to
work on it incrementally, raising a default `NotImplementedError` for
estimators that haven't been addressed yet.
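As a very rough illustration of what such a method could do for simple estimators (`to_json`/`from_json` are hypothetical names, not scikit-learn API, and nested estimators, random states, and exotic attribute types are deliberately ignored):

```python
# Very rough sketch of a text serialization for simple estimators:
# constructor params from get_params() plus fitted attributes (the ones
# ending in "_", by scikit-learn convention) flattened into JSON.
# `to_json`/`from_json` are made-up names; this handles only plain numpy
# arrays, numpy scalars, and JSON-native values.
import json

import numpy as np
from sklearn.linear_model import LinearRegression


def to_json(est):
    fitted = {}
    for name, value in vars(est).items():
        if not name.endswith("_") or name.startswith("_"):
            continue  # only public fitted attributes
        if isinstance(value, np.ndarray):
            fitted[name] = {"__ndarray__": value.tolist()}
        elif isinstance(value, np.generic):
            fitted[name] = value.item()  # numpy scalar -> Python scalar
        else:
            fitted[name] = value
    return json.dumps({"params": est.get_params(), "fitted": fitted})


def from_json(cls, payload):
    state = json.loads(payload)
    est = cls(**state["params"])
    for name, value in state["fitted"].items():
        if isinstance(value, dict) and "__ndarray__" in value:
            value = np.array(value["__ndarray__"])
        setattr(est, name, value)
    return est
```

A round trip on a fitted `LinearRegression` then reproduces its predictions, at least in this trivial case; anything with nested estimators would need the per-family handling discussed above.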

In the long run, I also believe that the "proper" way to do this is to
allow dumping entire processes into PFA: http://dmg.org/pfa/docs/motivation/


[scikit-learn] Full time jobs to work on scikit-learn in Paris

2018-09-28 Thread Gael Varoquaux
Dear list,

I am very happy to announce that the Inria foundation is looking to hire
two people to work on scikit-learn in France:

* One Community and Operation Officer:
  https://scikit-learn.fondation-inria.fr/job_coo/
  We need a good mix of communication, organizational, and technical skills 
  to help the team and the community work best together.

* One Performance and Quality Engineer:
  https://scikit-learn.fondation-inria.fr/en/job_performance/
  We need someone who cares about tests, continuous integration, and
  performance, to help make scikit-learn faster while guaranteeing that
  it stays as solid as it is.

Please forward this announcement to anyone who might be interested.

Best,

Gaël