[scikit-learn] [ANN] Scikit-learn 0.20.0
On 02/10/2018 at 16:46, Andreas Mueller wrote:
> Thank you for your feedback Alex!

Thanks for answering!

> On 10/02/2018 09:28 AM, Alex Garel wrote:
>>
>> * chunk processing (a kind of handling of streaming data): when dealing
>> with lots of data, the ability to partial_fit, then use transform on
>> chunks of data, is a great help. But it is not well exposed in the
>> current doc and API.
>>
> This has been discussed in the past, but it looks like no one was
> excited enough about it to add it to the roadmap. This would require
> quite some additions to the API. Olivier, who has been quite interested
> in this before, now seems to be more interested in integration with
> dask, which might achieve the same thing.

I have tried to use Dask on my side, and though I got quite far, I did
not completely succeed, because of memory issues in my specific case
(Dask's default schedulers do not specialize processes on tasks, and I
had some memory-consuming tasks; I did not get far enough to write my own
scheduler). I might deal with that later (not by writing a scheduler, but
by sharing memory with mmap in this case). But yes, Dask takes the "chunk
instead of really streaming" approach, which was my point. (A minimal
sketch of the chunked partial_fit pattern is included below.)

>> * and a lot of models do not support it, while they could.
>>
> Can you give examples of that?

Hmm, maybe I spoke too fast! Grepping the code gives me some examples at
least, and it is true that a DecisionTree does not support it naturally.

>> * Also, pipelines do not support partial_fit, and there is no
>> fit_transform_partial.
>>
> What would you expect those to do? Each step in the pipeline might
> require passing over the whole dataset multiple times before being able
> to transform anything. That basically makes the current interface
> impossible to work with the pipeline. Even if only a single pass over
> the dataset were required, it would not work with the current interface.
> If we were handing around generators that allow looping over the whole
> data, that would work, but it would be unclear how to support a
> streaming setting.

You are right, I did not think hard enough about it! By the way, I ran
some tests using generators, making fit/transform build pipelines that I
consumed later on (I tried plain iterators and streamz). It did work,
with many hacks, but in my specific case performance was not good enough.
(The real problem was not framework performance but my architecture: I
realized that constantly re-generating the data, instead of doing it
once, was not fast enough.)

So in the end my points were not so good, but at least I learned
something ;-) Thanks for your time.

-- 
Alexandre Garel
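A minimal sketch of the chunked partial_fit pattern discussed above,
using SGDClassifier (one of the estimators that really does implement
partial_fit). The chunk generator here is synthetic and stands in for any
iterable of (X, y) batches, e.g. chunks read from disk:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Hypothetical chunk source: any iterable of (X, y) batches works.
    def iter_chunks(n_chunks=10, chunk_size=1000, n_features=20, seed=0):
        rng = np.random.RandomState(seed)
        for _ in range(n_chunks):
            X = rng.randn(chunk_size, n_features)
            y = (X[:, 0] > 0).astype(int)
            yield X, y

    clf = SGDClassifier(loss="log")  # supports incremental learning
    classes = np.array([0, 1])       # all classes must be known up front

    for X_chunk, y_chunk in iter_chunks():
        # `classes` is required on the first call to partial_fit
        clf.partial_fit(X_chunk, y_chunk, classes=classes)

Transformers that support partial_fit (e.g. StandardScaler) can be
updated the same way, but, as noted above, chaining this through a
Pipeline is not supported by the current API.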
Re: [scikit-learn] [ANN] Scikit-learn 0.20.0
On Tue, Oct 2, 2018 at 5:07 PM Gael Varoquaux wrote:
> The reason that pickles are brittle and that sharing pickles is a bad
> practice is that pickle uses an implicitly defined data model, which is
> defined via the internals of objects.

Plus the fact that loading a pickle can execute arbitrary code, and there
is no way to know in advance whether any malicious code is in there,
because the contents of the pickle cannot be easily inspected without
loading/executing it.

> So, the problems of pickle are not specific to pickle, but rather
> intrinsic to any generic persistence code [*]. Writing persistence code
> that does not fall into these problems is very costly in terms of
> developer time and makes it harder to add new methods or improve
> existing ones. I am not excited about it.

My "text-based serialization" suggestion was nowhere near as ambitious as
that, as I have already explained, and was not aiming at solving the
versioning issues, but rather at having something "about as good" as
pickle but in a human-readable format. I am not asking for a
Turing-complete language to reproduce the prediction function, but rather
something simple in the spirit of the output produced by the gist code I
linked above, just for the model families where it is reasonable:

https://gist.github.com/jlopezpena/2cdd09c56afda5964990d5cf278bfd31

The code I posted mostly works (specific cases of nested models need to
be addressed separately, as well as pipelines), and we have been using (a
version of) it in production for quite some time. But there are hackish
aspects to it that we are not happy with, such as the manual separation
of init and fitted parameters by checking whether the name ends with "_",
having to infer the class name and location using
"model.__class__.__name__" and "model.__module__", and the wacky use of
"__import__".

My suggestion was more along the lines of adding some metadata to sklearn
estimators so that code in a similar style would be nicer to write;
little things like `init_parameters` and `fit_parameters` properties that
would return the lists of named parameters, or a `model_info` method that
would return data like the sklearn version, class name and location, or a
package-level dictionary pointing at the estimator classes by a string
name, like

    from sklearn.linear_model import LogisticRegression
    estimator_classes = {"LogisticRegression": LogisticRegression, ...}

so that one can load the appropriate class from the string description
without calling __import__ or eval; that sort of stuff.

I am aware this would not address the common complaint about "perfect
prediction reproducibility" across versions, but I think we can all agree
that this utopia of perfect reproducibility is not feasible.

And in the long, long run, I agree that PFA/ONNX, or whichever similar
format emerges, is the way to go.

J
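For concreteness, here is a minimal sketch of the hackish approach
described above (the helper names are mine, not sklearn API): init
parameters come from get_params(), fitted attributes are the instance
attributes ending in "_" by sklearn convention, and the class is located
via __module__ and __class__.__name__:

    from importlib import import_module

    def describe_estimator(model):
        """Split a fitted estimator into metadata, init and fitted params."""
        return {
            # class name and location, inferred the "wacky" way
            "class": model.__class__.__name__,
            "module": model.__module__,
            # init parameters, as passed to the constructor
            "init_params": model.get_params(deep=False),
            # fitted parameters: by sklearn convention they end with "_"
            "fit_params": {
                name: value
                for name, value in vars(model).items()
                if name.endswith("_") and not name.startswith("_")
            },
        }

    def rebuild_estimator(spec):
        """Reconstruct the estimator; import_module instead of __import__."""
        cls = getattr(import_module(spec["module"]), spec["class"])
        model = cls(**spec["init_params"])
        for name, value in spec["fit_params"].items():
            setattr(model, name, value)
        return model

Arrays in fit_params would still need converting (e.g. via .tolist())
before JSON dumping, and, as noted above, nested models and pipelines
require separate handling.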
Re: [scikit-learn] [ANN] Scikit-learn 0.20.0
The ONNX approach sounds most promising, especially because it will also
allow library interoperability, but I wonder whether it is for parametric
models only, and not for nonparametric ones like KNN, tree-based
classifiers, etc.

All in all, I can definitely see the appeal of having a way to export
sklearn estimators in a text-based format (e.g., via JSON), since it
would make sharing models easier. This does not even have to be
compatible across multiple sklearn versions. A typical use case would be
to include these JSON exports as, e.g., supplemental files of a research
paper for other people to run the models (here, one can just specify
which sklearn version it requires; of course, one could also share pickle
files, but I am personally always hesitant about running/trusting other
people's pickle files).

Unfortunately though, as Gael pointed out, this "feature" would be a huge
burden for the devs, and it would probably also negatively impact the
development of scikit-learn itself because it imposes another design
constraint.

However, I do think this sounds like an excellent case for a contrib
project, like scikit-export, scikit-serialize or something like that.

Best,
Sebastian

> On Oct 3, 2018, at 5:49 AM, Javier López wrote:
> [...]
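As a toy illustration of the JSON-supplement use case, a sketch building
on the hypothetical describe_estimator helper above (not sklearn API):

    import json

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])
    model = LogisticRegression(solver="lbfgs").fit(X, y)

    spec = describe_estimator(model)  # from the sketch above
    # numpy arrays must be converted before JSON serialization
    spec["fit_params"] = {
        k: v.tolist() if isinstance(v, np.ndarray) else v
        for k, v in spec["fit_params"].items()
    }
    print(json.dumps(spec, indent=2, default=str))

The resulting file is human-readable and diffable, which is exactly what
pickle does not give you; pinning the sklearn version in the metadata is
still needed, since fitted attribute layouts can change between releases.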
Re: [scikit-learn] [ANN] Scikit-learn 0.20.0
For ONNX you may be interested in https://github.com/onnx/onnxmltools,
which already supports conversion of a few sklearn models to ONNX.
However, as far as I am aware, none of the ONNX backends actually
supports the ONNX-ML extended spec (in open source, at least), so I think
you would not be able to actually do prediction.

As for PFA, to my current knowledge there is no library that does it yet.
Our own Aardpfark project (https://github.com/CODAIT/aardpfark) focuses
on SparkML export to PFA for now, but we would like to add sklearn
support in the future.

On Wed, 3 Oct 2018 at 20:07 Sebastian Raschka wrote:
> [...]
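A sketch of the onnxmltools conversion mentioned above, based on that
project's README at the time; the exact import paths and signatures are
an assumption and may differ between versions:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    from onnxmltools import convert_sklearn
    from onnxmltools.convert.common.data_types import FloatTensorType
    from onnxmltools.utils import save_model

    X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]])
    y = np.array([0, 0, 1, 1])
    model = LogisticRegression(solver="lbfgs").fit(X, y)

    # Declare the input signature: a float tensor with 2 features
    initial_type = [("input", FloatTensorType([1, 2]))]
    onnx_model = convert_sklearn(model, initial_types=initial_type)
    save_model(onnx_model, "logreg.onnx")

Running the saved model still requires a backend that implements the
ONNX-ML operators, which, as noted above, was the missing piece in open
source at the time.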