[scikit-learn] [ANN] Scikit-learn 0.20.0
On 02/10/2018 at 16:46, Andreas Mueller wrote:
> Thank you for your feedback Alex!

Thanks for answering!

> On 10/02/2018 09:28 AM, Alex Garel wrote:
>>
>> * chunk processing (a kind of handling of streaming data): when dealing
>> with lots of data, the ability to partial_fit, then use transform on
>> chunks of data, is a great help. But it is not well exposed in the
>> current doc and API.
>>
> This has been discussed in the past, but it looks like no one was
> excited enough about it to add it to the roadmap. This would require
> quite some additions to the API. Olivier, who has been quite interested
> in this before, now seems to be more interested in integration with
> dask, which might achieve the same thing.

I have tried to use Dask on my side, and though I got quite far, I did
not completely succeed, because of memory issues in my specific case
(Dask's default schedulers do not specialize processes on tasks, and I
had some memory-consuming tasks; I did not get far enough to write my own
scheduler). I might deal with that later (not by writing a scheduler, but
by sharing memory with mmap in this case). But yes, Dask takes the "chunk
instead of really streaming" approach, which was my point. (A minimal
sketch of the chunked partial_fit pattern is included below.)

>> * and a lot of models do not support it, while they could.
>>
> Can you give examples of that?

Hmm, maybe I spoke too fast! Grepping the code gives me some examples at
least, and it is true that a DecisionTree does not support it naturally.

>> * Also, pipelines do not support partial_fit, and there is no
>> fit_transform_partial.
>>
> What would you expect those to do? Each step in the pipeline might
> require passing over the whole dataset multiple times before being able
> to transform anything. That basically makes the current interface
> impossible to work with the pipeline. Even if only a single pass over
> the dataset were required, it would not work with the current interface.
> If we were handing around generators that allow looping over the whole
> data, that would work, but it would be unclear how to support a
> streaming setting.

You are right, I did not think hard enough about it! By the way, I ran
some tests using generators, making fit/transform build pipelines that I
consumed later on (I tried plain iterators and streamz). It did work,
with many hacks, but in my specific case performance was not good enough.
(The real problem was not framework performance but my architecture: I
realized that constantly re-generating the data, instead of doing it
once, was not fast enough.)

So in the end my points were not so good, but at least I learned
something ;-) Thanks for your time.

-- 
Alexandre Garel
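A minimal sketch of the chunked partial_fit pattern discussed above,
using SGDClassifier (one of the estimators that really does implement
partial_fit). The chunk generator here is synthetic and stands in for any
iterable of (X, y) batches, e.g. chunks read from disk:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # Hypothetical chunk source: any iterable of (X, y) batches works.
    def iter_chunks(n_chunks=10, chunk_size=1000, n_features=20, seed=0):
        rng = np.random.RandomState(seed)
        for _ in range(n_chunks):
            X = rng.randn(chunk_size, n_features)
            y = (X[:, 0] > 0).astype(int)
            yield X, y

    clf = SGDClassifier(loss="log")  # supports incremental learning
    classes = np.array([0, 1])       # all classes must be known up front

    for X_chunk, y_chunk in iter_chunks():
        # `classes` is required on the first call to partial_fit
        clf.partial_fit(X_chunk, y_chunk, classes=classes)

Transformers that support partial_fit (e.g. StandardScaler) can be
updated the same way, but, as noted above, chaining this through a
Pipeline is not supported by the current API.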
Re: [scikit-learn] [ANN] Scikit-learn 0.20.0
On Tue, Oct 2, 2018 at 5:07 PM Gael Varoquaux wrote:
> The reason that pickles are brittle and that sharing pickles is a bad
> practice is that pickle uses an implicitly defined data model, which is
> defined via the internals of objects.

Plus the fact that loading a pickle can execute arbitrary code, and there
is no way to know in advance whether any malicious code is in there,
because the contents of the pickle cannot be easily inspected without
loading/executing it.

> So, the problems of pickle are not specific to pickle, but rather
> intrinsic to any generic persistence code [*]. Writing persistence code
> that does not fall into these problems is very costly in terms of
> developer time and makes it harder to add new methods or improve
> existing ones. I am not excited about it.

My "text-based serialization" suggestion was nowhere near as ambitious as
that, as I have already explained, and was not aiming at solving the
versioning issues, but rather at having something "about as good" as
pickle but in a human-readable format. I am not asking for a
Turing-complete language to reproduce the prediction function, but rather
something simple in the spirit of the output produced by the gist code I
linked above, just for the model families where it is reasonable:

https://gist.github.com/jlopezpena/2cdd09c56afda5964990d5cf278bfd31

The code I posted mostly works (specific cases of nested models need to
be addressed separately, as well as pipelines), and we have been using (a
version of) it in production for quite some time. But there are hackish
aspects to it that we are not happy with, such as the manual separation
of init and fitted parameters by checking whether the name ends with "_",
having to infer the class name and location using
"model.__class__.__name__" and "model.__module__", and the wacky use of
"__import__".

My suggestion was more along the lines of adding some metadata to sklearn
estimators so that code in a similar style would be nicer to write;
little things like `init_parameters` and `fit_parameters` properties that
would return the lists of named parameters, or a `model_info` method that
would return data like the sklearn version, class name and location, or a
package-level dictionary pointing at the estimator classes by a string
name, like

    from sklearn.linear_model import LogisticRegression
    estimator_classes = {"LogisticRegression": LogisticRegression, ...}

so that one can load the appropriate class from the string description
without calling __import__ or eval; that sort of stuff.

I am aware this would not address the common complaint about "perfect
prediction reproducibility" across versions, but I think we can all agree
that this utopia of perfect reproducibility is not feasible.

And in the long, long run, I agree that PFA/ONNX, or whichever similar
format emerges, is the way to go.

J
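For concreteness, here is a minimal sketch of the hackish approach
described above (the helper names are mine, not sklearn API): init
parameters come from get_params(), fitted attributes are the instance
attributes ending in "_" by sklearn convention, and the class is located
via __module__ and __class__.__name__:

    from importlib import import_module

    def describe_estimator(model):
        """Split a fitted estimator into metadata, init and fitted params."""
        return {
            # class name and location, inferred the "wacky" way
            "class": model.__class__.__name__,
            "module": model.__module__,
            # init parameters, as passed to the constructor
            "init_params": model.get_params(deep=False),
            # fitted parameters: by sklearn convention they end with "_"
            "fit_params": {
                name: value
                for name, value in vars(model).items()
                if name.endswith("_") and not name.startswith("_")
            },
        }

    def rebuild_estimator(spec):
        """Reconstruct the estimator; import_module instead of __import__."""
        cls = getattr(import_module(spec["module"]), spec["class"])
        model = cls(**spec["init_params"])
        for name, value in spec["fit_params"].items():
            setattr(model, name, value)
        return model

Arrays in fit_params would still need converting (e.g. via .tolist())
before JSON dumping, and, as noted above, nested models and pipelines
require separate handling.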
Re: [scikit-learn] [ANN] Scikit-learn 0.20.0
The ONNX approach sounds most promising, especially because it will also
allow library interoperability, but I wonder whether it is for parametric
models only, and not for nonparametric ones like KNN, tree-based
classifiers, etc.

All in all, I can definitely see the appeal of having a way to export
sklearn estimators in a text-based format (e.g., via JSON), since it
would make sharing models easier. This does not even have to be
compatible across multiple sklearn versions. A typical use case would be
to include these JSON exports as, e.g., supplemental files of a research
paper for other people to run the models (here, one can just specify
which sklearn version it requires; of course, one could also share pickle
files, but I am personally always hesitant about running/trusting other
people's pickle files).

Unfortunately though, as Gael pointed out, this "feature" would be a huge
burden for the devs, and it would probably also negatively impact the
development of scikit-learn itself because it imposes another design
constraint.

However, I do think this sounds like an excellent case for a contrib
project, like scikit-export, scikit-serialize or something like that.

Best,
Sebastian

> On Oct 3, 2018, at 5:49 AM, Javier López wrote:
> [...]
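As a toy illustration of the JSON-supplement use case, a sketch building
on the hypothetical describe_estimator helper above (not sklearn API):

    import json

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])
    model = LogisticRegression(solver="lbfgs").fit(X, y)

    spec = describe_estimator(model)  # from the sketch above
    # numpy arrays must be converted before JSON serialization
    spec["fit_params"] = {
        k: v.tolist() if isinstance(v, np.ndarray) else v
        for k, v in spec["fit_params"].items()
    }
    print(json.dumps(spec, indent=2, default=str))

The resulting file is human-readable and diffable, which is exactly what
pickle does not give you; pinning the sklearn version in the metadata is
still needed, since fitted attribute layouts can change between releases.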
Re: [scikit-learn] [ANN] Scikit-learn 0.20.0
For ONNX you may be interested in https://github.com/onnx/onnxmltools,
which already supports conversion of a few sklearn models to ONNX.
However, as far as I am aware, none of the ONNX backends actually
supports the ONNX-ML extended spec (in open source, at least), so I think
you would not be able to actually do prediction.

As for PFA, to my current knowledge there is no library that does it yet.
Our own Aardpfark project (https://github.com/CODAIT/aardpfark) focuses
on SparkML export to PFA for now, but we would like to add sklearn
support in the future.

On Wed, 3 Oct 2018 at 20:07 Sebastian Raschka wrote:
> [...]
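A sketch of the onnxmltools conversion mentioned above, based on that
project's README at the time; the exact import paths and signatures are
an assumption and may differ between versions:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    from onnxmltools import convert_sklearn
    from onnxmltools.convert.common.data_types import FloatTensorType
    from onnxmltools.utils import save_model

    X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]])
    y = np.array([0, 0, 1, 1])
    model = LogisticRegression(solver="lbfgs").fit(X, y)

    # Declare the input signature: a float tensor with 2 features
    initial_type = [("input", FloatTensorType([1, 2]))]
    onnx_model = convert_sklearn(model, initial_types=initial_type)
    save_model(onnx_model, "logreg.onnx")

Running the saved model still requires a backend that implements the
ONNX-ML operators, which, as noted above, was the missing piece in open
source at the time.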