Re: [scikit-learn] baggingClassifier with pipeline

2019-06-28 Thread Manuel CASTEJÓN LIMAS via scikit-learn
You can always add a first step that turns your numpy array into a DataFrame
such as the one required afterwards.
A bit of object-oriented programming might be required, though: derive your
class from BaseEstimator and TransformerMixin and write your particular code
for the fit and transform methods.
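A minimal sketch of such a first step could look like this (the column names
are just an assumption; adapt them to whatever the later steps expect):

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class ArrayToDataFrame(BaseEstimator, TransformerMixin):
    """Turn the numpy array handed over by BaggingClassifier back into a DataFrame."""
    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return pd.DataFrame(X, columns=self.columns)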
Alternatively you can try the PipeGraph library for dealing with those
complex routes.
Best
Manuel
Disclaimer: yes, I'm a coauthor of the PipeGraph library.

On Fri., Jun 28, 2019, 7:28 AM, Roxana Danger wrote:

> Hello,
> I would like to use the BaggingClassifier whose base estimator is a
> pipeline with multiple transformations including a DataFrameMapper from
> sklearn_pandas.
> I am getting an error while fitting the DataFrameMapper, because the first
> step of the BaggingClassifier is to convert the DataFrame to an array (see
> the BaseBagging._fit method). A similar problem happens when using
> sklearn.Pipeline directly instead of the DataFrameMapper. In both cases, a
> DataFrame is expected as input, but an array is provided to the Pipeline instead.
>
> Is there anyway I can overcome this problem?
>
> Many thanks,
> Roxana
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Pipegraph example: KMeans + LDA

2018-10-29 Thread Manuel Castejón Limas
Long story short: thank you for your time and sorry for the inaccuracies; a
few words selling a modular approach to your developments; and a request for
your opinion on parallelizing PipeGraph using dask.

Thank you Andreas for your patience showing me the sklearn ways. I admit
that I'm still learning scikit-learn's capabilities, which is a tough task as
you all keep improving the library, as in this new release. Keep up the good
work with your developments and your teaching to the community. In
particular, I learned A LOT from your answer. Big thanks!

I'm inlining my comments:

()

> - KMeans is not a transformer but an estimator
>
> KMeans is a transformer in sklearn:
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.transform
>
> (you can't get the labels to be the output which is what you're doing
> here, but it is a transformer)
>
My bad! I saw the predict method and did not check the source code. It is
true; from the code:
class KMeans(BaseEstimator, ClusterMixin, TransformerMixin):
The point was that, as you guessed, one cannot put a KMeans followed by an
LDA in a pipeline just like that, without additional effort.
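For the archive, a tiny illustration of Andreas' point: KMeans.transform
outputs the distances to the cluster centres, so it can sit inside a Pipeline
(the dataset and classifier below are just placeholders):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X, y = load_iris(return_X_y=True)
pipe = Pipeline([('kmeans', KMeans(n_clusters=3, n_init=10, random_state=0)),
                 ('clf', LogisticRegression(max_iter=1000))])
pipe.fit(X, y)  # the classifier sees the 3 distance features, not the cluster labels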


> - LDA score function requires the y parameter, while its input does not
> come from a known set of labels, but from the previous KMeans
> - Moreover, the GridSearchCV.fit call would also require a 'y' parameter
>
> Not true if you provide a scoring that doesn't require y or if you don't
> specify scoring and the scoring method of the estimator doesn't require y.
>
> GridSearchCV.fit doesn't require y.
>
My bad again. I meant that without the scoring function and the CV iterator
that you use below, GridSearchCV will call the scoring function of the final
step, i.e. LDA, and LDA's scoring function wants a y. But please, bear with
me, I simply did not know the proper hacks. The test does not lie; I'm quite
a newbie then :-)

Can you provide a citation for that? That seems to heavily depend on the
> clustering algorithms and the classifier.
> To me, stability scoring seems more natural:
> https://arxiv.org/abs/1007.1075
>
Good to know, thank you for the reference. You are right about the
dependence; it's all about the nature of the clustering and the classifier,
but I was just providing a scenario, not necessarily advocating for this
strategy as the solution to the number-of-clusters question.

It's cool that this is possible, but I feel this is still not really a
> "killer application" in that this is not a very common pattern.
>
IMHO, the beauty of the example, if there is any :-D,  was the simplicity
and brevity. I agree that it is not a killer application, just a possible
situation.

> Though I acknowledge that your code only takes 4 lines, while mine takes 8
> (though if we'd add NoSplitCV to sklearn mine would also only take 4 lines
> :P)
>
I think pipegraph is cool, not meaning to give you a hard time ;)
>

Thank you again for your time. The thing is that I believe PipeGraph can be
useful for you as a modular way of approaching your models. I'm going to
work on a second example implementing something similar to the
VotingClassifier class to show you the approach.
The main weakness is the lack of parallelism in the inner workings of
PipeGraph, which was never a concern for me: as long as GridSearchCV can
parallelize the training, I was OK with that grain size. But now I reckon
that parallelization could be useful for you, in terms of approaching your
models as a PipeGraph and getting parallelization for free without having to
call joblib directly (thank you, joblib authors, for such a goodie).
I guess that providing a dask backend for PipeGraph would be nice. But let
me continue with this issue after sending the VotingClassifier example :-)

Thanks, truly, I need to study hard!
Manuel
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Pipegraph example: KMeans + LDA

2018-10-24 Thread Manuel Castejón Limas
Dear all,
as a way of improving the documentation of PipeGraph we intend to provide
more examples of its usage. It was a popular demand to show application
cases that motivate its usage, so here is a very simple case with two
steps: a KMeans followed by an LDA.

https://mcasl.github.io/PipeGraph/auto_examples/plot_Finding_Number_of_clusters.html#sphx-glr-auto-examples-plot-finding-number-of-clusters-py

This short example points out the following challenges:
- KMeans is not a transformer but an estimator
- LDA score function requires the y parameter, while its input does not
come from a known set of labels, but from the previous KMeans
- Moreover, the GridSearchCV.fit call would also require a 'y' parameter
- It would be nice to have access to the output of the KMeans step as well.

PipeGraph is capable of addressing these challenges.

The rationale for this example lies in the identification-reconstruction
realm. In a scenario where the class labels are unknown, we might want to
associate the quality of the clustering structure with the capability of a
later model to reconstruct this structure. So the basic idea here is that if
LDA is capable of getting good results, it is because the information from
the KMeans was good enough for that purpose, hinting at the discovery of a
good structure.
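A minimal sketch of that idea in plain scikit-learn (not the PipeGraph
version; the dataset and the candidate numbers of clusters are just
assumptions):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X, _ = load_iris(return_X_y=True)
for n_clusters in (2, 3, 4, 5):
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    # If LDA reconstructs the KMeans labels well, that hints at a good
    # structure for this number of clusters.
    score = cross_val_score(LinearDiscriminantAnalysis(), X, labels, cv=5).mean()
    print(n_clusters, round(score, 3))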
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] scikit-learn Digest, Vol 30, Issue 25

2018-10-08 Thread Manuel CASTEJÓN LIMAS via scikit-learn
Good to know!
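For the archive, a rough sketch of the nesting Joel describes below, written
with sklearn.ensemble.StackingClassifier as it exists today (an assumption:
that class was not yet released at the time of this thread):

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# First stacked layer.
layer1 = StackingClassifier(
    estimators=[('svc', SVC(probability=True)),
                ('rf', RandomForestClassifier())],
    final_estimator=LogisticRegression())

# Second layer: the first stack is nested as one of its base estimators.
layer2 = StackingClassifier(
    estimators=[('stack1', layer1),
                ('rf2', RandomForestClassifier())],
    final_estimator=LogisticRegression())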

On Mon., Oct 8, 2018, 9:08 AM, Joel Nothman wrote:

> Just a note that multiple layers of stacking can be achieved with
> StackingClassifier using nesting.
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] scikit-learn Digest, Vol 30, Issue 25

2018-10-02 Thread Manuel CASTEJÓN LIMAS via scikit-learn
I would propose PipeGraph for stacking; it comes naturally and it could help
a lot in making things easier for core developers.

Disclaimer: I'm coauthor of PipeGraph


Manuel Castejón Limas

Escuela de Ingenierías Industrial, Informática y Aeroespacial

Universidad de León

Campus de Vegazana sn.

24071. León. Spain.

e-mail: manuel.caste...@unileon.es

Tel.: +34 987 291 779



Aviso de confidencialidad <https://www.unileon.es/mail-disclaimer/20180525>

Confidentiality Notice <https://www.unileon.es/mail-disclaimer/20180525>




On Tue., Oct 2, 2018 at 3:13 AM, Jason Sanchez (<2jasonsanc...@gmail.com>) wrote:

> The current roadmap is amazing. One feature that would be exciting is
> better support for multilayer stacking with caching and the ability to add
> models to already trained layers.
>
> I saw this history: https://github.com/scikit-learn/scikit-learn/pull/8960
>
> This library is very close:
> * API is somewhat awkward, but otherwise good. Does not cache intermediate
> steps. https://wolpert.readthedocs.io/en/latest/index.html
>
> These solutions seem to allow only two layers:
> *
> https://github.com/scikit-learn/scikit-learn/issues/4816#issuecomment-217817717
> *
> https://rasbt.github.io/mlxtend/user_guide/classifier/StackingClassifier/
> * https://github.com/scikit-learn/scikit-learn/pull/6674
>
> The people who put these other libraries together have made an incredibly
> welcome effort to solve a real need and it would be amazing to see a payoff
> for their effort in the form of an addition of stacking to scikit-learn's
> core library.
>
> As another data point, I attached a simple implementation I put together
> to illustrate what I think are core needs of this feature. Feel free to
> browse the code. Here is the short list:
> * Infinite layers (or at least 3 ;) )
> * Choice of CV or OOB for each model
> * Ability to add a new model to a layer after the stacked ensemble has
> been trained and refit the pipeline such that only models that must be
> retrained are retrained (i.e. train the added model and retrain all models
> in higher layers)
> * All standard scikit-learn pipeline goodness (introspection, grid search,
> serializability, etc)
>
> Thanks all! This library is making a real difference for good in the lives
> of many people.
>
> Jason
>
>
> On Fri, Sep 28, 2018 at 11:35 AM  wrote:
>
>> Send scikit-learn mailing list submissions to
>> scikit-learn@python.org
>>
>> To subscribe or unsubscribe via the World Wide Web, visit
>> https://mail.python.org/mailman/listinfo/scikit-learn
>> or, via email, send a message with subject or body 'help' to
>> scikit-learn-requ...@python.org
>>
>> You can reach the person managing the list at
>> scikit-learn-ow...@python.org
>>
>> When replying, please edit your Subject line so it is more specific
>> than "Re: Contents of scikit-learn digest..."
>>
>>
>> Today's Topics:
>>
>>1. Re: [ANN] Scikit-learn 0.20.0 (Sebastian Raschka)
>>2. Re: [ANN] Scikit-learn 0.20.0 (Andreas Mueller)
>>3. Re: [ANN] Scikit-learn 0.20.0 (Andreas Mueller)
>>4. Re: [ANN] Scikit-learn 0.20.0 (Manuel CASTEJÓN LIMAS)
>>
>>
>> --
>>
>> Message: 1
>> Date: Fri, 28 Sep 2018 11:10:50 -0500
>> From: Sebastian Raschka 
>> To: Scikit-learn mailing list 
>> Subject: Re: [scikit-learn] [ANN] Scikit-learn 0.20.0
>> Message-ID:
>> 
>> Content-Type: text/plain;   charset=us-ascii
>>
>> >
>> > > I think model serialization should be a priority.
>> >
>> > There is also the ONNX specification that is gaining industrial
>> adoption and that already includes open source exporters for several
>> families of scikit-learn models:
>> >
>> > https://github.com/onnx/onnxmltools
>>
>>
>> Didn't know about that. This is really nice! What do you think about
>> referring to it under
>> http://scikit-learn.org/stable/modules/model_persistence.html to make
>> people aware that this option exists?
>> Would be happy to add a PR.
>>
>> Best,
>> Sebastian
>>
>>
>>
>> > On Sep 28, 2018, at 9:30 AM, Olivier Grisel 
>> wrote:
>> >
>> >
>> > > I think model serialization should be a priority.
>> >
>> > There is also the ONNX specification that is gaining industrial
>> adoption and that already includes open source exporters for several
>> families of scikit-learn models:
>> >

Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-28 Thread Manuel CASTEJÓN LIMAS via scikit-learn
How about a Docker-based approach? Just thinking out loud.
Best
Manuel
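For the archive, a minimal sketch of the ONNX export route mentioned in the
quoted discussion below, using the skl2onnx package (an assumption: the
thread points to onnxmltools, which bundles the same scikit-learn converters):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

# Declare the input signature and convert the fitted model to an ONNX graph.
onnx_model = convert_sklearn(clf, initial_types=[('input', FloatTensorType([None, 4]))])
with open("logreg_iris.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())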

On Fri., Sep 28, 2018, 7:43 PM, Andreas Mueller wrote:

>
>
> On 09/28/2018 01:38 PM, Andreas Mueller wrote:
> >
> >
> > On 09/28/2018 12:10 PM, Sebastian Raschka wrote:
>  I think model serialization should be a priority.
> >>> There is also the ONNX specification that is gaining industrial
> >>> adoption and that already includes open source exporters for several
> >>> families of scikit-learn models:
> >>>
> >>> https://github.com/onnx/onnxmltools
> >>
> >> Didn't know about that. This is really nice! What do you think about
> >> referring to it under
> >> http://scikit-learn.org/stable/modules/model_persistence.html to make
> >> people aware that this option exists?
> >> Would be happy to add a PR.
> >>
> >>
> > I don't think an open source runtime has been announced yet (or they
> > didn't email me like they promised lol).
> > I'm quite excited about this as well.
> >
> > Javier:
> > The problem is not so much storing the "model" but storing how to make
> > predictions. Different versions could act differently
> > on the same data structure - and the data structure could change. Both
> > happen in scikit-learn.
> > So if you want to make sure the right thing happens across versions,
> > you either need to provide serialization and deserialization for
> > every version and conversion between those or you need to provide a
> > way to store the prediction function,
> > which basically means you need a turing-complete language (that's what
> > ONNX does).
> >
> > We basically said doing the first is not feasible within scikit-learn
> > given our current amount of resources, and no-one
> > has even tried doing it outside of scikit-learn (which would be
> > possible).
> > Implementing a complete prediction serialization language (the second
> > option) is definitely outside the scope of sklearn.
> >
> >
> Maybe we should add to the FAQ why serialization is hard?
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] [ANN] Scikit-learn 0.20.0

2018-09-28 Thread Manuel CASTEJÓN LIMAS via scikit-learn
Huge huge Thank you developers!
Keep up the good work!

On Wed., Sep 26, 2018, 8:57 PM, Andreas Mueller wrote:

> Hey everybody!
> I'm happy to (finally) announce scikit-learn 0.20.0.
> This release is dedicated to the memory of Raghav Rajagopalan.
>
> You can upgrade now with pip or conda!
>
> There are many important additions and updates, and you can find the full
> release notes here:
> http://scikit-learn.org/stable/whats_new.html#version-0-20
>
> My personal highlights are the ColumnTransformer and the changes to
> OneHotEncoder,
> but there's so much more!
>
> An important note is that this is the last version to support Python2.7,
> and the
> next release will require Python 3.5.
>
> A big thank you to everybody who contributed and special thanks to Joel!
>
> All the best,
> Andy
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] CircleCI

2018-09-10 Thread Manuel Castejón Limas
Thank you for your feedback Guillaume!

I'm still fighting with the CircleCI configuration, but it seems it will be
possible to test the Windows versions as well, given that Windows containers
are available nowadays.
I'll give it a try and then I will share the experience.
From what I've seen in the scikit-learn build_tools scripts, there is room
for speeding things up by baking the miniconda installation into the Docker
image instead of redoing it on every build job.

Best
Manuel


On Thu., Sep 6, 2018 at 2:24 PM, Guillaume Lemaître (<g.lemaitr...@gmail.com>) wrote:

> Hi Manuel,
>
> Basically, you are free to take any initiative with your CIs as long as it
> is cross-platform tested. Using the different CI services available allows
> speeding up the testing. In scikit-learn, we use Travis for Linux
> checking, Appveyor for Windows, and CircleCI for building the
> documentation. You could use a single CI service for all of those.
> However, I am not sure that you have Windows support apart from
> Appveyor.
>
> I think that we should update the template of the scikit-learn-contrib
> with the new template for circle ci 2.
>
> Cheers,
> On Thu, 6 Sep 2018 at 13:16, Manuel CASTEJÓN LIMAS via scikit-learn
>  wrote:
> >
> > Dear all,
> > Contrib projects template hints the authors to use TravisCI, CircleCI
> and Appveyor. Now that CircleCI has moved to version 2, is there any idea
> on what to do about it? Will the template be updated? Is it ok if we use
> only CircleCI?
> > What do you, core devs, suggest about that?
> > Best wishes
> > Manuel
> > ___
> > scikit-learn mailing list
> > scikit-learn@python.org
> > https://mail.python.org/mailman/listinfo/scikit-learn
>
>
>
> --
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] CircleCI

2018-09-06 Thread Manuel CASTEJÓN LIMAS via scikit-learn
Dear all,
Contrib projects template hints the authors to use TravisCI, CircleCI and
Appveyor. Now that CircleCI has moved to version 2, is there any idea on
what to do about it? Will the template be updated? Is it ok if we use only
CircleCI?
What do you, core devs, suggest about that?
Best wishes
Manuel
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Delegating "get_params" and "set_params" to a wrapped estimator when parameter is not defined.

2018-05-18 Thread Manuel CASTEJÓN LIMAS via scikit-learn
Dear Joel,

I've changed the code of PipeGraph to replace the old wrappers with new
mixin classes. The changes are reflected in this MixinClasses branch:

https://github.com/mcasl/PipeGraph/blob/feature/MixinClasses/pipegraph/adapters.py

My conclusion is that although both approaches are feasible and provide
similar functionality, mixin classes provide a simpler solution. Following
the 'flat is better than nested' principle, the mixin classes should be
favoured.
This approach also seems more in line with general sklearn development
practice, so I'll make the necessary changes to the docs and then the master
branch will be replaced with this new mixin-classes version.

Thanks for pointing out this issue!
Best
Manuel

2018-04-16 14:21 GMT+02:00 Manuel CASTEJÓN LIMAS :

> Nope! Mostly because of lack of experience with mixins.
> I've done some reading and I think I can come up with a few mixins doing
> the trick by dynamically adding their methods to an already instantiated
> object. I'll play with that and I hope to show you something soon! Or at
> least I will have better grounds to make an educated decision.
> Best
> Manuel
>
>
>
>
> Manuel Castejón Limas
> *Escuela de Ingeniería Industrial e Informática*
> Universidad de León
> Campus de Vegazana sn.
> 24071. León. Spain.
> *e-mail: *manuel.caste...@unileon.es
> *Tel.*: +34 987 291 946
>
> Digital Business Card: Click Here <http://qrs.ly/5c3jpaj>
>
>
>
> 2018-04-15 15:18 GMT+02:00 Joel Nothman :
>
>> Have you considered whether a mixin is a better model than a wrapper?​
>>
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Delegating "get_params" and "set_params" to a wrapped estimator when parameter is not defined.

2018-04-16 Thread Manuel CASTEJÓN LIMAS via scikit-learn
Nope! Mostly because of lack of experience with mixins.
I've done some reading and I think I can come up with a few mixins doing
the trick by dynamically adding their methods to an already instantiated
object. I'll play with that and I hope to show you something soon! Or at
least I will have better grounds to make an educated decision.
Best
Manuel




Manuel Castejón Limas
*Escuela de Ingeniería Industrial e Informática*
Universidad de León
Campus de Vegazana sn.
24071. León. Spain.
*e-mail: *manuel.caste...@unileon.es
*Tel.*: +34 987 291 946

Digital Business Card: Click Here <http://qrs.ly/5c3jpaj>



2018-04-15 15:18 GMT+02:00 Joel Nothman :

> Have you considered whether a mixin is a better model than a wrapper?​
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Delegating "get_params" and "set_params" to a wrapped estimator when parameter is not defined.

2018-04-14 Thread Manuel CASTEJÓN LIMAS via scikit-learn
Hi Javier!
You can have a look at:

https://github.com/mcasl/PipeGraph/blob/master/pipegraph/adapters.py

There are a few adapters there and I had to deal with that situation. I
solved it by using __getattr__ and __setattr__.
Best
Manolo
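A minimal sketch of that delegation idea (not the actual PipeGraph adapter
code): any attribute the wrapper does not define itself is forwarded to the
wrapped estimator.

from sklearn.base import BaseEstimator

class EstimatorWrapper(BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None, **fit_params):
        self.estimator.fit(X, y, **fit_params)
        return self

    def __getattr__(self, name):
        # Only called when normal lookup fails, so the wrapper's own attributes
        # win and everything else is delegated to the wrapped estimator.
        if name == 'estimator':
            raise AttributeError(name)
        return getattr(self.estimator, name)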

On Fri., Apr 13, 2018, 5:53 PM, Javier López wrote:

> I have a class `FancyEstimator(BaseEstimator, MetaEstimatorMixin): ...`
> that wraps
> around an arbitrary sklearn estimator to add some functionality I am
> interested about.
> This class contains an attribute `self.estimator` that contains the
> wrapped estimator.
> Delegation of the main methods, such as `fit`, `transform` works just
> fine, but I am
> having some issues with `get_params` and `set_params`.
>
> The main idea is, I would like to use my wrapped class as a drop-in
> replacement for
> the original estimator, but this raises some issues with some functions
> that try using the `get_params` and `set_params` straight in my class, as
> the original
> parameters now have prefixed names (for instance `estimator__verbose`
> instead of `verbose`)
> I would like to delegate calls of set_params and get_params in a smart way
> so that if a
> parameter is unknown for my wrapper class, then it automatically goes
> looking for it in
> the wrapped estimator.
>
>  I am not concerned about my class parameter names as there are only a
> couple of very
> specific names on it, so it is safe to assume that any unknown parameter
> name should
> refer to the base estimator. Is there an easy way of doing that?
>
> Cheers,
> J
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Pipegraph feedback

2018-04-13 Thread Manuel Castejón Limas
Hi all!

As you know by now :-) we submitted PipeGraph as a contrib-project proposal.
We believe that this tool can be interesting not only for end users wanting
to encapsulate their arbitrarily complex workflows but also for sklearn
developers as some internal developments could be easily expressed as
pipegraphs, for example, ensemble methods.


We would love to have some feedback in terms of:
- whether we would have to change anything in order to be more in line with
sklearn's philosophy
- any development you core developers are working on that could be treated
as a pipegraph
- possible scenarios not implemented yet by pipegraph, such as recurrent
graphs, that might be potentially useful.

Moreover, in case any core developer is interested in joining the project
you are more than welcome! This would provide a great opportunity for
collaboration!

Best wishes
Manuel Castejón-Limas
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] PipeGraph users guide

2018-03-17 Thread Manuel Castejón Limas
Dear all,
we have written a user's guide to PipeGraph to help interested readers
better understand how it works.

While we improve the rst export (the figures are missing), the best version
is the original Jupyter notebook:

https://github.com/mcasl/PipeGraph/blob/master/doc/User_Guide.ipynb

Best
Manolo
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] New Transformer

2018-02-28 Thread Manuel Castejón Limas
Dear David,

We recently submitted PipeGraph as a sklearn contrib project. Even though
it is an ongoing project and we are right now modifying the interface to
make it more suitable and useful for the sklearn community, I believe that
the problems you describe can be addressed by PipeGraph.
If you need to define different (or equal) transformations for X and y, you
can do it by simply defining different steps for each path; if you need
different paths for fit and predict, it is also possible to define them in
PipeGraph.
Please have a look at the general examples and judge for yourself whether it
fits your needs:

https://mcasl.github.io/PipeGraph/auto_examples/plot_4_example_combination_of_classifiers.html#sphx-glr-auto-examples-plot-4-example-combination-of-classifiers-py

You can play with it using pip, for example:

pip install pipegraph

The API is still far from stable and we are following the advice of the
sklearn community to turn it into something as useful as possible, but in my
humble opinion, in situations like this PipeGraph can provide a suitable
solution.
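For reference, a rough sketch of the FunctionSampler route that Guillaume
suggests in the quoted message below (the shape convention, window length
and step are just assumptions):

import numpy as np
from imblearn import FunctionSampler

def segment(X, y, width=100, step=50):
    # Cut each series (one row of X) into overlapping windows, replicating its label.
    Xs, ys = [], []
    for start in range(0, X.shape[1] - width + 1, step):
        Xs.append(X[:, start:start + width])
        ys.append(y)
    return np.vstack(Xs), np.concatenate(ys)

sampler = FunctionSampler(func=segment)
X_demo = np.random.rand(10, 500)   # 10 series of length 500
y_demo = np.repeat([0, 1], 5)      # one label per series
X_new, y_new = sampler.fit_resample(X_demo, y_demo)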
Best
Manolo





2018-02-27 19:42 GMT+01:00 Guillaume Lemaître :

> Transforming y is a big deal :)
> You can refer to https://github.com/scikit-learn/enhancement_proposals/pull/2
> and the associated issues/PR to see what is going on. This is probably an
> additional use case to think about when designing estimator which will be
> modifying y.
>
> Regarding the pipeline, I assume that your strategy would be to resample
> at fit
> and do nothing at predict, isn't it?
>
> NB: you could actually implement this sampling in a FunctionSampler of
> imblearn:
> http://contrib.scikit-learn.org/imbalanced-learn/dev/generated/imblearn.FunctionSampler.html#imblearn.FunctionSampler
> and then use the imblearn pipeline which would apply the transform at fit
> time but not
> at predict.
>
> On 27 February 2018 at 18:02, David Burns 
> wrote:
>
>> First post on this mailing list.
>>
>> I have been working with time series data for a project, and thought I
>> could contribute a new transformer to segment time series data using a
>> sliding window, with variable overlap. I have attached demonstration of how
>> this would fit in the existing framework. The only challenge for me here is
>> that the transformer needs to transform both the X and y variable in order
>> to perform the segmentation. I am not sure from the documentation how to
>> implement this in the framework.
>>
>> Overlapping segments is a great way to boost performance for time series
>> classifiers, so this may be a worthwhile contribution for some in this area
>> of ML. Ultimately, model_selection.TimeSeries.Split would need to be
>> modified to support overlapping segments, or a new class created to enable
>> validation for this.
>>
>> Please let me know if this would be a worthwhile contribution, and if so
>> how to go about transforming the target vector y in the framework /
>> pipeline?
>>
>> Thanks!
>>
>> David Burns
>>
>>
>>
>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
>
> --
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] PipeGraph: First examples and documentation

2018-02-16 Thread Manuel Castejón Limas
Dear all,
We have produced some documentation for the PipeGraph module. Essentially
it consists of the API for the two main interfaces: PipeGraphRegressor and
PipeGraphClassifier.

I guess that at this point the best experience comes from reading the
examples and looking at the diagrams.

These examples are more suggestive than exhaustive, though. Our purpose is
to present the project in this initial form in order to hear all your
comments, so we can make it as useful for you all as possible.

These are the links:
- The documentation:
https://mcasl.github.io/PipeGraph/auto_examples/index.html

- The module sources: https://mcasl.github.io/PipeGraph/


Best
Manuel
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Pipegraph is on its way!

2018-02-12 Thread Manuel Castejón Limas
While we keep working on the docs and figures, here is a little example you
all can already run:

import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from pipegraph.pipeGraph import PipeGraphClassifier, Concatenator
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

iris = load_iris()
X = iris.data
y = iris.target

scaler = MinMaxScaler()
gaussian_nb = GaussianNB()
svc = SVC()
mlp = MLPClassifier()
concatenator = Concatenator()

steps = [('scaler', scaler),
         ('gaussian_nb', gaussian_nb),
         ('svc', svc),
         ('concat', concatenator),
         ('mlp', mlp)]

connections = {'scaler': {'X': 'X'},
               'gaussian_nb': {'X': ('scaler', 'predict'),
                               'y': 'y'},
               'svc': {'X': ('scaler', 'predict'),
                       'y': 'y'},
               'concat': {'X1': ('scaler', 'predict'),
                          'X2': ('gaussian_nb', 'predict'),
                          'X3': ('svc', 'predict')},
               'mlp': {'X': ('concat', 'predict'),
                       'y': 'y'}
               }

param_grid = {'svc__C': [0.1, 0.5, 1.0],
              'mlp__hidden_layer_sizes': [(3,), (6,), (9,)],
              'mlp__max_iter': [5000, 1]}

pgraph = PipeGraphClassifier(steps=steps, connections=connections)
grid_search_classifier = GridSearchCV(estimator=pgraph,
                                      param_grid=param_grid, refit=True)
grid_search_classifier.fit(X, y)
y_pred = grid_search_classifier.predict(X)
grid_search_classifier.best_estimator_.get_params()

---
'predict' is the default output name. One of these days we will simplify
the notation to just the node name when the output name is the default.

Best wishes
Manuel

2018-02-07 23:29 GMT+01:00 Andreas Mueller :

> Thanks Manuel, that looks pretty cool.
> Do you have a write-up about it? I don't entirely understand the
> connections setup.
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] PipeGraph examples: areas of interest

2018-02-10 Thread Manuel Castejón Limas
Hi all!
The good news is that we made GridSearchCV work on PipeGraph!

In order to create diverse examples, we welcome some feedback on which
other libraries you use to acquire/process data before applying
scikit-learn.

For example: 'I work in computer vision and I usually get image features
using the Python bindings provided by OpenCV.'

This will help us provide interesting examples.

Best
Manuel
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Pipegraph is on its way!

2018-02-08 Thread Manuel Castejón Limas
Docs are coming soon. In the meantime, imagine a first step containing a
TrainTestSplit class with a behaviour similar to train_test_split but
capable of producing results by using fit and predict (this is a goodie).
The inputs will be X, y, z, ..., and the outputs the same names + _train
and _test.

A second step could be a MinMaxScaler taking only X_train.

A third step a linear model using the output from MinMaxScaler as X.

This would be written:
connections['split'] =  {'A': 'X', 'B': 'y'}
Meaning that the 'split' step will use the X and y from the fit or predict
call, calling them A and B internally.

If you use, for instance,

 my_pipegraph.fit(X=myX, y=myY)

This step will produce A_train with a piece of myX

You can use this later:
connections['scaler'] = {'X': ('split', 'A_train')}
Expressing that the output A_train from the split step will be used as input
X for the scaler. The output from this step is called 'predict'.

Finally, for the third step:
connections['linear_model'] = {'X': ('scaler', 'predict'),
                               'y': ('split', 'B_train')}

Notice that if we are talking about an external input variable we don't
use a tuple.
So the syntax is something like connection[step_label] =
{internal_variable: (input_step, variable_there)}
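Putting the three entries of this walk-through together, the whole
connections dictionary would look like this (same step names as above):

connections = {
    'split':        {'A': 'X', 'B': 'y'},
    'scaler':       {'X': ('split', 'A_train')},
    'linear_model': {'X': ('scaler', 'predict'),
                     'y': ('split', 'B_train')},
}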

Docs are coming anyway. Travis CI, CircleCI and Appveyor have been
successfully activated at GitHub.com/mcasl/PipeGraph

Sorry if you find typos; I'm using my smartphone to reply.
Best
Manuel


On Feb 7, 2018, 11:32 PM, "Andreas Mueller" wrote:

> Thanks Manuel, that looks pretty cool.
> Do you have a write-up about it? I don't entirely understand the
> connections setup.
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] clustering on big dataset

2018-02-07 Thread Manuel Castejón Limas
Hope this helps!

Manuel


@Article{Ciampi2008,
  author   = "Ciampi, Antonio and Lechevallier, Yves and Limas, Manuel Castej{\'o}n and Marcos, Ana Gonz{\'a}lez",
  title    = "Hierarchical clustering of subpopulations with a dissimilarity based on the likelihood ratio statistic: application to clustering massive data sets",
  journal  = "Pattern Analysis and Applications",
  year     = "2008",
  month    = "Jun",
  day      = "01",
  volume   = "11",
  number   = "2",
  pages    = "199--220",
  abstract = "The problem of clustering subpopulations on the basis of samples is considered within a statistical framework: a distribution for the variables is assumed for each subpopulation and the dissimilarity between any two populations is defined as the likelihood ratio statistic which compares the hypothesis that the two subpopulations differ in the parameter of their distributions to the hypothesis that they do not. A general algorithm for the construction of a hierarchical classification is described which has the important property of not having inversions in the dendrogram. The essential elements of the algorithm are specified for the case of well-known distributions (normal, multinomial and Poisson) and an outline of the general parametric case is also discussed. Several applications are discussed, the main one being a novel approach to dealing with massive data in the context of a two-step approach. After clustering the data in a reasonable number of `bins' by a fast algorithm such as k-Means, we apply a version of our algorithm to the resulting bins. Multivariate normality for the means calculated on each bin is assumed: this is justified by the central limit theorem and the assumption that each bin contains a large number of units, an assumption generally justified when dealing with truly massive data such as currently found in modern data analysis. However, no assumption is made about the data generating distribution.",
  issn     = "1433-755X",
  doi      = "10.1007/s10044-007-0088-4",
  url      = "https://doi.org/10.1007/s10044-007-0088-4"
}
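For the archive, a rough sketch of the two-step idea described in the
abstract, using plain scikit-learn pieces (MiniBatchKMeans plus
AgglomerativeClustering) instead of the likelihood-ratio dissimilarity of
the paper; the numbers of bins and clusters are just assumptions:

from sklearn.cluster import AgglomerativeClustering, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100000, centers=5, random_state=0)

# Step 1: compress the massive data set into a few hundred bins with k-means.
binner = MiniBatchKMeans(n_clusters=300, random_state=0).fit(X)

# Step 2: run hierarchical clustering on the bin centres only.
agg = AgglomerativeClustering(n_clusters=5).fit(binner.cluster_centers_)

# Map every original sample to the cluster of its bin.
labels = agg.labels_[binner.predict(X)]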





2018-01-04 12:55 GMT+01:00 Joel Nothman :

> Can you use nearest neighbors with a KD tree to build a distance matrix
> that is sparse, in that distances to all but the nearest neighbors of a
> point are (near-)infinite? Yes, this again has an additional parameter
> (neighborhood size), just as BIRCH has its threshold. I suspect you will
> not be able to improve on having another, approximating, parameter. You do
> not need to set n_clusters to a fixed value for BIRCH. You only need to
> provide another clusterer, which has its own parameters, although you
> should be able to experiment with different "global clusterers".
>
> On 4 January 2018 at 11:04, Shiheng Duan  wrote:
>
>> Yes, it is an efficient method, still, we need to specify the number of
>> clusters or the threshold. Is there another way to run hierarchy clustering
>> on the big dataset? The main problem is the distance matrix.
>> Thanks.
>>
>> On Tue, Jan 2, 2018 at 6:02 AM, Olivier Grisel 
>> wrote:
>>
>>> Have you had a look at BIRCH?
>>>
>>> http://scikit-learn.org/stable/modules/clustering.html#birch
>>>
>>> --
>>> Olivier
>>> ​
>>>
>>> ___
>>> scikit-learn mailing list
>>> scikit-learn@python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn
>>>
>>>
>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Pipegraph is on its way!

2018-02-07 Thread Manuel Castejón Limas
Dear all,

after some playing with the concept we have developed a module implementing
the functionality of Pipeline in more general contexts, as first introduced
in a former thread
(https://mail.python.org/pipermail/scikit-learn/2018-January/002158.html)

In order to expand the possibilities of Pipeline to non-linearly sequential
workflows, a graph-like structure has been deployed while keeping as much as
possible of the already known syntax we all love and honor:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from pipegraph.pipeGraph import PipeGraph  # import path as in the accompanying examples

X = pd.DataFrame(dict(X=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
y = 2 * X
sc = MinMaxScaler()
lm = LinearRegression()
steps = [('scaler', sc),
         ('linear_model', lm)]
connections = {'scaler': dict(X='X'),
               'linear_model': dict(X=('scaler', 'predict'),
                                    y='y')}
pgraph = PipeGraph(steps=steps,
                   connections=connections,
                   use_for_fit='all',
                   use_for_predict='all')

As you can see, the biggest difference for the final user is the dictionary
describing the connections.

Another major contribution for developers wanting to expand scikit-learn is
a collection of adapters for scikit-learn models that gives them a common
API irrespective of whether they originally implemented predict, transform,
or fit_predict as an atomic operation without predict. These adapters accept
as many positional or keyword parameters in their fit and predict methods as
needed, through *args and **kwargs.
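A minimal sketch of that adapter idea (not PipeGraph's actual adapter code;
the class name is hypothetical):

from sklearn.base import BaseEstimator

class PredictAdapter(BaseEstimator):
    """Expose a single predict-like call for models that implement
    predict, transform or fit_predict."""
    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def predict(self, X):
        for method in ('predict', 'transform', 'fit_predict'):
            if hasattr(self.model, method):
                return getattr(self.model, method)(X)
        raise TypeError("The wrapped model has no usable prediction method.")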

As general as PipeGraph is, it cannot work under the restrictions imposed
by GridSearchCV on the input parameters, namely X and y, since PipeGraph can
accept as many input signals as needed. Thus, an ad hoc GridSearchCV version
is also needed, and we will provide a basic initial version in a later
release.

We need to write the documentation and we will propose it as a
contrib-project in a few days.

Best wishes,
Manuel Castejón-Limas
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2018-01-08 Thread Manuel Castejón Limas
Just a quick ping to share that I've kept playing with this PipeGraph toy.
The following example reflects its current state.

* As you can see, scikit-learn models can be used as steps in the nodes of
the graph just by declaring them so, for example:

'Gaussian_Mixture':
{'step': GaussianMixture,
 'kargs': {'n_components': 3},
 'connections': {'X': ('Concatenate_Xy', 'Xy')},
 'use_for': ['fit'],
 },

* Custom steps need succinct declarations with very little code.

* The graph description is nice to read, in my humble opinion.

* Optional 'fit' and/or 'run' roles.

* TO-DO: using the memory option to cache, and making it compatible with
GridSearchCV. I was too busy playing with template methods in order to
simplify its use.

I have convinced some nice colleagues at my university to team up with me
and write some nice documentation.

Best wishes
Manolo


import pandas as pd
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LinearRegression

# work in progress library: https://github.com/mcasl/PAELLA/
from pipeGraph import (PipeGraph,
   FirstStep,
   LastStep,
   CustomStep)

from paella import Paella

URL = "
https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_percent_noise.csv
"
data = pd.read_csv(URL, usecols=['V1', 'V2'])
X, y = data[['V1']], data[['V2']]

class CustomConcatenationStep(CustomStep):
    def _post_fit(self):
        self.output['Xy'] = pd.concat(self.input, axis=1)


class CustomCombinationStep(CustomStep):
    def _post_fit(self):
        self.output['classification'] = np.where(self.input['dominant'] < 0,
                                                 self.input['dominant'],
                                                 self.input['other'])


class CustomPaellaStep(CustomStep):
    def _pre_fit(self):
        self.sklearn_object = Paella(**self.kargs)

    def _fit(self):
        self.sklearn_object.fit(**self.input)

    def _post_fit(self):
        self.output['prediction'] = self.sklearn_object.transform(self.input['X'],
                                                                   self.input['y'])


graph_description = {
'First':
{'step': FirstStep,
 'connections': {'X': X,
 'y': y},
 'use_for': ['fit', 'run'],
 },

'Concatenate_Xy':
{'step': CustomConcatenationStep,
 'connections': {'df1': ('First', 'X'),
 'df2': ('First', 'y')},
 'use_for': ['fit'],
 },

'Gaussian_Mixture':
{'step': GaussianMixture,
 'kargs': {'n_components': 3},
 'connections': {'X': ('Concatenate_Xy', 'Xy')},
 'use_for': ['fit'],
 },

'Dbscan':
{'step': DBSCAN,
 'kargs': {'eps': 0.05},
 'connections': {'X': ('Concatenate_Xy', 'Xy')},
 'use_for': ['fit'],
 },

'Combine_Clustering':
{'step': CustomCombinationStep,
 'connections': {'dominant': ('Dbscan', 'prediction'),
 'other': ('Gaussian_Mixture', 'prediction')},
 'use_for': ['fit'],
 },

'Paella':
{'step': CustomPaellaStep,
 'kargs': {'noise_label': -1,
   'max_it': 20,
   'regular_size': 400,
   'minimum_size': 100,
   'width_r': 0.99,
   'n_neighbors': 5,
   'power': 30,
   'random_state': None},

 'connections': {'X': ('First', 'X'),
 'y': ('First', 'y'),
 'classification': ('Combine_Clustering',
'classification')},
 'use_for': ['fit'],
 },

'Regressor':
{'step': LinearRegression,
 'kargs': {},
 'connections': {'X': ('First', 'X'),
 'y': ('First', 'y'),
 'sample_weight'

Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2018-01-03 Thread Manuel Castejón Limas
I've read about Dask and it is a tool I want to have in my belt, especially
for using the SGE connection in order to run GridSearchCV on the
supercomputer center I have access to. Should it work as promised, it will
be one of my favourites.

As far as my toy example goes, I have more limited goals with this graph: I
am not currently interested in parallelizing each step, as I guess that
parallelizing each graph fit through GridSearchCV will be closer to what I
need.

I keep working on a proof concept. You can have a look at:

https://github.com/mcasl/PAELLA/blob/master/pipeGraph.py

along with a few unit tests:
https://github.com/mcasl/PAELLA/blob/master/tests/test_pipeGraph.py

As of today, I have an iterable graph of steps that can be fitted/run
depending on their role (some can be disabled during run while active during
fit, or vice versa). I still have to play a bit with injecting different
parameters to make it compatible with GridSearchCV and learn a bit about the
memory options in order to cache results.

Any comments highly appreciated, truly!
Manolo




2017-12-30 15:34 GMT+01:00 Frédéric Bastien :

> This start to look as the dask project. Do you know it?
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-26 Thread Manuel Castejón Limas
I'm elaborating on the graph idea: a dictionary to describe the graph, the
networkx package to support the graph and run it in topological order, and
some wrappers for scikit-learn models.

I'm currently thinking of putting some more effort into a contrib project.
It could be something inspired by this example.

Manolo

#-



graph_description = {
  'First':
  {'operation': First_Step,
   'input': {'X':X, 'y':y}},

  'Concatenate_Xy':
  {'operation': ConcatenateData_Step,
   'input': [('First', 'X'),
 ('First', 'y')]},

  'Gaussian_Mixture':
  {'operation': Gaussian_Mixture_Step,
   'input': [('Concatenate_Xy', 'data')]},

  'Dbscan':
  {'operation': Dbscan_Step,
   'input': [('Concatenate_Xy', 'data')]},

  'CombineClustering':
  {'operation': CombineClustering_Step,
   'input': [('Dbscan', 'classification'),
 ('Gaussian_Mixture', 'classification')]},

  'Paella':
  {'operation': Paella_Step,
   'input': [('First', 'X'),
 ('First', 'y'),
 ('Concatenate_Xy', 'data'),
 ('CombineClustering', 'classification')]},

  'Regressor':
  {'operation': Regressor_Step,
   'input': [('First', 'X'),
 ('First', 'y'),
 ('Paella', 'sample_weight')]},

  'Last':
  {'operation': Last_Step,
   'input': [('Regressor', 'regressor')]},

 }

#%%
import networkx as nx  # needed for the graph support mentioned above

def create_graph(description):
    cg = nx.DiGraph()
    cg.add_nodes_from(description)
    for current_name, info in description.items():
        current_node = cg.node[current_name]
        current_node['operation'] = info['operation'](graph=cg,
                                                      node_name=current_name)
        current_node['input'] = info['input']
        if current_name != 'First':
            for ascendant in set(name for name, attribute in info['input']):
                cg.add_edge(ascendant, current_name)
    return cg
#%%
cg = create_graph(graph_description)

node_pos = {'First'    : ( 0, 0),
'Concatenate_Xy'   : ( 2, 4),
'Gaussian_Mixture' : ( 6, 8),
'Dbscan'   : ( 6, 6),
'CombineClustering': ( 8, 7),
'Paella'   : (10, 2),
'Regressor': (12, 0),
'Last' : (16, 0)
}

nx.draw(cg, pos=node_pos, with_labels=True)

#%%

print("=")
for name in nx.topological_sort(cg):
print("Running: ", name)
cg.node[name]['operation'].fit()

print("=")







2017-12-22 12:09 GMT+01:00 Manuel Castejón Limas 
:

> I'm currently thinking of a computational graph which can then be wrapped
> as a pipeline-like object ... I'll try to make a toy example solving my
> problem.
>
> On Dec 20, 2017, 4:33 PM, "Manuel Castejón Limas" wrote:
>
>> Thank you all for your interest!
>>
>> In order to clarify the case allow me to try to synthesize the spirit of
>> what I'd like to put into the pipeline using this sequence of steps:
>>
>> #%%
>> import pandas as pd
>> import numpy as np
>> import matplotlib.pyplot as plt
>>
>> from sklearn.cluster import DBSCAN
>> from sklearn.mixture import GaussianMixture
>> from sklearn.model_selection import train_test_split
>>
>> np.random.seed(seed=42)
>>
>> """
>> Data preparation
>> """
>>
>> URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/
>> sin_60_percent_noise.csv"
>> data = pd.read_csv(URL, usecols=['V1','V2'])
>> X, y = data[['V1']], data[['V2']]
>>
>> 

Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-22 Thread Manuel Castejón Limas
I'm currently thinking of a computational graph which can then be wrapped
as a pipeline-like object ... I'll try to make a toy example solving my
problem.

On Dec 20, 2017, 4:33 PM, "Manuel Castejón Limas" wrote:

> Thank you all for your interest!
>
> In order to clarify the case allow me to try to synthesize the spirit of
> what I'd like to put into the pipeline using this sequence of steps:
>
> #%%
> import pandas as pd
> import numpy as np
> import matplotlib.pyplot as plt
>
> from sklearn.cluster import DBSCAN
> from sklearn.mixture import GaussianMixture
> from sklearn.model_selection import train_test_split
>
> np.random.seed(seed=42)
>
> """
> Data preparation
> """
>
> URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/
> sin_60_percent_noise.csv"
> data = pd.read_csv(URL, usecols=['V1','V2'])
> X, y = data[['V1']], data[['V2']]
>
> (data_train, data_test,
>  X_train, X_test,
>  y_train, y_test) = train_test_split(data, X, y)
>
> """
> Parameters setup
> """
>
> dbscan__eps = 0.06
>
> mclust__n_components = 3
>
> paella__noise_label = -1
> paella__max_it = 20,
> paella__regular_size = 400,
> paella__minimum_size = 100,
> paella__width_r = 0.99,
> paella__n_neighbors = 5,
> paella__power = 30,
> paella__random_state = None
>
> #%%
> """
> DBSCAN clustering to detect noise suspects (label == -1)
> """
>
> dbscan_input = data_train
>
> dbscan_clustering = DBSCAN(eps = dbscan__eps)
>
> dbscan_output = dbscan_clustering.fit_predict(dbscan_input)
>
> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
> c=np.int64(dbscan_output == -1))
>
> #%%
> """
> GaussianMixture fitted with filtered data_train in order to help locate
> the ellipsoids
> but predict is applied to the whole data_train set.
> """
>
> mclust_input = data_train[dbscan_output != -1]
>
> mclust_clustering = GaussianMixture(n_components = mclust__n_components)
> mclust_clustering.fit(mclust_input)
>
> mclust_output = mclust_clustering.predict(data_train)
>
> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
> c=mclust_output)
>
> #%%
> """
> mclust and dbscan results are combined.
> """
>
> clustering_output = mclust_output.copy()
> clustering_output[dbscan_output == -1] =  -1
>
> plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
> c=clustering_output)
>
> #%%
> """
> Old-good Paella paper: https://link.springer.
> com/article/10.1023/B:DAMI.031630.50685.7c
>
> The Paella algorithm calculates sample_weight to be used by the final step
> regressor
> (Yes, it is an outlier detection algorithm but we are focusing now on this
> interesting collateral result). I am currently aggressively changing the
> code in order to make it fit somehow with the pipeline
> """
>
> from paella import Paella
>
> paella_input = pd.concat([data, clustering_output], axis=1, inplace=False)
>
> paella_run = Paella(noise_label = paella__noise_label,
> max_it = paella__max_it,
> regular_size = paella__regular_size,
> minimum_size = paella__minimum_size,
> width_r = paella__width_r,
> n_neighbors = paella__n_neighbors,
> power = paella__power,
> random_state = paella__random_state)
>
> paella_output = paella_run.fit_predict(paella_input, y_train)
> # paella_output is a vector with sample_weight
>
> #%%
> """
> Here we fit a regressor using sample_weight=paella_output
> """
> from sklearn.linear_model import LinearRegression
>
> regressor_input=X_train
> lm = LinearRegression()
> lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output)
> regressor_output = lm.predict(X_train)
>
> #...
>
> In this example we can see that:
> - A particular step might need results produced not necessarily from the
> immediately previous step.
> - The X parameter is not sequentially transformed. Sometimes we might need
> to skip to a previous step
> - y sometimes is the target, sometimes is not. For the regressor it is
> indeed, but for the paella algorithm the prediction is expressed as a
> vector representing sample_weights.
>
> All in all the conclusion is that the chain of proce

Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-20 Thread Manuel Castejón Limas
Thank you all for your interest!

In order to clarify the case allow me to try to synthesize the spirit of
what I'd like to put into the pipeline using this sequence of steps:

#%%
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.cluster import DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split

np.random.seed(seed=42)

"""
Data preparation
"""

URL = "https://raw.githubusercontent.com/mcasl/PAELLA/master/data/sin_60_
percent_noise.csv"
data = pd.read_csv(URL, usecols=['V1','V2'])
X, y = data[['V1']], data[['V2']]

(data_train, data_test,
 X_train, X_test,
 y_train, y_test) = train_test_split(data, X, y)

"""
Parameters setup
"""

dbscan__eps = 0.06

mclust__n_components = 3

paella__noise_label = -1
paella__max_it = 20
paella__regular_size = 400
paella__minimum_size = 100
paella__width_r = 0.99
paella__n_neighbors = 5
paella__power = 30
paella__random_state = None

#%%
"""
DBSCAN clustering to detect noise suspects (label == -1)
"""

dbscan_input = data_train

dbscan_clustering = DBSCAN(eps = dbscan__eps)

dbscan_output = dbscan_clustering.fit_predict(dbscan_input)

plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
c=np.int64(dbscan_output == -1))

#%%
"""
GaussianMixture fitted with filtered data_train in order to help locate the
ellipsoids
but predict is applied to the whole data_train set.
"""

mclust_input = data_train[dbscan_output != -1]

mclust_clustering = GaussianMixture(n_components = mclust__n_components)
mclust_clustering.fit(mclust_input)

mclust_output = mclust_clustering.predict(data_train)

plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
c=mclust_output)

#%%
"""
mclust and dbscan results are combined.
"""

clustering_output = mclust_output.copy()
clustering_output[dbscan_output == -1] =  -1

plt.scatter(data_train['V1'], data_train['V2'], cmap='cool', alpha=0.1,
c=clustering_output)

#%%
"""
The good old Paella paper: https://link.springer.com/article/10.1023/B:DAMI.031630.50685.7c

The Paella algorithm calculates the sample_weight to be used by the final-step
regressor
(yes, it is an outlier detection algorithm, but we are focusing now on this
interesting collateral result). I am currently aggressively changing the
code in order to make it fit somehow with the pipeline.
"""

from paella import Paella

paella_input = pd.concat([data_train,
                          pd.Series(clustering_output, index=data_train.index,
                                    name='classification')], axis=1)

paella_run = Paella(noise_label = paella__noise_label,
max_it = paella__max_it,
regular_size = paella__regular_size,
minimum_size = paella__minimum_size,
width_r = paella__width_r,
n_neighbors = paella__n_neighbors,
power = paella__power,
random_state = paella__random_state)

paella_output = paella_run.fit_predict(paella_input, y_train)
# paella_output is a vector with sample_weight

#%%
"""
Here we fit a regressor using sample_weight=paella_output
"""
from sklearn.linear_model import LinearRegression

regressor_input=X_train
lm = LinearRegression()
lm.fit(X=regressor_input, y=y_train, sample_weight=paella_output)
regressor_output = lm.predict(X_train)

#...

In this example we can see that:
- A particular step might need results produced not necessarily by the
immediately previous step.
- The X parameter is not sequentially transformed. Sometimes we might need
to skip back to a previous step.
- y sometimes is the target and sometimes is not. For the regressor it is
indeed, but for the Paella algorithm the prediction is expressed as a
vector representing sample_weights.

All in all, the conclusion is that the chain of processes is not as linear
as the current API imposes. I guess that all these difficulties could be
solved by:
- Passing a dictionary through the different steps containing the partial
results that the following steps will need.
- As a Christmas gift :-), a reference to the pipeline itself inserted in
that dictionary could provide access to the internal status of the previous
steps, should it be needed.

Another interesting study case with similar needs would be a regressor
using a previous clustering step in order to fit one model per cluster. In
such a case, the clustering results would be needed during the fitting.


Thanks for your interest!
Manolo
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Manuel Castejón Limas
Eager to learn! Diving on the code right now!

Thanks for the tip!
Manuel

2017-12-19 14:18 GMT+01:00 Guillaume Lemaître :

> I think that you could use imbalanced-learn regarding the issue that
> you have with the y.
> You should be able to wrap your clustering inside the FunctionSampler (
> https://github.com/scikit-learn-contrib/imbalanced-learn/pull/342 - we
> are on the way to merge it)
>
> On 19 December 2017 at 13:44, Manuel Castejón Limas <
> manuel.caste...@gmail.com> wrote:
>
>> Dear all,
>>
>> Kudos to scikit-learn! Having said that, Pipeline is killing me not being
>> able to transform anything other than X.
>>
>> My current study case would need:
>> - Transformers being able to handle both X and y, e.g. clustering X and y
>> concatenated
>> - Pipeline being able to change other params, e.g. sample_weight
>>
>> Currently, I'm augmenting X through every step with the extra information
>> which seems to work ok for my_pipe.fit_transform(X_train,y_train) but
>> breaks on my_pipe.transform(X_test) for the lack of the y parameter. Ok, I
>> can inherit and modify a descendant from Pipeline class to allow the y
>> parameter which is not ideal but I guess it is an option. The gritty part
>> comes when having to adapt every regressor at the end of the ladder in
>> order to split the extra information from the raw data in X and not being
>> able to generate more than one subproduct from each preprocessing step
>>
>> My current research involves clustering the data and using that
>> classification along with X in order to predict outliers which generates
>> sample_weight info and I would love to use that on the final regressor.
>> Currently there seems not to be another option than pasting that info on X.
>>
>> All in all, I'm stuck with this API limitation and I would love to learn
>> some tricks from you if you could enlighten me.
>>
>> Thanks in advance!
>>
>> Manuel Castejón-Limas
>>
>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
>
> --
> Guillaume Lemaitre
> INRIA Saclay - Parietal team
> Center for Data Science Paris-Saclay
> https://glemaitre.github.io/
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Manuel Castejón Limas
Wow, that seems promising. I'll read the imbalanced-learn code with interest.
Thanks for the info!
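
For the record, a minimal sketch of the sampler idea in the spirit of
imbalanced-learn's outlier-rejection example (assuming its FunctionSampler
and Pipeline; the data and the rejection rule are just illustrative):

from imblearn import FunctionSampler
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

def drop_outliers(X, y):
    # Keep only the samples IsolationForest flags as inliers (label == 1).
    keep = IsolationForest(random_state=0).fit_predict(X) == 1
    return X[keep], y[keep]

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
pipe = Pipeline([("reject", FunctionSampler(func=drop_outliers)),
                 ("classify", LogisticRegression())])
pipe.fit(X, y)     # the sampler modifies X and y here only
pipe.predict(X)    # it is skipped at prediction time

That fit-only asymmetry is exactly what my use case needs.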
Manuel


2017-12-19 14:15 GMT+01:00 Christos Aridas :

> Hey Manuel,
>
> In imbalanced-learn we have an extra type of estimators, named Samplers,
> which are able to modify X and y, at the same time, with the use of new API
> methods, sample and fit_sample.
> Also, we have adopted a modified version of scikit-learn's Pipeline class
> where we allow subsequent transformations using samplers and transformers.
> Despite the fact that the package deals with imbalanced datasets the
> aforementioned objects may help your pipeline.
>
> Cheerz,
> Chris
>
> On Tue, Dec 19, 2017 at 2:44 PM, Manuel Castejón Limas <
> manuel.caste...@gmail.com> wrote:
>
>> Dear all,
>>
>> Kudos to scikit-learn! Having said that, Pipeline is killing me not being
>> able to transform anything other than X.
>>
>> My current study case would need:
>> - Transformers being able to handle both X and y, e.g. clustering X and y
>> concatenated
>> - Pipeline being able to change other params, e.g. sample_weight
>>
>> Currently, I'm augmenting X through every step with the extra information
>> which seems to work ok for my_pipe.fit_transform(X_train,y_train) but
>> breaks on my_pipe.transform(X_test) for the lack of the y parameter. Ok, I
>> can inherit and modify a descendant from Pipeline class to allow the y
>> parameter which is not ideal but I guess it is an option. The gritty part
>> comes when having to adapt every regressor at the end of the ladder in
>> order to split the extra information from the raw data in X and not being
>> able to generate more than one subproduct from each preprocessing step
>>
>> My current research involves clustering the data and using that
>> classification along with X in order to predict outliers which generates
>> sample_weight info and I would love to use that on the final regressor.
>> Currently there seems not to be another option than pasting that info on X.
>>
>> All in all, I'm stuck with this API limitation and I would love to learn
>> some tricks from you if you could enlighten me.
>>
>> Thanks in advance!
>>
>> Manuel Castejón-Limas
>>
>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Any plans on generalizing Pipeline and transformers?

2017-12-19 Thread Manuel Castejón Limas
Dear all,

Kudos to scikit-learn! Having said that, Pipeline is killing me by not being
able to transform anything other than X.

My current study case would need:
- Transformers being able to handle both X and y, e.g. clustering X and y
concatenated
- Pipeline being able to change other params, e.g. sample_weight

Currently, I'm augmenting X through every step with the extra information,
which seems to work OK for my_pipe.fit_transform(X_train, y_train) but breaks
on my_pipe.transform(X_test) for lack of the y parameter. OK, I can inherit
from the Pipeline class and modify a descendant to allow the y parameter,
which is not ideal, but I guess it is an option. The gritty part comes when
having to adapt every regressor at the end of the ladder in order to split
the extra information from the raw data in X, and when not being able to
generate more than one subproduct from each preprocessing step.
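
For the record, a minimal sketch of that augmentation approach, reduced to
clustering X only so that transform keeps working without y (the class and
parameter names are made up); clustering X and y jointly is precisely what
breaks:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans

class AddClusterLabel(BaseEstimator, TransformerMixin):
    """Append each sample's KMeans cluster label as an extra column of X."""

    def __init__(self, n_clusters=3):
        self.n_clusters = n_clusters

    def fit(self, X, y=None):
        self.kmeans_ = KMeans(n_clusters=self.n_clusters).fit(X)
        return self

    def transform(self, X):
        labels = self.kmeans_.predict(X).reshape(-1, 1)
        return np.hstack([X, labels])

The downstream estimator then has to know that the last column is a label
rather than a feature, which is exactly the awkward bookkeeping I mean.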

My current research involves clustering the data and using that
classification along with X in order to predict outliers, which generates
sample_weight info that I would love to use in the final regressor.
Currently there seems to be no option other than pasting that info onto X.

All in all, I'm stuck with this API limitation and I would love to learn
some tricks from you if you could enlighten me.

Thanks in advance!

Manuel Castejón-Limas
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Fwd: sample_weight parameter is not split when used in GridSearchCV

2017-06-25 Thread Manuel CASTEJÓN LIMAS
Yes, I guess most users will be happy without using weights. Some will need
to use one single vector, but I am currently researching a weighting method,
hence my need to evaluate multiple weight vectors.

I understand that this seems to be a very specific issue with a simple
workaround, most likely not worth any programming effort yet, as there are
more important issues to address.

I guess that adding a note on this behaviour to the documentation would be
great. If some parameters can be iterated over while others are not
supported, knowing it gives the user base more solid ground to stand on.

I'm committed to spending a few hours studying the code. Should I be
successful, I will come back with a pull request.
I'll cross my fingers :-)
Best
Manolo



On 24 Jun 2017 20:05, "Julio Antonio Soto de Vicente"  wrote:

Joel is right.

In fact, you usually don't want to tune a lot the sample weights: you may
leave them default, set them in order to balance classes, or fix them
according to some business rule.

That said, you can always run a couple of grid searchs changing that sample
weights and compare results afterwards.

--
Julio

El 24 jun 2017, a las 15:51, Joel Nothman  escribió:

yes, trying multiple sample weightings is not supported by grid search
directly.

On 23 Jun 2017 6:36 pm, "Manuel Castejón Limas" 
wrote:

> Dear Joel,
>
> I tried and removed the square brackets and now it works as expected *for
> a single* sample_weight vector:
>
> validator = GridSearchCV(my_Regressor,
>  param_grid={'number_of_hidden_neurons': range(4, 5),
>  'epochs': [50],
> },
>  fit_params={'sample_weight':  my_sample_weights },
>  n_jobs=1,
> )
> validator.fit(x, y)
>
> The problem now is that I want to try multiple trainings with multiple
> sample_weight parameters, in the following fashion:
>
> validator = GridSearchCV(my_Regressor,
>  param_grid={'number_of_hidden_neurons': range(4, 5),
>  'epochs': [50],
>  'sample_weight':  [my_sample_weights, 
> my_sample_weights**2] ,
> },
>  fit_params={},
>  n_jobs=1,
> )
> validator.fit(x, y)
>
> But unfortunately it produces the same error again:
>
> ValueError: Found a sample_weight array with shape (1000,) for an input
> with shape (666, 1). sample_weight cannot be broadcast.
>
> I guess that the issue is that the sample_weight parameter was not
> thought to be changed during the tuning, was it?
>
>
> Thank you all for your patience and support.
> Best
> Manolo
>
>
>
>
> 2017-06-23 1:17 GMT+02:00 Manuel CASTEJÓN LIMAS :
>
>> Dear Joel,
>> I'm just passing an iterable as I would do with any other sequence of
>> parameters to tune. In this case the list only has one element to use but
>> in general I ought to be able to pass a collection of vectors.
>> Anyway, I guess that that issue is not the cause of the problem.
>>
>> On 23 Jun 2017 1:04 a.m., "Joel Nothman"  wrote:
>>
>>> why are you passing [my_sample_weights] rather than just
>>> my_sample_weights?
>>>
>>>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>


___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Fwd: sample_weight parameter is not split when used in GridSearchCV

2017-06-23 Thread Manuel Castejón Limas
Dear Joel,

I tried and removed the square brackets and now it works as expected *for a
single* sample_weight vector:

validator = GridSearchCV(my_Regressor,
                         param_grid={'number_of_hidden_neurons': range(4, 5),
                                     'epochs': [50],
                                     },
                         fit_params={'sample_weight': my_sample_weights},
                         n_jobs=1,
                         )
validator.fit(x, y)

The problem now is that I want to try multiple trainings with multiple
sample_weight parameters, in the following fashion:

validator = GridSearchCV(my_Regressor,
                         param_grid={'number_of_hidden_neurons': range(4, 5),
                                     'epochs': [50],
                                     'sample_weight': [my_sample_weights,
                                                       my_sample_weights**2],
                                     },
                         fit_params={},
                         n_jobs=1,
                         )
validator.fit(x, y)

But unfortunately it produces the same error again:

ValueError: Found a sample_weight array with shape (1000,) for an input
with shape (666, 1). sample_weight cannot be broadcast.

I guess that the issue is that the sample_weight parameter was not thought
to be changed during the tuning, was it?
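
A workaround sketch, assuming the single-vector call above keeps working
(same 0.18-era fit_params argument and the same placeholder names as the
snippets above): run one grid search per candidate weight vector and compare
the results afterwards.

from sklearn.model_selection import GridSearchCV

results = {}
for name, weights in [('w', my_sample_weights),
                      ('w_squared', my_sample_weights ** 2)]:
    validator = GridSearchCV(my_Regressor,
                             param_grid={'number_of_hidden_neurons': range(4, 5),
                                         'epochs': [50]},
                             fit_params={'sample_weight': weights},
                             n_jobs=1)
    validator.fit(x, y)
    results[name] = (validator.best_score_, validator.best_params_)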


Thank you all for your patience and support.
Best
Manolo




2017-06-23 1:17 GMT+02:00 Manuel CASTEJÓN LIMAS :

> Dear Joel,
> I'm just passing an iterable as I would do with any other sequence of
> parameters to tune. In this case the list only has one element to use but
> in general I ought to be able to pass a collection of vectors.
> Anyway, I guess that that issue is not the cause of the problem.
>
> On 23 Jun 2017 1:04 a.m., "Joel Nothman"  wrote:
>
>> why are you passing [my_sample_weights] rather than just
>> my_sample_weights?
>>
>>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Fwd: sample_weight parameter is not split when used in GridSearchCV

2017-06-22 Thread Manuel CASTEJÓN LIMAS
Dear Joel,
I'm just passing an iterable, as I would with any other sequence of
parameters to tune. In this case the list only has one element, but in
general I ought to be able to pass a collection of vectors.
Anyway, I guess that issue is not the cause of the problem.

On 23 Jun 2017 1:04 a.m., "Joel Nothman"  wrote:

> why are you passing [my_sample_weights] rather than just my_sample_weights?
>
> On 23 Jun 2017 7:49 am, "Julio Antonio Soto de Vicente" 
> wrote:
>
>> Hi Manuel,
>>
>> Are you sure that you are using the latest version (or at least >0.17)?
>> The code for splitting the sample weights in GridSearchCV has been there
>> for a while now...
>>
>> --
>> Julio
>>
>> On 22 Jun 2017, at 23:33, Manuel Castejón Limas <manuel.caste...@gmail.com> wrote:
>>
>> Dear all,
>> I posted the full question on StackOverflow and as it contains some
>> figures I refer you to that post.
>>
>> https://stackoverflow.com/questions/44661926/sample-weight-parameter-shape-error-in-scikit-learn-gridsearchcv/44662285#44662285
>>
>> I currently  believe that this issue is a bug and I opened an issue on
>> GitHub.
>>
>> To sum up, the issue is that GridSearchCV does not handle the splitting
>> of the sample_weight vector during cross validation.
>>
>> Nota bene: cross_val_score seems to handle the splitting OK, this issue
>> seems to occur only in GridSearchCV.
>>
>> Any comments enlightening me and showing me how wrong I am are most
>> welcome.
>>
>>
>>
>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
>>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Fwd: sample_weight parameter is not split when used in GridSearchCV

2017-06-22 Thread Manuel CASTEJÓN LIMAS
Hello Antonio,

Sure:

import sklearn
print(sklearn.__version__)
0.18.1


The error suggests that the fit function expects a sample_weight vector
already split to the training fold size (2/3 * 1000 ≈ 666 samples for a
3-fold split), but the whole vector (size 1000) is passed.

...
ValueError: Found a sample_weight array with shape (1000,) for an
input with shape (666, 1). sample_weight cannot be broadcast.



On 22 Jun 2017 11:49 p.m., "Julio Antonio Soto de Vicente"  wrote:

Hi Manuel,

Are you sure that you are using the latest version (or at least >0.17)? The
code for splitting the sample weights in GridSearchCV has been there for a
while now...

--
Julio

On 22 Jun 2017, at 23:33, Manuel Castejón Limas <manuel.caste...@gmail.com> wrote:

Dear all,
I posted the full question on StackOverflow and as it contains some figures
I refer you to that post.

https://stackoverflow.com/questions/44661926/sample-weight-parameter-shape-error-in-scikit-learn-gridsearchcv/44662285#44662285

I currently  believe that this issue is a bug and I opened an issue on
GitHub.

To sum up, the issue is that GridSearchCV does not handle the splitting of
the sample_weight vector during cross validation.

Nota bene: cross_val_score seems to handle the splitting OK, this issue
seems to occur only in GridSearchCV.

Any comments enlightening me and showing me how wrong I am are most welcome.




___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Fwd: sample_weight parameter is not split when used in GridSearchCV

2017-06-22 Thread Manuel Castejón Limas
Dear all,
I posted the full question on StackOverflow and as it contains some figures
I refer you to that post.

https://stackoverflow.com/questions/44661926/sample-weight-parameter-shape-error-in-scikit-learn-gridsearchcv/44662285#44662285

I currently  believe that this issue is a bug and I opened an issue on
GitHub.

To sum up, the issue is that GridSearchCV does not handle the splitting of
the sample_weight vector during cross validation.

Nota bene: cross_val_score seems to handle the splitting OK, this issue
seems to occur only in GridSearchCV.
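
For reference, the cross_val_score call that reportedly does split the
weights looks roughly like this (estimator, X, y and sample_weight are
placeholders; fit_params is the keyword of the 0.18-era API):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(estimator, X, y, cv=3,
                         fit_params={'sample_weight': sample_weight})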

Any comments enlightening me and showing me how wrong I am are most welcome.
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn