Re: [scikit-learn] Pipegraph example: KMeans + LDA

2018-11-06 Thread Andreas Mueller



On 10/29/18 8:08 AM, Manuel Castejón Limas wrote:
> The long story short: Thank you for your time & sorry for
> inaccuracies; a few words selling a modular approach to your
> developments; and a request for your opinion on parallelizing
> Pipegraph using dask.

I'm not very experienced with dask, so I'm probably not the right person
to help you.
And I totally get that pipegraph is more flexible than whatever hack I
came up with :)


In the meantime, Microsoft launched NimbusML:
https://docs.microsoft.com/en-us/nimbusml/overview

It actually implements something very similar to pipegraph on top of ML.NET.
FYI, I also gave the MS people a hard time when discussing their
pipeline object ;)


I'm still not entirely convinced this is necessary, but for NimbusML, the
underlying library is built with the DAG in mind. So different algorithms
have different output slots that you can tap into, while sklearn basically
"only" has transform and predict (and predict_proba).
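
To illustrate: funneling the labels through sklearn's transform slot takes
a custom wrapper along these lines (KMeansLabels is just an invented name
for this sketch, not an existing sklearn class):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans

class KMeansLabels(BaseEstimator, TransformerMixin):
    """Invented wrapper: exposes the KMeans labels (rather than the
    cluster distances) as the transform output, emulating an extra
    "output slot"."""

    def __init__(self, n_clusters=8):
        self.n_clusters = n_clusters

    def fit(self, X, y=None):
        self.kmeans_ = KMeans(n_clusters=self.n_clusters).fit(X)
        return self

    def transform(self, X):
        # KMeans.transform would return distances to the centers; here
        # we return the predicted labels as a single column instead.
        return self.kmeans_.predict(X).reshape(-1, 1)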



Re: [scikit-learn] Pipegraph example: KMeans + LDA

2018-10-29 Thread Manuel Castejón Limas
The long story short: Thank you for your time & sorry for inaccuracies; a
few words selling a modular approach to your developments; and a request
for your opinion on parallelizing Pipegraph using dask.

Thank you, Andreas, for your patience showing me the sklearn ways. I admit
that I'm still learning scikit-learn's capabilities, which is a tough thing
as you all keep improving the library, as in this new release. Keep up the
good work with your developments and your teaching of the community. In
particular, I learned A LOT from your answer. Big thanks!

I'm inlining my comments:

(...)

> - KMeans is not a transformer but an estimator
>
> KMeans is a transformer in sklearn:
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.transform
>
> (you can't get the labels to be the output, which is what you're doing
> here, but it is a transformer)
>
My bad! I saw the predict method and did not check the source code. It is
true; from the code:
class KMeans(BaseEstimator, ClusterMixin, TransformerMixin):
The point was that, as you guessed, one cannot put a KMeans followed by an
LDA in a pipeline just like that without additional effort.


> - LDA score function requires the y parameter, while its input does not
> come from a known set of labels, but from the previous KMeans
> - Moreover, the GridSearchCV.fit call would also require a 'y' parameter
>
> Not true if you provide a scoring that doesn't require y or if you don't
> specify scoring and the scoring method of the estimator doesn't require y.
>
> GridSearchCV.fit doesn't require y.
>
My bad again. I meant that without the scoring function and the CV
iterator that you use below, GridSearchCV will call the scoring function
of the final step, i.e. LDA, and LDA's scoring function wants a y. But
please bear with me, I simply did not know the proper hacks. The test does
not lie; I'm quite a newbie then :-)

> Can you provide a citation for that? That seems to heavily depend on the
> clustering algorithms and the classifier.
> To me, stability scoring seems more natural:
> https://arxiv.org/abs/1007.1075

Good to know, thank you for the reference. You are right about the
dependence; it's all about the nature of the clustering and the classifier.
But I was just providing a scenario, not necessarily advocating for this
strategy as the solution to the number-of-clusters question.

> It's cool that this is possible, but I feel this is still not really a
> "killer application" in that this is not a very common pattern.

IMHO, the beauty of the example, if there is any :-D, was the simplicity
and brevity. I agree that it is not a killer application, just a possible
situation.

> Though I acknowledge that your code only takes 4 lines, while mine takes 8
> (though if we'd add NoSplitCV to sklearn mine would also only take 4 lines
> :P)
>
> I think pipegraph is cool, not meaning to give you a hard time ;)

Thank you again for your time. The thing is that I believe PipeGraph can be
useful for you as a modular way of approaching your models. I'm going to
work on a second example implementing something similar to the
VotingClassifier class to show you the approach.
The main weakness is the lack of parallelism in the inner workings of
PipeGraph, which was never a concern for me since, as long as GridSearchCV
could parallelize the training, I was OK with that grain size. But now I
reckon that parallelization can be useful for you in terms of approaching
your models as a PipeGraph and getting parallelization for free without
having to directly call joblib (thank you, joblib authors, for such a
goodie). I guess that providing a dask backend for pipegraph would be nice.
But let me continue with this issue after sending the VotingClassifier
example :-)
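
For what it's worth, a minimal sketch of what I have in mind, assuming the
outer parallelization already goes through joblib (grid_search and X are
placeholders, and the local Client is just for illustration):

from dask.distributed import Client
from joblib import parallel_backend

client = Client()  # spins up a local dask cluster for the example

# Any joblib-backed parallelism (e.g. GridSearchCV with n_jobs=-1)
# dispatches its work to the dask cluster inside this context:
with parallel_backend("dask"):
    grid_search.fit(X)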

Thanks, truly, I need to study hard!
Manuel


Re: [scikit-learn] Pipegraph example: KMeans + LDA

2018-10-28 Thread Andreas Mueller


On 10/24/18 4:11 AM, Manuel Castejón Limas wrote:
> Dear all,
> as a way of improving the documentation of PipeGraph we intend to
> provide more examples of its usage. It was a popular demand to show
> application cases to motivate its usage, so here is a very simple
> case with two steps: a KMeans followed by an LDA.
>
> https://mcasl.github.io/PipeGraph/auto_examples/plot_Finding_Number_of_clusters.html#sphx-glr-auto-examples-plot-finding-number-of-clusters-py
>
> This short example points out the following challenges:
> - KMeans is not a transformer but an estimator


KMeans is a transformer in sklearn:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.transform

(you can't get the labels to be the output, which is what you're doing
here, but it is a transformer)


> - LDA score function requires the y parameter, while its input does
> not come from a known set of labels, but from the previous KMeans
>
> - Moreover, the GridSearchCV.fit call would also require a 'y' parameter


Not true if you provide a scoring that doesn't require y or if you don't 
specify scoring and the scoring method of the estimator doesn't require y.


GridSearchCV.fit doesn't require y.
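
For instance, a minimal sketch of a scoring callable that needs no y (the
silhouette choice here is just an illustration, not what the example uses):

from sklearn.metrics import silhouette_score

def unsupervised_scorer(estimator, X, y=None):
    # score the fitted estimator by the silhouette of its own labels
    return silhouette_score(X, estimator.predict(X))

# e.g. GridSearchCV(KMeans(), {'n_clusters': range(2, 10)},
#                   scoring=unsupervised_scorer).fit(X)  # no y required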

> - It would be nice to have access to the output of the KMeans step as
> well.
>
> PipeGraph is capable of addressing these challenges.

> The rationale for this example lies in the
> identification-reconstruction realm. In a scenario where the class
> labels are unknown, we might want to associate the quality of the
> clustering structure with the capability of a later model to
> reconstruct this structure. So the basic idea here is that if LDA is
> capable of getting good results, it was because the information of the
> KMeans was good enough for that purpose, hinting at the discovery of a
> good structure.


Can you provide a citation for that? That seems to heavily depend on the 
clustering algorithms and the classifier.

To me, stability scoring seems more natural: https://arxiv.org/abs/1007.1075
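
A rough, hypothetical sketch of that stability idea (cluster two
overlapping subsamples independently and compare their agreement on the
shared points; the function name and defaults are made up, and X is
assumed to be a numpy array):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def stability_score(X, n_clusters, n_rounds=20, random_state=0):
    rng = np.random.RandomState(random_state)
    n = len(X)
    scores = []
    for _ in range(n_rounds):
        # draw two overlapping subsamples and cluster each independently
        a = rng.choice(n, size=n // 2, replace=False)
        b = rng.choice(n, size=n // 2, replace=False)
        km_a = KMeans(n_clusters=n_clusters).fit(X[a])
        km_b = KMeans(n_clusters=n_clusters).fit(X[b])
        # agreement of the two clusterings on the shared points
        shared = np.intersect1d(a, b)
        scores.append(adjusted_rand_score(km_a.predict(X[shared]),
                                          km_b.predict(X[shared])))
    return np.mean(scores)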

This does seem interesting as well, though; I haven't thought about this.

It's cool that this is possible, but I feel this is still not really a 
"killer application" in that this is not a very common pattern.


Also, you could replicate something similar in sklearn with:

import numpy as np
from sklearn.model_selection import cross_val_score

def estimator_scorer(testing_estimator):
    def my_scorer(estimator, X, y=None):
        # use the fitted estimator's predictions as pseudo-labels and
        # score how well a fresh classifier can reconstruct them
        y = estimator.predict(X)
        return np.mean(cross_val_score(testing_estimator, X, y))
    return my_scorer

Though using that, we'd be doing nested cross-validation on the test set...
That's a bit of an issue in the current GridSearchCV implementation :-/
There's an issue by Joel somewhere to implement something that allows
training without splitting, which is what you'd want here.
You could run the outer grid-search with a custom cross-validation
iterator that returns all indices as training and test set and only does
a single split, though...


import numpy as np
from sklearn.utils.validation import _num_samples

class NoSplitCV(object):
    def split(self, X, y=None, groups=None):
        # a single "split" whose train and test sets are the full data
        indices = np.arange(_num_samples(X))
        yield indices, indices

    def get_n_splits(self, X=None, y=None, groups=None):
        # GridSearchCV asks the CV object how many splits to expect
        return 1
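
A hypothetical way to put the two pieces together (names invented; X is
assumed to be the data at hand):

from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV

# score each candidate n_clusters by how well LDA reconstructs the labels
search = GridSearchCV(KMeans(),
                      {'n_clusters': range(2, 10)},
                      scoring=estimator_scorer(LinearDiscriminantAnalysis()),
                      cv=NoSplitCV())
search.fit(X)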

Though I acknowledge that your code only takes 4 lines, while mine takes
8 (though if we'd add NoSplitCV to sklearn mine would also only take 4
lines :P)


I think pipegraph is cool, not meaning to give you a hard time ;)



[scikit-learn] Pipegraph example: KMeans + LDA

2018-10-24 Thread Manuel Castejón Limas
Dear all,
as a way of improving the documentation of PipeGraph we intend to provide
more examples of its usage. It was a popular demand to show application
cases to motivate its usage, so here is a very simple case with two
steps: a KMeans followed by an LDA.

https://mcasl.github.io/PipeGraph/auto_examples/plot_Finding_Number_of_clusters.html#sphx-glr-auto-examples-plot-finding-number-of-clusters-py

This short example points out the following challenges:
- KMeans is not a transformer but an estimator
- LDA score function requires the y parameter, while its input does not
come from a known set of labels, but from the previous KMeans
- Moreover, the GridSearchCV.fit call would also require a 'y' parameter
- It would be nice to have access to the output of the KMeans step as well.

PipeGraph is capable of addressing these challenges.

The rationale for this example lies in the identification-reconstruction
realm. In a scenario where the class labels are unknown, we might want to
associate the quality of the clustering structure with the capability of a
later model to reconstruct this structure. So the basic idea here is that
if LDA is capable of getting good results, it was because the information
of the KMeans was good enough for that purpose, hinting at the discovery
of a good structure.
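
In plain sklearn terms, an illustrative sketch of that idea (this is not
the PipeGraph code from the example; X stands for the data at hand):

from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# cluster first, then check how well LDA can reconstruct the labels;
# a high reconstruction score hints at a well-separated structure
labels = KMeans(n_clusters=3).fit_predict(X)
lda = LinearDiscriminantAnalysis().fit(X, labels)
reconstruction_score = lda.score(X, labels)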