Re: [scikit-learn] Pipegraph example: KMeans + LDA
On 10/29/18 8:08 AM, Manuel Castejón Limas wrote:
> The long story short: Thank you for your time & sorry for the inaccuracies; a few words selling a modular approach to your developments; and a request for your opinion on parallelizing Pipegraph using dask.

I'm not very experienced with dask, so I'm probably not the right person to help you. And I totally get that pipegraph is more flexible than whatever hack I came up with :)

In the meantime Microsoft launched NimbusML:
https://docs.microsoft.com/en-us/nimbusml/overview

It actually implements something very similar to pipegraph on top of ML.NET, FYI. And I also gave the MS people a hard time when discussing their pipeline object ;)

I'm still not entirely convinced this is necessary, but for NimbusML, the underlying library is built with the DAG in mind. So different algorithms have different output slots that you can tap into, while sklearn basically "only" has transform and predict (and predict_proba).

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
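A concrete illustration of that "only transform and predict" limitation (my own sketch, not from the thread): to feed KMeans labels into a downstream step of a plain sklearn Pipeline, you typically have to wrap predict as transform, which is roughly what a DAG-style output slot would give you for free. The wrapper name here is made up for the example:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.cluster import KMeans

class PredictAsTransform(BaseEstimator, TransformerMixin):
    """Expose an estimator's predict() output as its transform() output.

    KMeans.transform() returns distances to cluster centers; this wrapper
    returns the predicted labels instead, so they can flow down a Pipeline.
    """
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y=None):
        self.estimator.fit(X)
        return self

    def transform(self, X):
        # column vector of cluster labels rather than center distances
        return self.estimator.predict(X).reshape(-1, 1)

# two well-separated synthetic blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(20, 2), rng.randn(20, 2) + 8])

labels = PredictAsTransform(
    KMeans(n_clusters=2, n_init=10, random_state=0)
).fit_transform(X)
print(labels.shape)  # (40, 1)
```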
Re: [scikit-learn] Pipegraph example: KMeans + LDA
The long story short: Thank you for your time & sorry for the inaccuracies; a few words selling a modular approach to your developments; and a request for your opinion on parallelizing Pipegraph using dask.

Thank you Andreas for your patience showing me the sklearn ways. I admit that I'm still learning scikit-learn's capabilities, which is a tough thing as you all keep improving the library, as in this new release. Keep up the good work with your developments and your teaching to the community. In particular, I learned A LOT from your answer. Big thanks! I'm inlining my comments:

(...)

>> - KMeans is not a transformer but an estimator
>
> KMeans is a transformer in sklearn:
> http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.transform
>
> (you can't get the labels to be the output which is what you're doing
> here, but it is a transformer)

My bad! I saw the predict method and did not check the source code. It is true; from the code:

    class KMeans(BaseEstimator, ClusterMixin, TransformerMixin):

The point was that, as you guessed, one cannot just put a KMeans followed by an LDA in a pipeline without additional effort.

>> - LDA score function requires the y parameter, while its input does not
>> come from a known set of labels, but from the previous KMeans
>> - Moreover, the GridSearchCV.fit call would also require a 'y' parameter
>
> Not true if you provide a scoring that doesn't require y or if you don't
> specify scoring and the scoring method of the estimator doesn't require y.
>
> GridSearchCV.fit doesn't require y.

My bad again. I meant that without the scoring function and the CV iterator that you use below, GridSearchCV will call the scoring function of the final step, i.e. LDA, and LDA's scoring function wants a y. But please bear with me; I simply did not know the proper hacks. The test does not lie; I'm quite a newbie then :-)

> Can you provide a citation for that? That seems to heavily depend on the
> clustering algorithms and the classifier.
> To me, stability scoring seems more natural:
> https://arxiv.org/abs/1007.1075

Good to know, thank you for the reference. You are right about the dependence; it's all about the nature of the clustering and the classifier. But I was just providing a scenario, not necessarily advocating for this strategy as the solution to the number-of-clusters question.

> It's cool that this is possible, but I feel this is still not really a
> "killer application" in that this is not a very common pattern.

IMHO, the beauty of the example, if there is any :-D, was its simplicity and brevity. I agree that it is not a killer application, just a possible situation.

> Though I acknowledge that your code only takes 4 lines, while mine takes 8
> (though if we'd add NoSplitCV to sklearn mine would also only take 4 lines
> :P)
> I think pipegraph is cool, not meaning to give you a hard time ;)

Thank you again for your time. The thing is that I believe PipeGraph can be useful to you in terms of approaching your models in a modular way. I'm going to work on a second example implementing something similar to the VotingClassifier class to show you the approach.

The main weakness is the lack of parallelism in the inner workings of PipeGraph, which was never a concern for me since, as long as GridSearchCV can parallelize the training, I was OK with that grain size. But now I reckon that parallelization can be useful to you, in terms of approaching your models as a PipeGraph and getting parallelization for free without having to call joblib directly (thank you, joblib authors, for such a goodie). I guess that providing a dask backend for pipegraph would be nice. But let me continue with this issue after sending the VotingClassifier example :-)

Thanks, truly, I need to study hard!
Manuel
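On the grain-size point: parallelizing at the GridSearchCV level is already available through joblib's backend mechanism, with no explicit joblib calls in user code. A minimal sketch of that (plain sklearn on made-up data, not PipeGraph; note that KMeans' own score method needs no y, so fit takes only X):

```python
import numpy as np
from joblib import parallel_backend
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(100, 2)

grid = GridSearchCV(
    KMeans(n_init=10, random_state=0),
    param_grid={"n_clusters": [2, 3, 4]},
    cv=3,  # default scoring falls back to KMeans.score (no y required)
)

# The candidate-by-fold fits run under whatever joblib backend is active:
# "threading" here, "loky" (processes) by default, and a dask backend can
# be plugged in the same way once a dask.distributed client is running.
with parallel_backend("threading", n_jobs=2):
    grid.fit(X)  # no y needed

print(grid.best_params_)
```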
Re: [scikit-learn] Pipegraph example: KMeans + LDA
On 10/24/18 4:11 AM, Manuel Castejón Limas wrote:
> Dear all,
> as a way of improving the documentation of PipeGraph we intend to provide more examples of its usage. It was a popular demand to show application cases to motivate its usage, so here is a very simple case with two steps: a KMeans followed by an LDA.
>
> https://mcasl.github.io/PipeGraph/auto_examples/plot_Finding_Number_of_clusters.html#sphx-glr-auto-examples-plot-finding-number-of-clusters-py
>
> This short example points out the following challenges:
>
> - KMeans is not a transformer but an estimator

KMeans is a transformer in sklearn:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.transform

(you can't get the labels to be the output, which is what you're doing here, but it is a transformer)

> - LDA score function requires the y parameter, while its input does not
> come from a known set of labels, but from the previous KMeans
> - Moreover, the GridSearchCV.fit call would also require a 'y' parameter

Not true if you provide a scoring that doesn't require y, or if you don't specify scoring and the scoring method of the estimator doesn't require y.

GridSearchCV.fit doesn't require y.

> - It would be nice to have access to the output of the KMeans step as well.
>
> PipeGraph is capable of addressing these challenges.
>
> The rationale for this example lies in the identification-reconstruction realm. In a scenario where the class labels are unknown, we might want to associate the quality of the clustering structure with the capability of a later model to reconstruct this structure. So the basic idea here is that if LDA is capable of getting good results, it is because the information from the KMeans was good enough for that purpose, hinting at the discovery of a good structure.

Can you provide a citation for that? That seems to heavily depend on the clustering algorithm and the classifier.

To me, stability scoring seems more natural:
https://arxiv.org/abs/1007.1075

This does seem interesting as well, though; I hadn't thought about it.

It's cool that this is possible, but I feel this is still not really a "killer application" in that this is not a very common pattern. Also you could replicate something similar in sklearn with

    import numpy as np
    from sklearn.model_selection import cross_val_score

    def estimator_scorer(testing_estimator):
        def my_scorer(estimator, X, y=None):
            y = estimator.predict(X)
            return np.mean(cross_val_score(testing_estimator, X, y))
        return my_scorer

Though using that we'd be doing nested cross-validation on the test set... That's a bit of an issue in the current GridSearchCV implementation :-/ There's an issue by Joel somewhere to implement something that allows training without splitting, which is what you'd want here. You could run the outer grid-search with a custom cross-validation iterator that returns all indices as training and test set and only does a single split, though...

    from sklearn.utils.validation import _num_samples

    class NoSplitCV(object):
        def split(self, X, y=None, groups=None):
            # single "split": train and test on all samples
            indices = np.arange(_num_samples(X))
            yield indices, indices

        def get_n_splits(self, X=None, y=None, groups=None):
            return 1

Though I acknowledge that your code only takes 4 lines, while mine takes 8 (though if we'd add NoSplitCV to sklearn mine would also only take 4 lines :P)

I think pipegraph is cool, not meaning to give you a hard time ;)
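Putting the two snippets together, here is a sketch, on toy data of my own, of wiring a label-reconstruction scorer and a single-split CV object into GridSearchCV (a custom CV object needs both split and get_n_splits to satisfy GridSearchCV):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV, cross_val_score

def estimator_scorer(testing_estimator):
    """Score a clusterer by how well testing_estimator recovers its labels."""
    def my_scorer(estimator, X, y=None):
        y = estimator.predict(X)          # cluster labels become the target
        return np.mean(cross_val_score(testing_estimator, X, y))
    return my_scorer

class NoSplitCV:
    """Single 'split' that uses all samples for both training and testing."""
    def split(self, X, y=None, groups=None):
        indices = np.arange(len(X))
        yield indices, indices

    def get_n_splits(self, X=None, y=None, groups=None):
        return 1

# three well-separated synthetic blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(30, 2) + offset for offset in (0, 8, 16)])

grid = GridSearchCV(
    KMeans(n_init=10, random_state=0),
    param_grid={"n_clusters": [2, 3, 4]},
    scoring=estimator_scorer(LinearDiscriminantAnalysis()),
    cv=NoSplitCV(),
)
grid.fit(X)  # no y: the scorer derives labels from the clusterer
print(grid.best_params_, grid.best_score_)
```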
[scikit-learn] Pipegraph example: KMeans + LDA
Dear all,

As a way of improving the documentation of PipeGraph we intend to provide more examples of its usage. It was a popular demand to show application cases to motivate its usage, so here is a very simple case with two steps: a KMeans followed by an LDA.

https://mcasl.github.io/PipeGraph/auto_examples/plot_Finding_Number_of_clusters.html#sphx-glr-auto-examples-plot-finding-number-of-clusters-py

This short example points out the following challenges:

- KMeans is not a transformer but an estimator
- The LDA score function requires the y parameter, while its input does not come from a known set of labels, but from the previous KMeans
- Moreover, the GridSearchCV.fit call would also require a 'y' parameter
- It would be nice to have access to the output of the KMeans step as well.

PipeGraph is capable of addressing these challenges.

The rationale for this example lies in the identification-reconstruction realm. In a scenario where the class labels are unknown, we might want to associate the quality of the clustering structure with the capability of a later model to reconstruct this structure. So the basic idea here is that if LDA is capable of getting good results, it is because the information from the KMeans was good enough for that purpose, hinting at the discovery of a good structure.
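The rationale in the last paragraph can be sketched in a few lines of plain scikit-learn (a toy illustration on synthetic blobs of my own, not the PipeGraph example itself): fit KMeans for several values of k, then score each clustering by how well LDA can reconstruct the cluster labels under cross-validation:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
# three well-separated blobs, so k=3 should reconstruct cleanly
X = np.vstack([rng.randn(30, 2) + offset for offset in (0, 8, 16)])

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # mean CV accuracy of LDA at recovering the clustering from X
    scores[k] = cross_val_score(LinearDiscriminantAnalysis(), X, labels).mean()
print(scores)
```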