Re: [scikit-learn] New core developers: thomasjpfan and nicolashug

2019-04-03 Thread Andreas Mueller
Congratulations guys! Great work! Looking forward to much more! Proud to have you on the team! Now we in NYC can approve our own pull requests ;)

Re: [scikit-learn] New core developers: thomasjpfan and nicolashug

2019-04-03 Thread Hanmin Qin
Congratulations and welcome to the team! Hanmin Qin

[scikit-learn] New core developers: thomasjpfan and nicolashug

2019-04-03 Thread Joel Nothman
The core developers of Scikit-learn have recently voted to welcome Thomas Fan and Nicolas Hug to the team, in recognition of their efforts and trustworthiness as contributors. Both happen to be working with Andy Mueller at Columbia University at the moment. Congratulations and thanks to them both!

Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Eric Ma
This is not a strongly-held suggestion - but what about adopting YellowBrick as the plotting API for sklearn? Not sure how exactly the interaction would work - could be PRs to their library, or ask them to integrate into sklearn, or do a lock-step dance with versions but maintain separate teams?

Re: [scikit-learn] Why is cross_val_predict discouraged?

2019-04-03 Thread Joel Nothman
Pull requests improving the documentation are always welcome. At a minimum, users need to know that these compute different things. Accuracy is not precision. Precision is the number of true positives divided by the number of true positives plus false positives. It therefore cannot be decomposed
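
As a rough illustration of why precision does not decompose over test folds, the sketch below compares the average of per-fold precision with precision computed on the pooled counts. The fold counts are made up for the example and are not taken from the thread.

```python
def precision(tp, fp):
    """Precision = true positives / (true positives + false positives)."""
    return tp / (tp + fp)

# Hypothetical (true positives, false positives) counts for two test folds.
folds = [(1, 0), (1, 3)]

# Averaging the per-fold precision scores.
per_fold_mean = sum(precision(tp, fp) for tp, fp in folds) / len(folds)

# Precision computed once on the counts pooled over all folds.
pooled_tp = sum(tp for tp, _ in folds)
pooled_fp = sum(fp for _, fp in folds)
pooled = precision(pooled_tp, pooled_fp)

print(per_fold_mean, pooled)  # the two numbers differ
```

With these counts the two quantities disagree, so a score computed on pooled out-of-fold predictions is not the same estimate as the mean of fold scores.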

[scikit-learn] How to answer questions from big documents?

2019-04-03 Thread Rodrigo Rosenfeld Rosas
Hi everyone, this is my first post here :) About two weeks ago, due to the low demand in my project, I was assigned a completely unusual request: to automatically extract answers from documents based on machine learning. I've never read anything about ML, AI or NLP before, so I've been

Re: [scikit-learn] Why is cross_val_predict discouraged?

2019-04-03 Thread Boris Hollas
On 03.04.19 at 13:59, Joel Nothman wrote: The equations in Murphy and Hastie very clearly assume a metric decomposable over samples (a loss function). Several popular metrics are not. For a metric like MSE it will be almost identical assuming the test sets have almost the same size. What will

Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Joel Nothman
With option 1, sklearn.plot is likely to import large chunks of the library (particularly, but not exclusively, if the plotting function "does the work" as Andy suggests). This is under the assumption that one plot function will want to import trees, another GPs, etc. Unless we move to lazy

Re: [scikit-learn] Why is cross_val_predict discouraged?

2019-04-03 Thread Gael Varoquaux
On Wed, Apr 03, 2019 at 08:54:51AM -0400, Andreas Mueller wrote: > If the loss decomposes, the result might be different b/c of different test > set sizes, but I'm not sure if they are "worse" in some way? Mathematically, a cross-validation estimates a double expectation: one expectation on the

Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Andreas Mueller
I think what was not clear from the question is that there are actually quite different kinds of plotting functions, and many of these are tied to existing code. Right now we have some that are specific to trees (plot_tree) and to gradient boosting (plot_partial_dependence). I think we want

Re: [scikit-learn] Why is cross_val_predict discouraged?

2019-04-03 Thread Andreas Mueller
On 4/3/19 7:59 AM, Joel Nothman wrote: The equations in Murphy and Hastie very clearly assume a metric decomposable over samples (a loss function). Several popular metrics are not. For a metric like MSE it will be almost identical assuming the test sets have almost the same size. For

Re: [scikit-learn] Why is cross_val_predict discouraged?

2019-04-03 Thread Joel Nothman
The equations in Murphy and Hastie very clearly assume a metric decomposable over samples (a loss function). Several popular metrics are not. For a metric like MSE it will be almost identical assuming the test sets have almost the same size. For something like Recall (sensitivity) it will be
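
The MSE claim can be checked with a small sketch: for a per-sample loss, the mean of per-fold MSE equals the MSE over pooled residuals when the folds have equal size, and drifts apart when they do not. The residual values below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mse(errors):
    """Mean of the squared errors."""
    return sum(e * e for e in errors) / len(errors)

def compare(folds):
    """Return (mean of per-fold MSE, MSE over all residuals pooled together)."""
    per_fold_mean = sum(mse(fold) for fold in folds) / len(folds)
    pooled = mse([e for fold in folds for e in fold])
    return per_fold_mean, pooled

# Hypothetical residuals (prediction - target) for each test fold.
equal_folds = [[1.0, 3.0], [2.0, 2.0]]      # same fold size
unequal_folds = [[1.0, 3.0, 1.0], [4.0]]    # different fold sizes

print(compare(equal_folds))    # both values agree
print(compare(unequal_folds))  # values differ
```

The gap in the second case comes purely from the weighting: averaging fold scores gives each fold equal weight, while pooling gives each sample equal weight.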

Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Roman Yurchak via scikit-learn
+1 for option 1 and +0.5 for 3. Do we anticipate that many plotting functions will be added? If it's just a dozen or fewer, putting them all into a single namespace sklearn.plot might be easier. This also would avoid discussion about where to put some generic plotting functions (e.g.

[scikit-learn] Why is cross_val_predict discouraged?

2019-04-03 Thread Boris Hollas
I use sum((cross_val_predict(model, X, y) - y)**2) / len(y) (*) to evaluate the performance of a model. This conforms with Murphy: Machine Learning, section 6.5.3, and Hastie et al: The Elements of Statistical Learning, eq. 7.48. However, according to the documentation of

Re: [scikit-learn] Can cluster help me to cluster data with length of continuous series?

2019-04-03 Thread Christian Braune
Hi, that does not really sound like a clustering but more like a preprocessing problem to me. For each item you want to calculate the length of the longest subsequence of "1"s. That could be done by a simple function and would create a new (one-dimensional) property for each of your items. You
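
A minimal sketch of the "simple function" suggested above, assuming each item is a sequence of 0/1 flags. The function name and the sample items are illustrative, not taken from the original question.

```python
from itertools import groupby

def longest_run_of_ones(bits):
    """Length of the longest consecutive run of 1s in a 0/1 sequence."""
    runs = (sum(1 for _ in group) for value, group in groupby(bits) if value == 1)
    return max(runs, default=0)

# Hypothetical access patterns keyed by item ID.
items = {
    "a": [1, 0, 0, 1],
    "b": [0, 0, 1, 1],
    "c": [0, 1, 1, 1],
}

# One-dimensional feature per item: the longest streak of accesses.
features = {item_id: longest_run_of_ones(bits) for item_id, bits in items.items()}
print(features)
```

The resulting single number per item can then be sorted, binned, or passed to a clustering algorithm if grouping is still needed.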

Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Trevor Stephens
I think #1 if any of these... Plotting functions should hopefully be as general as possible, so tagging with a specific type of estimator will, in some scikit-learn utopia, be unnecessary. If a general plotter is built, where does it live in other estimator-specific namespace options? Feels

Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Andrew Howe
My preference would be for (1). I don't think the sub-namespace in (2) is necessary, and don't like (3), as I would prefer the plotting functions to be all in the same namespace sklearn.plot. Andrew

[scikit-learn] Can cluster help me to cluster data with length of continuous series?

2019-04-03 Thread lampahome
I have data containing the access duration of each item. Example: t0~t4 are the access time slots; 1 means the item was accessed in that slot, 0 means it was not.
ID,t0,t1,t2,t3,t4
0,1,0,0,1
1,1,0,0,1
2,0,0,1,1
3,0,1,1,1
What I want to cluster by is the length of the continuous duration. Ex: ID=3 > 2 > 1