[scikit-learn] Can cluster help me to cluster data with length of continuous series?

2019-04-03 Thread lampahome
I have data which contain access duration of each items.

Ex: t0~t3 are the access time slots. 1 means the item was accessed in that
time slot, 0 means it was not.
ID,t0,t1,t2,t3
0,1,0,0,1
1,1,0,0,1
2,0,0,1,1
3,0,1,1,1

What I want to cluster on is the length of the longest continuous access
duration.
Ex:
ID 3 > ID 2 > ID 1 = ID 0

Is there any distance metric that can help with clustering based on the
length of the continuous duration?
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Andrew Howe
My preference would be for (1). I don't think the sub-namespace in (2) is
necessary, and don't like (3), as I would prefer the plotting functions to
be all in the same namespace sklearn.plot.

Andrew

<~~~>
J. Andrew Howe, PhD
LinkedIn Profile 
ResearchGate Profile 
Open Researcher and Contributor ID (ORCID)

Github Profile 
Personal Website 
I live to learn, so I can learn to live. - me
<~~~>


On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin  wrote:

> See https://github.com/scikit-learn/scikit-learn/issues/13448
>
> We've introduced several plotting functions (e.g., plot_tree and
> plot_partial_dependence) and will introduce more (e.g.,
> plot_decision_boundary) in the future. Consequently, we need to decide
> where to put these functions. Currently, there're 3 proposals:
>
> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>
> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
>
> (3) sklearn.XXX.plot.plot_YYY (e.g., sklearn.tree.plot.plot_tree, note
> that we won't support from sklearn.XXX import plot_YYY)
>
> Joel Nothman, Gael Varoquaux and I decided to post it on the mailing list
> to invite opinions.
>
> Thanks
>
> Hanmin Qin
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Trevor Stephens
I think #1 if any of these... Plotting functions should hopefully be as
general as possible, so tagging with a specific type of estimator will, in
some scikit-learn utopia, be unnecessary.

If a general plotter is built, where does it live in other
estimator-specific namespace options? Feels awkward to put it under every
estimator's namespace.

Then again, there might be a #4 where there is no plot module and plotting
classes live under groups of utilities like introspection, cross-validation
or something?...

On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe  wrote:

> My preference would be for (1). I don't think the sub-namespace in (2) is
> necessary, and don't like (3), as I would prefer the plotting functions to
> be all in the same namespace sklearn.plot.
>
> Andrew
>
> <~~~>
> J. Andrew Howe, PhD
> LinkedIn Profile 
> ResearchGate Profile 
> Open Researcher and Contributor ID (ORCID)
> 
> Github Profile 
> Personal Website 
> I live to learn, so I can learn to live. - me
> <~~~>
>
>
> On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin  wrote:
>
>> See https://github.com/scikit-learn/scikit-learn/issues/13448
>>
>> We've introduced several plotting functions (e.g., plot_tree and
>> plot_partial_dependence) and will introduce more (e.g.,
>> plot_decision_boundary) in the future. Consequently, we need to decide
>> where to put these functions. Currently, there're 3 proposals:
>>
>> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
>>
>> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
>>
>> (3) sklearn.XXX.plot.plot_YYY (e.g., sklearn.tree.plot.plot_tree, note
>> that we won't support from sklearn.XXX import plot_YYY)
>>
>> Joel Nothman, Gael Varoquaux and I decided to post it on the mailing list
>> to invite opinions.
>>
>> Thanks
>>
>> Hanmin Qin
>> ___
>> scikit-learn mailing list
>> scikit-learn@python.org
>> https://mail.python.org/mailman/listinfo/scikit-learn
>>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Can cluster help me to cluster data with length of continuous series?

2019-04-03 Thread Christian Braune
Hi,

that does not really sound like a clustering but more like a preprocessing
problem to me. For each item you want to calculate the length of the
longest subsequence of "1"s. That could be done by a simple function and
would create a new (one-dimensional) property for each of your items.
You could then apply any clustering algorithm to this feature (i.e. you'd
be clustering a one-dimensional dataset)...
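
A minimal sketch of that preprocessing step (the toy data and the choice of
KMeans are only illustrative assumptions):

import numpy as np
from sklearn.cluster import KMeans

def longest_run_of_ones(row):
    """Length of the longest run of consecutive 1s in a 0/1 sequence."""
    best = current = 0
    for value in row:
        current = current + 1 if value == 1 else 0
        best = max(best, current)
    return best

# one row per item, columns are the time slots from the example
X = np.array([[1, 0, 0, 1],
              [1, 0, 0, 1],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])

# new one-dimensional feature: longest access streak per item
streaks = np.array([longest_run_of_ones(row) for row in X]).reshape(-1, 1)

# any clustering algorithm can then be applied to this single feature
labels = KMeans(n_clusters=2, random_state=0).fit_predict(streaks)
print(streaks.ravel(), labels)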

Regards,
  Christian

lampahome wrote on Wed., 3 Apr. 2019 at 11:08:

> I have data which contain access duration of each items.
>
> EX: t0~t4 is the access time duration. 1 means the item was accessed in
> the time duration, 0 means not.
> ID,t0,t1,t2,t3,t4
> 0,1,0,0,1
> 1,1,0,0,1
> 2,0,0,1,1
> 3,0,1,1,1
>
> What I want to cluster is the length of continuous duration
> Ex:
> ID=3 > 2 > 1 = 0
>
> Can any distance metric to help clustering based on the length of
> continuous duration?
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Why is cross_val_predict discouraged?

2019-04-03 Thread Boris Hollas

I use

sum((cross_val_predict(model, X, y) - y)**2) / len(y)        (*)

to evaluate the performance of a model. This conforms with Murphy: 
Machine Learning, section 6.5.3, and Hastie et al: The Elements of 
Statistical Learning,  eq. 7.48. However, according to the documentation 
of cross_val_predict, "it is not appropriate to pass these predictions 
into an evaluation metric". While it is obvious that cross_val_predict 
is different from cross_val_score, I don't see what should be wrong with 
(*).
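
For concreteness, a small sketch of the two computations side by side (the
dataset and the Ridge estimator are arbitrary placeholders, not part of the
original question):

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict, cross_val_score

X, y = load_diabetes(return_X_y=True)
model = Ridge()

# (*): pooled squared error over the concatenated out-of-fold predictions
pred = cross_val_predict(model, X, y, cv=5)
pooled_mse = np.sum((pred - y) ** 2) / len(y)

# cross_val_score: MSE computed per fold, then averaged across folds
fold_mse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_mean_squared_error")
print(pooled_mse, fold_mse.mean())

With equally sized folds the two numbers coincide for a per-sample loss like
squared error.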


Also, the explanation that "cross_val_predict simply returns the labels (or
probabilities)" is unclear, if not wrong. As I understand it, this function
returns estimates, not labels or probabilities.


Regards, Boris

___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] LASSO: Predicted values show negative correlation with observed values on random data

2019-04-03 Thread Martin Watzenboeck
Hi Alex,

Thanks a lot for the answer! That does indeed explain this phenomenon.
Also, I now see that with my data I can get meaningful LASSO predictions
by tuning the alpha parameter.

Cheers,
Martin

On Tue., 2 Apr. 2019 at 21:33, Alexandre Gramfort <alexandre.gramf...@inria.fr> wrote:

> in your example with random data Lasso leads to coef_ of zeros so you get
> as prediction : np.mean(Y[train])
>
> you'll see the same phenomenon if you do:
>
> pred = np.r_[pred, np.mean(Y[train])]
>
> Alex
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Roman Yurchak via scikit-learn
+1 for option 1 and +0.5 for option 3. Do we anticipate that many plotting 
functions will be added? If it's just a dozen or less, putting them all 
into a single namespace sklearn.plot might be easier.

This also would avoid discussion about where to put some generic 
plotting functions (e.g. 
https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479).

Roman

On 03/04/2019 12:06, Trevor Stephens wrote:
> I think #1 if any of these... Plotting functions should hopefully be as 
> general as possible, so tagging with a specific type of estimator will, 
> in some scikit-learn utopia, be unnecessary.
> 
> If a general plotter is built, where does it live in other 
> estimator-specific namespace options? Feels awkward to put it under 
> every estimator's namespace.
> 
> Then again, there might be a #4 where there is no plot module and 
> plotting classes live under groups of utilities like introspection, 
> cross-validation or something?...
> 
> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe wrote:
> 
> My preference would be for (1). I don't think the sub-namespace in
> (2) is necessary, and don't like (3), as I would prefer the plotting
> functions to be all in the same namespace sklearn.plot.
> 
> Andrew
> 
> <~~~>
> J. Andrew Howe, PhD
> LinkedIn Profile 
> ResearchGate Profile 
> Open Researcher and Contributor ID (ORCID)
> 
> Github Profile 
> Personal Website 
> I live to learn, so I can learn to live. - me
> <~~~>
> 
> 
> On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin wrote:
> 
> See https://github.com/scikit-learn/scikit-learn/issues/13448
> 
> We've introduced several plotting functions (e.g., plot_tree and
> plot_partial_dependence) and will introduce more (e.g.,
> plot_decision_boundary) in the future. Consequently, we need to
> decide where to put these functions. Currently, there're 3
> proposals:
> 
> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
> 
> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
> 
> (3) sklearn.XXX.plot.plot_YYY (e.g.,
> sklearn.tree.plot.plot_tree, note that we won't support from
> sklearn.XXX import plot_YYY)
> 
> Joel Nothman, Gael Varoquaux and I decided to post it on the
> mailing list to invite opinions.
> 
> Thanks
> 
> Hanmin Qin
> ___
> scikit-learn mailing list
> scikit-learn@python.org 
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org 
> https://mail.python.org/mailman/listinfo/scikit-learn
> 


___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Why is cross_val_predict discouraged?

2019-04-03 Thread Joel Nothman
The equations in Murphy and Hastie very clearly assume a metric
decomposable over samples (a loss function). Several popular metrics
are not.

For a metric like MSE it will be almost identical assuming the test
sets have almost the same size. For something like Recall
(sensitivity) it will be almost identical assuming similar test set
sizes *and* stratification. For something like precision whose
denominator is determined by the biases of the learnt classifier on
the test dataset, you can't say the same. For something like ROC AUC
score, relying on some decision function that may not be equivalently
calibrated across splits, evaluating in this way is almost
meaningless.
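
A small numerical illustration of the precision point (the per-fold counts
are made up purely to show the effect):

import numpy as np

# hypothetical per-fold confusion counts: (true positives, false positives)
folds = [(9, 1),   # fold 1: precision 0.90, 10 positive predictions
         (1, 4)]   # fold 2: precision 0.20,  5 positive predictions

# mean of per-fold precisions (what averaging cross_val_score does)
per_fold = np.mean([tp / (tp + fp) for tp, fp in folds])   # 0.55

# precision of the pooled predictions (a metric applied to the
# output of cross_val_predict)
tp = sum(t for t, _ in folds)
fp = sum(f for _, f in folds)
pooled = tp / (tp + fp)                                     # 10/15 ~ 0.67

print(per_fold, pooled)

Because the denominator (the number of positive predictions) differs across
folds, the two quantities disagree even with identical test-set sizes.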

On Wed, 3 Apr 2019 at 22:01, Boris Hollas wrote:
>
> I use
>
> sum((cross_val_predict(model, X, y) - y)**2) / len(y)(*)
>
> to evaluate the performance of a model. This conforms with Murphy: Machine 
> Learning, section 6.5.3, and Hastie et al: The Elements of Statistical 
> Learning,  eq. 7.48. However, according to the documentation of 
> cross_val_predict, "it is not appropriate to pass these predictions into an 
> evaluation metric". While it is obvious that cross_val_predict is different 
> from cross_val_score, I don't see what should be wrong with (*).
>
> Also, the explanation that "cross_val_predict simply returns the labels (or 
> probabilities)" is unclear, if not wrong. As I understand it, this function 
> returns estimates and no labels or probabilities.
>
> Regards, Boris
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Why is cross_val_predict discouraged?

2019-04-03 Thread Andreas Mueller



On 4/3/19 7:59 AM, Joel Nothman wrote:

The equations in Murphy and Hastie very clearly assume a metric
decomposable over samples (a loss function). Several popular metrics
are not.

For a metric like MSE it will be almost identical assuming the test
sets have almost the same size. For something like Recall
(sensitivity) it will be almost identical assuming similar test set
sizes *and* stratification. For something like precision whose
denominator is determined by the biases of the learnt classifier on
the test dataset, you can't say the same. For something like ROC AUC
score, relying on some decision function that may not be equivalently
calibrated across splits, evaluating in this way is almost
meaningless.


In theory. Not sure how it holds up in practice.

I didn't get the point about precision.

But yes, we should add to the docs that this is a questionable thing to do, 
in particular for losses that don't decompose.


If the loss decomposes, the result might be different b/c of different 
test set sizes, but I'm not sure if they are "worse" in some way?


___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Andreas Mueller
I think what was not clear from the question is that there are actually 
quite different kinds of plotting functions, and many of them are tied 
to existing code.


Right now we have some that are specific to trees (plot_tree) and to 
gradient boosting (plot_partial_dependence).


I think we want more general functions, and plot_partial_dependence has 
been extended to general estimators.


However, the plotting functions might be generic wrt the estimator, but 
relate to a specific function, say plotting results of GridSearchCV.
Then one might argue that having the plotting function close to 
GridSearchCV might make sense.
Similarly for plotting partial dependence plots and feature importances, 
it might be a bit strange to have the plotting functions not next to the 
functions that compute these.
Another question is whether the plotting functions should also "do the 
work" in some cases:
Do we want plot_partial_dependence also to compute the partial 
dependence? (I would argue yes, but either way the result is a bit strange.)
In that case you have somewhat of the same functionality in two 
different modules, unless you also put the "compute partial dependence" 
function in the plotting module as well, which is a bit strange.

Maybe we could inform this discussion by listing candidate plotting 
functions, and also considering whether they "do the work" and where the 
"work" function is.


Other examples are plotting the confusion matrix, which probably should 
also compute the confusion matrix (it's fast and so that would be 
convenient), and so it would "duplicate" functionality from the metrics 
module.
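
For illustration, a hedged sketch of what a confusion-matrix helper that 
"does the work" might look like (the name and signature here are 
hypothetical, not an existing scikit-learn API):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(estimator, X, y, ax=None):
    """Hypothetical helper: compute the confusion matrix and plot it."""
    cm = confusion_matrix(y, estimator.predict(X))
    ax = ax or plt.gca()
    im = ax.imshow(cm, cmap="Blues")
    for (i, j), count in np.ndenumerate(cm):
        ax.text(j, i, count, ha="center", va="center")
    ax.set_xlabel("predicted label")
    ax.set_ylabel("true label")
    plt.colorbar(im, ax=ax)
    return ax

The metrics-module computation and the plotting then live in one call, which 
is exactly the duplication being discussed.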


Plotting learning curves and validation curves should probably not do 
the work as it's pretty involved, and so someone would need to import 
the learning and validation curves from model selection, and then the 
plotting functions from a plotting module.


Calibrations curves and P/R curves and roc curves are also pretty fast 
to compute (and passing around the arguments is somewhat error prone) so 
I would say the plotting functions for these should do the work as well.


Anyway, you can see that many plotting functions are actually associated 
with functions in existing modules and the interactions are a bit unclear.


The only plotting functions I haven't mentioned so far that I thought 
about in the past are "2d scatter" and "plot decision function". These 
would be kind of generic, but mostly used in the examples.
Though having a discrete 2d scatter function would be pretty nice 
(plt.scatter doesn't allow legends and makes it hard to use qualitative 
color maps).



I think I would vote for option (1), "sklearn.plot.plot_zzz" but the 
case is not really that clear.


Cheers,

Andy

On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote:

+1 for options 1 and +0.5 for 3. Do we anticipate that many plotting
functions will be added? If it's just a dozen or less, putting them all
into a single namespace sklearn.plot might be easier.

This also would avoid discussion about where to put some generic
plotting functions (e.g.
https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479).

Roman

On 03/04/2019 12:06, Trevor Stephens wrote:

I think #1 if any of these... Plotting functions should hopefully be as
general as possible, so tagging with a specific type of estimator will,
in some scikit-learn utopia, be unnecessary.

If a general plotter is built, where does it live in other
estimator-specific namespace options? Feels awkward to put it under
every estimator's namespace.

Then again, there might be a #4 where there is no plot module and
plotting classes live under groups of utilities like introspection,
cross-validation or something?...

On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe <ahow...@gmail.com> wrote:

 My preference would be for (1). I don't think the sub-namespace in
 (2) is necessary, and don't like (3), as I would prefer the plotting
 functions to be all in the same namespace sklearn.plot.

 Andrew

 <~~~>
 J. Andrew Howe, PhD
 LinkedIn Profile 
 ResearchGate Profile 
 Open Researcher and Contributor ID (ORCID)
 
 Github Profile 
 Personal Website 
 I live to learn, so I can learn to live. - me
 <~~~>


 On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin <qinhanmin2...@sina.com> wrote:

 See https://github.com/scikit-learn/scikit-learn/issues/13448

 We've introduced several plotting functions (e.g., plot_tree and
 plot_partial_dependence) and will introduce more (e.g.,
 plot_decision_boundary) in the future. Consequently, we need to
 decide where to put these functions. Currently, there're 3
 proposals:

 (1) sklearn.plot.

Re: [scikit-learn] Why is cross_val_predict discouraged?

2019-04-03 Thread Gael Varoquaux
On Wed, Apr 03, 2019 at 08:54:51AM -0400, Andreas Mueller wrote:
> If the loss decomposes, the result might be different b/c of different test
> set sizes, but I'm not sure if they are "worse" in some way?

Mathematically, a cross-validation estimates a double expectation: one
expectation on the model (ie the train data), and another on the test
data (see for instance eq 3 in
https://europepmc.org/articles/pmc5441396, sorry for the self citation,
this is seldom discussed in the literature).
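
In symbols, the quantity being estimated is roughly (a sketch of the double
expectation, not a quotation from the paper):

\mathbb{E}_{D_{\mathrm{train}}}\Big[\; \mathbb{E}_{(x,\,y)}\big[\, L\big(y,\ \hat{f}_{D_{\mathrm{train}}}(x)\big) \,\big] \;\Big]

where averaging the loss inside a fold estimates the inner expectation and
averaging across folds estimates the outer one.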

The correct way to compute this double expectation is by averaging first
inside the fold and second across the folds. Other ways of computing
errors estimate other quantities, that are harder to study mathematically
and not comparable to objects studied in the literature.

Another problem with cross_val_predict is that some people use metrics
like correlation (which is a terrible metric and does not decompose
across folds). It will then pick up things like correlations across
folds.

All these problems are made worse when data are not iid, and hence folds
risk not being iid.

G
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Joel Nothman
With option 1, sklearn.plot is likely to import large chunks of the
library (particularly, but not exclusively, if the plotting function
"does the work" as Andy suggests). This is under the assumption that
one plot function will want to import trees, another GPs, etc. Unless
we move to lazy imports, that would be against the current convention
that importing sklearn is fairly minimal.
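
For reference, a sketch of what a lazy-loading sklearn.plot module could look
like, using PEP 562 module-level __getattr__ (Python 3.7+); the mapping of
names to submodules is only an assumption for illustration:

# hypothetical sklearn/plot/__init__.py
import importlib

_LAZY = {
    "plot_tree": "sklearn.tree",
    "plot_partial_dependence": "sklearn.inspection",  # illustrative locations
}

def __getattr__(name):
    # import the heavy submodule only when its plotting function is first used
    if name in _LAZY:
        return getattr(importlib.import_module(_LAZY[name]), name)
    raise AttributeError("module 'sklearn.plot' has no attribute %r" % name)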

I do like Andy's idea of framing this discussion more clearly around
likely candidates.

On Thu, 4 Apr 2019 at 00:10, Andreas Mueller  wrote:
>
> I think what was not clear from the question is that there is actually
> quite different kinds of plotting functions, and many of these are tied
> to existing code.
>
> Right now we have some that are specific to trees (plot_tree) and to
> gradient boosting (plot_partial_dependence).
>
> I think we want more general functions, and plot_partial_dependence has
> been extended to general estimators.
>
> However, the plotting functions might be generic wrt the estimator, but
> relate to a specific function, say plotting results of GridSearchCV.
> Then one might argue that having the plotting function close to
> GridSearchCV might make sense.
> Similarly for plotting partial dependence plots and feature importances,
> it might be a bit strange to have the plotting functions not next to the
> functions that compute these.
> Another question would be is whether the plotting functions also "do the
> work" in some cases:
> Do we want plot_partial_dependence also to compute the partial
> dependence? (I would argue yes but either way the result is a bit strange).
> In that case you have somewhat of the same functionality in two
> different modules, unless you also put the "compute partial dependence"
> function in the plotting module as well,
> which is a bit strange.
>
> Maybe we could inform this discussion by listing candidate plotting
> functions, and also considering whether they "do the work" and where the
> "work" function is.
>
> Other examples are plotting the confusion matrix, which probably should
> also compute the confusion matrix (it's fast and so that would be
> convenient), and so it would "duplicate" functionality from the metrics
> module.
>
> Plotting learning curves and validation curves should probably not do
> the work as it's pretty involved, and so someone would need to import
> the learning and validation curves from model selection, and then the
> plotting functions from a plotting module.
>
> Calibrations curves and P/R curves and roc curves are also pretty fast
> to compute (and passing around the arguments is somewhat error prone) so
> I would say the plotting functions for these should do the work as well.
>
> Anyway, you can see that many plotting functions are actually associated
> with functions in existing modules and the interactions are a bit unclear.
>
> The only plotting functions I haven't mentioned so far that I thought
> about in the past are "2d scatter" and "plot decision function". These
> would be kind of generic, but mostly used in the examples.
> Though having a discrete 2d scatter function would be pretty nice
> (plt.scatter doesn't allow legends and makes it hard to use qualitative
> color maps).
>
>
> I think I would vote for option (1), "sklearn.plot.plot_zzz" but the
> case is not really that clear.
>
> Cheers,
>
> Andy
>
> On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote:
> > +1 for options 1 and +0.5 for 3. Do we anticipate that many plotting
> > functions will be added? If it's just a dozen or less, putting them all
> > into a single namespace sklearn.plot might be easier.
> >
> > This also would avoid discussion about where to put some generic
> > plotting functions (e.g.
> > https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479).
> >
> > Roman
> >
> > On 03/04/2019 12:06, Trevor Stephens wrote:
> >> I think #1 if any of these... Plotting functions should hopefully be as
> >> general as possible, so tagging with a specific type of estimator will,
> >> in some scikit-learn utopia, be unnecessary.
> >>
> >> If a general plotter is built, where does it live in other
> >> estimator-specific namespace options? Feels awkward to put it under
> >> every estimator's namespace.
> >>
> >> Then again, there might be a #4 where there is no plot module and
> >> plotting classes live under groups of utilities like introspection,
> >> cross-validation or something?...
> >>
> >> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe wrote:
> >>
> >>  My preference would be for (1). I don't think the sub-namespace in
> >>  (2) is necessary, and don't like (3), as I would prefer the plotting
> >>  functions to be all in the same namespace sklearn.plot.
> >>
> >>  Andrew
> >>
> >>  <~~~>
> >>  J. Andrew Howe, PhD
> >>  LinkedIn Profile 
> >>  ResearchGate Profile 
> >> 

Re: [scikit-learn] Why is cross_val_predict discouraged?

2019-04-03 Thread Boris Hollas

On 03.04.19 at 13:59, Joel Nothman wrote:

The equations in Murphy and Hastie very clearly assume a metric
decomposable over samples (a loss function). Several popular metrics
are not.

For a metric like MSE it will be almost identical assuming the test
sets have almost the same size.
What will be almost identical to what? I suppose you mean that (*) is 
consistent with the scores of the models in the folds (i.e., the result of 
cross_val_score) if the loss function is (x-y)².

For something like Recall
(sensitivity) it will be almost identical assuming similar test set
sizes*and*  stratification. For something like precision whose
denominator is determined by the biases of the learnt classifier on
the test dataset, you can't say the same.
I can't follow here. If the loss function is L(x,y) = 1_{x = y}, then 
(*) gives the accuracy.

  For something like ROC AUC
score, relying on some decision function that may not be equivalently
calibrated across splits, evaluating in this way is almost
meaningless.


In any case, I still don't see what may be wrong with (*). Otherwise, 
the warning in the documentation about the use of cross_val_predict 
should be removed or revised.


On the other hand, an example in the documentation uses 
cross_val_scores.mean(). This is debatable since this computes a mean of 
means.




On Wed, 3 Apr 2019 at 22:01, Boris Hollas wrote:

I use

sum((cross_val_predict(model, X, y) - y)**2) / len(y)(*)

to evaluate the performance of a model. This conforms with Murphy: Machine Learning, 
section 6.5.3, and Hastie et al: The Elements of Statistical Learning,  eq. 7.48. 
However, according to the documentation of cross_val_predict, "it is not appropriate 
to pass these predictions into an evaluation metric". While it is obvious that 
cross_val_predict is different from cross_val_score, I don't see what should be wrong 
with (*).

Also, the explanation that "cross_val_predict simply returns the labels (or 
probabilities)" is unclear, if not wrong. As I understand it, this function returns 
estimates and no labels or probabilities.

Regards, Boris


___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] How to answer questions from big documents?

2019-04-03 Thread Rodrigo Rosenfeld Rosas
Hi everyone, this is my first post here :)

About two weeks ago, due to low demand in my project, I was assigned a
completely unusual request: to automatically extract answers from documents
based on machine learning. I've never read anything about ML, AI or NLP
before, so I've basically been doing just that for the past two weeks.

When it comes to ML, most book recommendations and tutorials I've found so
far use the Python language and tools, so I took the first week to learn
about Python, NumPy, scikit-learn, pandas, Matplotlib and so on. Then, this
week I started reading about NLP itself, after spending a few days reading
about generic ML algorithms.

So far, I've basically read about Bag of Words, using TF-IDF (or simple
term counts) to convert words to numeric representations, and a few
methods such as Gaussian and multinomial Naive Bayes to train models and
predict values. The materials also mention the importance of the usual
pre-processing steps such as lemmatization and the like. However,
basically all examples assume that a given text can be classified into one
of a set of predefined topics, as in the sentiment analysis use case. I'm
afraid this doesn't represent my use case, so I'd like to describe it here
so that you could help me identify which methods I should be looking at.

We have a system with thousands of transactions/deals entered manually by
a specialized team. Each deal has a set of documents (typically a dozen per
deal) and some documents can have hundreds of pages. The input team has to
extract about a thousand fields from those documents for any particular
deal. So, in our database we have all their data, and we typically also
know the document-specific snippets associated with each field value.

So, my task is, given a new document and deal, and based on the previous
answers, to fill in as many fields as possible by automatically finding the
corresponding snippets in the new documents. I'm not sure how I should
approach this problem.

For example, I could consider each sentence of the document as a separate
document to be analyzed and compared to the snippets I already have for the
matching data. However, I can't be sure whether some of those sentences
would actually answer the question. For example, maybe there are 6
occurrences in the documents that would answer a particular question/field,
but maybe the inputters only identified 2 or 3 of them.

Also, for any given sentence, it could tell us that the answer for a given
field is A or B, or there could be absolutely no association between the
sentence and the field/question, as would be the case for most sentences. I
know that scikit-learn provides the predict_proba method, so I could try to
consider a sentence relevant only if the probability of it answering the
question is above 80%, for example, but based on a few quick tests I've made
with a few sentences and words, I suspect this won't work very well. Also,
it could be quite slow to treat each sentence of a document that is hundreds
of pages long as a separate document to be analyzed, so I'm not sure whether
there are better methods to handle this use case.
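
A minimal sketch of the sentence-classification idea described above (the
snippets, labels, 0.8 threshold, and choice of LogisticRegression are all
illustrative assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# training data: snippets already identified by the input team, labelled
# with the field value they support
snippets = [
    "merger of Merger LLC with and into the Company",
    "purchase of substantially all of the assets of the Seller",
    "purchase of all outstanding shares of capital stock",
]
labels = ["Public Target Merger", "Asset Purchase", "Stock or Equity Purchase"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(snippets, labels)

# at prediction time, score every sentence of the new document and keep
# only those the model is confident about
new_sentences = ["Buyer intends to effect a merger of Merger Sub into the Company"]
for sentence, proba in zip(new_sentences, clf.predict_proba(new_sentences)):
    best = proba.argmax()
    if proba[best] >= 0.8:  # confidence threshold; 0.8 is arbitrary
        print(sentence, "->", clf.classes_[best], proba[best])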

Some of the fields are free-text ones, like company and firm names, for
example, and I suspect those would be the hardest to answer, so I'm trying
to start with the multiple-choice ones, which have a finite set of possible
values.

How would you advise me to look at this problem? Are there any algorithms
you'd recommend me to study for solving this particular problem?

Here are some sample data so that you could get a better understanding of
the problem:

One of the fields is called "Deal Structure" and it could have the
following values: "Asset Purchase", "Stock or Equity Purchase" or "Public
Target Merger" (there are a few others, but this gives you an idea).

So, here are some sentences highlighted for Public Target Merger deals
(those documents come from Edgar Filings public database which are freely
available for US deals):

deal 1 / doc 1: "AGREEMENT AND PLAN OF MERGER, dated as of March 14, 2018
(this “Agreement”), by and among HarborOne Bancorp, Inc., a Massachusetts
corporation (“Buyer”), Massachusetts Acquisitions, LLC, a Maryland limited
liability company of which Buyer is the sole member (“Merger LLC”), and
Coastway Bancorp, Inc., a Maryland corporation (the “Company”)."

"WHEREAS, Buyer, Merger LLC, and the Company intend to effect a merger (the
“Merger”) of Merger LLC with and into the Company in accordance with this
Agreement and the Maryland General Corporation Law (the “MGCL”) and the
Maryland Limited Liability Company Act, as amended (the “MLLCA”), with the
Company to be the surviving entity in the Merger. The Merger will be
followed immediately by a merger of the Company with and into Buyer (the
“Upstream Merger”), with the Buyer to be the surviving entity in the
Upstream Merger. It is intended that the Merger be mutually interdependent
with and a condition precedent to the Upstream Merg

Re: [scikit-learn] Why is cross_val_predict discouraged?

2019-04-03 Thread Joel Nothman
Pull requests improving the documentation are always welcome. At a minimum,
users need to know that these compute different things.

Accuracy is not precision. Precision is the number of true positives
divided by the number of true positives plus false positives. It therefore
cannot be decomposed as a sample-wise measure without knowing the rate of
positive predictions. This rate is dependent on the training data and
algorithm.

I'm not a statistician and cannot speak to issues of computing a mean of
means, but if what we are trying to estimate is the performance on a sample
of size approximately n_t of a model trained on a sample of size
approximately N - n_t, then I wouldn't have thought taking a mean over such
measures (with whatever score function) to be unreasonable.

On Thu., 4 Apr. 2019, 3:51 am Boris Hollas, <
hol...@informatik.htw-dresden.de> wrote:

> On 03.04.19 at 13:59, Joel Nothman wrote:
>
> The equations in Murphy and Hastie very clearly assume a metric
> decomposable over samples (a loss function). Several popular metrics
> are not.
>
> For a metric like MSE it will be almost identical assuming the test
> sets have almost the same size.
>
> What will be almost identical to what? I suppose you mean that (*) is
> consistent with the scores of the models in the fold (ie, the result of
> cross_val_score) if the loss function is (x-y)².
>
> For something like Recall
> (sensitivity) it will be almost identical assuming similar test set
> sizes **and** stratification. For something like precision whose
> denominator is determined by the biases of the learnt classifier on
> the test dataset, you can't say the same.
>
> I can't follow here. If the loss function is L(x,y) = 1_{x = y}, then (*)
> gives the accuracy.
>
>  For something like ROC AUC
> score, relying on some decision function that may not be equivalently
> calibrated across splits, evaluating in this way is almost
> meaningless.
>
> In any case, I still don't see what may be wrong with (*). Otherwise, the
> warning in the documentation about the use of cross_val_predict should be
> removed or revised.
>
> On the other hand, an example in the documentation uses
> cross_val_scores.mean(). This is debatable since this computes a mean of
> means.
>
>
>
> On Wed, 3 Apr 2019 at 22:01, Boris Hollas wrote:
>
> I use
>
> sum((cross_val_predict(model, X, y) - y)**2) / len(y)(*)
>
> to evaluate the performance of a model. This conforms with Murphy: Machine 
> Learning, section 6.5.3, and Hastie et al: The Elements of Statistical 
> Learning,  eq. 7.48. However, according to the documentation of 
> cross_val_predict, "it is not appropriate to pass these predictions into an 
> evaluation metric". While it is obvious that cross_val_predict is different 
> from cross_val_score, I don't see what should be wrong with (*).
>
> Also, the explanation that "cross_val_predict simply returns the labels (or 
> probabilities)" is unclear, if not wrong. As I understand it, this function 
> returns estimates and no labels or probabilities.
>
> Regards, Boris
>
>
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Eric Ma
This is not a strongly-held suggestion - but what about adopting
YellowBrick as the plotting API for sklearn? Not sure how exactly the
interaction would work - could be PRs to their library, or ask them to
integrate into sklearn, or do a lock-step dance with versions but maintain
separate teams? (I know it raises more questions than answers, but wanted
to put it out there.)

On Wed, Apr 3, 2019 at 4:07 PM Joel Nothman  wrote:

> With option 1, sklearn.plot is likely to import large chunks of the
> library (particularly, but not exclusively, if the plotting function
> "does the work" as Andy suggests). This is under the assumption that
> one plot function will want to import trees, another GPs, etc. Unless
> we move to lazy imports, that would be against the current convention
> that importing sklearn is fairly minimal.
>
> I do like Andy's idea of framing this discussion more clearly around
> likely candidates.
>
> On Thu, 4 Apr 2019 at 00:10, Andreas Mueller  wrote:
> >
> > I think what was not clear from the question is that there is actually
> > quite different kinds of plotting functions, and many of these are tied
> > to existing code.
> >
> > Right now we have some that are specific to trees (plot_tree) and to
> > gradient boosting (plot_partial_dependence).
> >
> > I think we want more general functions, and plot_partial_dependence has
> > been extended to general estimators.
> >
> > However, the plotting functions might be generic wrt the estimator, but
> > relate to a specific function, say plotting results of GridSearchCV.
> > Then one might argue that having the plotting function close to
> > GridSearchCV might make sense.
> > Similarly for plotting partial dependence plots and feature importances,
> > it might be a bit strange to have the plotting functions not next to the
> > functions that compute these.
> > Another question would be is whether the plotting functions also "do the
> > work" in some cases:
> > Do we want plot_partial_dependence also to compute the partial
> > dependence? (I would argue yes but either way the result is a bit
> strange).
> > In that case you have somewhat of the same functionality in two
> > different modules, unless you also put the "compute partial dependence"
> > function in the plotting module as well,
> > which is a bit strange.
> >
> > Maybe we could inform this discussion by listing candidate plotting
> > functions, and also considering whether they "do the work" and where the
> > "work" function is.
> >
> > Other examples are plotting the confusion matrix, which probably should
> > also compute the confusion matrix (it's fast and so that would be
> > convenient), and so it would "duplicate" functionality from the metrics
> > module.
> >
> > Plotting learning curves and validation curves should probably not do
> > the work as it's pretty involved, and so someone would need to import
> > the learning and validation curves from model selection, and then the
> > plotting functions from a plotting module.
> >
> > Calibrations curves and P/R curves and roc curves are also pretty fast
> > to compute (and passing around the arguments is somewhat error prone) so
> > I would say the plotting functions for these should do the work as well.
> >
> > Anyway, you can see that many plotting functions are actually associated
> > with functions in existing modules and the interactions are a bit
> unclear.
> >
> > The only plotting functions I haven't mentioned so far that I thought
> > about in the past are "2d scatter" and "plot decision function". These
> > would be kind of generic, but mostly used in the examples.
> > Though having a discrete 2d scatter function would be pretty nice
> > (plt.scatter doesn't allow legends and makes it hard to use qualitative
> > color maps).
> >
> >
> > I think I would vote for option (1), "sklearn.plot.plot_zzz" but the
> > case is not really that clear.
> >
> > Cheers,
> >
> > Andy
> >
> > On 4/3/19 7:35 AM, Roman Yurchak via scikit-learn wrote:
> > > +1 for options 1 and +0.5 for 3. Do we anticipate that many plotting
> > > functions will be added? If it's just a dozen or less, putting them all
> > > into a single namespace sklearn.plot might be easier.
> > >
> > > This also would avoid discussion about where to put some generic
> > > plotting functions (e.g.
> > >
> https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479
> ).
> > >
> > > Roman
> > >
> > > On 03/04/2019 12:06, Trevor Stephens wrote:
> > >> I think #1 if any of these... Plotting functions should hopefully be
> as
> > >> general as possible, so tagging with a specific type of estimator
> will,
> > >> in some scikit-learn utopia, be unnecessary.
> > >>
> > >> If a general plotter is built, where does it live in other
> > >> estimator-specific namespace options? Feels awkward to put it under
> > >> every estimator's namespace.
> > >>
> > >> Then again, there might be a #4 where there is no plot module and
> > >> plotting classes live under groups 

[scikit-learn] New core developers: thomasjpfan and nicolashug

2019-04-03 Thread Joel Nothman
The core developers of Scikit-learn have recently voted to welcome
Thomas Fan and Nicolas Hug to the team, in recognition of their
efforts and trustworthiness as contributors. Both happen to be working
with Andy Mueller at Columbia University at the moment.
Congratulations and thanks to them both!
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] New core developers: thomasjpfan and nicolashug

2019-04-03 Thread Hanmin Qin
Congratulations and welcome to the team!
Hanmin Qin
- Original Message -
From: Joel Nothman 
To: Scikit-learn user and developer mailing list 
Subject: [scikit-learn] New core developers: thomasjpfan and nicolashug
Date: 2019-04-04 07:52


The core developers of Scikit-learn have recently voted to welcome
Thomas Fan and Nicolas Hug to the team, in recognition of their
efforts and trustworthiness as contributors. Both happen to be working
with Andy Mueller at Columbia University at the moment.
Congratulations and thanks to them both!
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] New core developers: thomasjpfan and nicolashug

2019-04-03 Thread Andreas Mueller
Congratulations guys! Great work! Looking forward to much more! Proud to
have you on the team!

Now we in NYC can approve our own pull requests ;)



Sent from phone. Please excuse spelling and brevity.

On Wed, Apr 3, 2019, 21:08 Hanmin Qin  wrote:

> Congratulations and welcome to the team!
>
> Hanmin Qin
>
> - Original Message -
> From: Joel Nothman 
> To: Scikit-learn user and developer mailing list 
> Subject: [scikit-learn] New core developers: thomasjpfan and nicolashug
> Date: 2019-04-04 07:52
>
>
> The core developers of Scikit-learn have recently voted to welcome
> Thomas Fan and Nicolas Hug to the team, in recognition of their
> efforts and trustworthiness as contributors. Both happen to be working
> with Andy Mueller at Columbia University at the moment.
> Congratulations and thanks to them both!
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn