Re: [scikit-learn] What is the FeatureAgglomeration algorithm?

2018-07-25 Thread Raphael C
Is it expected that all three linkage options should give the same result
in my toy example?
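
For what it's worth, a quick sketch (untested here) on a larger random
matrix should show whether the linkages can diverge at all:

import numpy as np
from sklearn.cluster import FeatureAgglomeration

rng = np.random.RandomState(0)
X = rng.randn(20, 10)  # 20 samples, 10 features
for S in ['ward', 'average', 'complete']:
    FA = FeatureAgglomeration(n_clusters=3, linkage=S).fit(X)
    print(S, FA.labels_)  # which cluster each of the 10 features lands in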

Raphael

On Thu, 26 Jul 2018 at 06:20 Gael Varoquaux wrote:

> FeatureAgglomeration uses the Ward, complete-linkage, or average-linkage
> algorithm, depending on the choice of "linkage". These are well
> documented in the literature and on Wikipedia.
>
> Gaël
>
> On Thu, Jul 26, 2018 at 06:05:21AM +0100, Raphael C wrote:
> > Hi,
>
> > I am trying to work out what, in precise mathematical terms,
> > [FeatureAgglomeration][1] does and would love some help. Here is some
> example
> > code:
>
>
> > import numpy as np
> > from sklearn.cluster import FeatureAgglomeration
> > for S in ['ward', 'average', 'complete']:
> >     FA = FeatureAgglomeration(linkage=S)
> >     print(FA.fit_transform(np.array([[-50, 6, 6, 7], [0, 1, 2, 3]])))
>
> > This outputs:
>
> >
>
> > [[  6. -50.]
> >  [  2.   0.]]
> > [[  6. -50.]
> >  [  2.   0.]]
> > [[  6. -50.]
> >  [  2.   0.]]
>
> > Is it possible to say mathematically how these values have been computed?
>
> > Also, what exactly does linkage do and why doesn't it seem to make any
> > difference which option you choose?
>
> > Raphael
>
>
> >   [1]: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.FeatureAgglomeration.html
>
> > PS I also asked at
> > https://stackoverflow.com/questions/51526616/what-does-featureagglomeration-compute-mathematically-and-when-does-linkage-make
>
>
>
>
> --
> Gael Varoquaux
> Senior Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
> Phone:  ++ 33-1-69-08-79-68
> http://gael-varoquaux.info    http://twitter.com/GaelVaroquaux
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] What is the FeatureAgglomeration algorithm?

2018-07-25 Thread Raphael C
Hi,

I am trying to work out what, in precise mathematical terms,
[FeatureAgglomeration][1] does and would love some help. Here is some
example code:


import numpy as np
from sklearn.cluster import FeatureAgglomeration
for S in ['ward', 'average', 'complete']:
    FA = FeatureAgglomeration(linkage=S)
    print(FA.fit_transform(np.array([[-50, 6, 6, 7], [0, 1, 2, 3]])))

This outputs:



[[  6. -50.]
 [  2.   0.]]
[[  6. -50.]
 [  2.   0.]]
[[  6. -50.]
 [  2.   0.]]

Is it possible to say mathematically how these values have been computed?

Also, what exactly does linkage do and why doesn't it seem to make any
difference which option you choose?
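
In case it helps: my current guess, from the pooling_func parameter
(np.mean by default), is that transform simply pools the original
features within each cluster. A sketch of what I mean:

import numpy as np
from sklearn.cluster import FeatureAgglomeration

X = np.array([[-50., 6., 6., 7.], [0., 1., 2., 3.]])
FA = FeatureAgglomeration(n_clusters=2).fit(X)
print(FA.labels_)  # cluster assignment of each of the 4 features
# pooling each cluster of features with np.mean by hand:
pooled = np.column_stack([X[:, FA.labels_ == k].mean(axis=1)
                          for k in range(2)])
print(pooled)
print(FA.transform(X))  # I would expect this to match the manual pooling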

Raphael


  [1]: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.FeatureAgglomeration.html

PS I also asked at
https://stackoverflow.com/questions/51526616/what-does-featureagglomeration-compute-mathematically-and-when-does-linkage-make
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Finding a single cluster in 1d data

2018-04-14 Thread Raphael C
Thank you very much!  I didn't know about jenkspy.
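
For the archive, a minimal sketch of both suggestions, reusing the
`points` list from my simulation below (the jenkspy argument name is
taken from Pedro's message; I haven't verified it against the current
release):

import numpy as np
import jenkspy
from sklearn.neighbors import KernelDensity

# Jenks natural breaks: with 3 classes, the middle interval should
# bracket the dense region
breaks = jenkspy.jenks_breaks(points, nb_class=3)
dense = [p for p in points if breaks[1] <= p <= breaks[2]]

# density-estimator route: keep the points whose estimated density is high
X = np.asarray(points)[:, None]
kde = KernelDensity(bandwidth=5.0).fit(X)  # bandwidth will need tuning
log_dens = kde.score_samples(X)
dense_kde = X[log_dens > np.median(log_dens)].ravel()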

Raphael

On 13 April 2018 at 02:19, Pedro Pazzini <pedropazz...@gmail.com> wrote:
> Hi Raphael.
>
> An option to highlight a dense region in your vector is to use a density
> estimator (http://scikit-learn.org/stable/modules/density.html).
>
> But I think that the python module jenkspy
> (https://pypi.python.org/pypi/jenkspy and https://github.com/mthh/jenkspy)
> can also help you. The method finds the natural breaks of data in 1d
> (https://en.wikipedia.org/wiki/Jenks_natural_breaks_optimization). I think
> that if you find a good value for the 'nb_class' parameter you can separate
> the dense region of your data from the sparse one.
>
> K-means is a generalization of Jenks break optimization for multivariate
> data, so you could perhaps use the K-means module of scikit-learn for that
> as well. That said, I personally find the jenkspy module more
> straightforward for this.
>
> I hope it helps.
>
> Pedro Pazzini
>
> 2018-04-12 16:22 GMT-03:00 Raphael C <drr...@gmail.com>:
>>
>> I have a set of points in 1d represented by a list X of floating point
>> numbers.  The list has one dense section and the rest is sparse and I
>> want to find the dense part. I can't release the actual data but here
>> is a simulation:
>>
>> import random
>> import matplotlib.pyplot as plt
>>
>> N = 100
>>
>> start = 0
>> points = []
>> rate = 0.1
>> for i in range(N):
>>     points.append(start)
>>     start = start + random.expovariate(rate)
>> rate = 10
>> for i in range(N*10):
>>     points.append(start)
>>     start = start + random.expovariate(rate)
>> rate = 0.1
>> for i in range(N):
>>     points.append(start)
>>     start = start + random.expovariate(rate)
>> plt.hist(points, bins=100)
>> plt.show()
>>
>> I would like to use scikit learn to find the dense region. This feels
>> a little like outlier detection or the task of finding one cluster
>> with noise.
>>
>> Is there a suitable method in scikit learn for this task?
>>
>> Raphael
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Parallel MLP version

2017-12-20 Thread Raphael C
I believe TensorFlow will do what you want.
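
For example, a minimal MLP sketch in Keras/TensorFlow (the shapes here
are placeholders for your data; TensorFlow parallelises across CPU cores
by default and can use a GPU if one is available):

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(100, activation='relu', input_shape=(20,)),
    tf.keras.layers.Dense(2, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# model.fit(X_train, y_train, epochs=10)  # X_train, y_train: your data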

Raphael

On 20 Dec 2017 16:43, "Luigi Lomasto" wrote:

> Hi all,
>
> I have a computational problem training my neural network, so can you
> tell me whether any parallel version of the MLP library exists?
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Unclear help file about sklearn.decomposition.pca

2017-10-17 Thread Raphael C
How about including the scaling that people might want to use in the
User Guide examples?
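
Something along these lines, say (a sketch using the existing API; PCA
centers the data but does not scale it, so the scaler is the step people
tend to miss):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# standardize each feature, then project; without the scaler, features
# with large variance would dominate the components
pca_on_scaled = make_pipeline(StandardScaler(), PCA(n_components=2))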

Raphael

On 17 October 2017 at 16:40, Andreas Mueller  wrote:
> In general scikit-learn avoids automatic preprocessing.
> That's a convention to give the user more control and decrease surprising
> behavior (ostensibly).
> So scikit-learn will usually do what the algorithm is supposed to do, and
> nothing more.
>
> I'm not sure what the best way to document this is, as this has come up with
> different models.
> For example the R wrapper of libsvm does automatic scaling, while we apply
> the SVM to the data as given.
>
> We could add "this model does not do any automatic preprocessing" to all
> docstrings, but that seems
> a bit redundant. We could add it to
> https://github.com/scikit-learn/scikit-learn/pull/9517, but
> that is probably not where you would have looked.
>
> Other suggestions welcome.
>
>
> On 10/16/2017 03:29 PM, Ismael Lemhadri wrote:
>
> Thank you all for your feedback.
> The initial problem I came with wasn't the definition of PCA but what the
> sklearn method does. In practice I would always make sure the data is both
> centered and scaled before performing PCA. This is the recommended method
> because without scaling, the biggest direction could wrongly seem to explain
> a huge fraction of the variance.
> So my point was simply to clarify in the help file and the user guide what
> the PCA class does precisely, to leave no ambiguity for the reader. Moving
> forward, I have now submitted a pull request on github as initially
> suggested by Roman on this thread.
> Best,
> Ismael
>
> On Mon, 16 Oct 2017 at 11:49 AM,  wrote:
>>
>> Message: 1
>> From: Andreas Mueller
>> Subject: Re: [scikit-learn] 1. Re: unclear help file for
>> sklearn.decomposition.pca
>>
>>
>>
>> On 10/16/2017 02:27 PM, Ismael Lemhadri wrote:
>> > @Andreas Muller:
>> > My references do not assume centering, e.g.
>> > http://ufldl.stanford.edu/wiki/index.php/PCA
>> > any reference?
>> >
>> It kinda does but is not very clear about it:
>>
>> "This data has already been pre-processed so that each of the features
>> x_1 and x_2 have about the same mean (zero) and variance."
>>
>>
>>
>> Wikipedia is much clearer:
>>
>> "Consider a data matrix X with column-wise zero empirical mean (the sample
>> mean of each column has been shifted to zero), where each of the n rows
>> represents a different repetition of the experiment, and each of the p
>> columns gives a particular kind of feature (say, the results from a
>> particular sensor)."
>> https://en.wikipedia.org/wiki/Principal_component_analysis#Details
>>
>> I'm a bit surprised to find that ESL says "The SVD of the centered
>> matrix X is another way of expressing the principal components of the
>> variables in X",
>> so they assume scaling? They don't really have a great treatment of PCA,
>> though.
>>
>> Bishop and Murphy are pretty clear that they subtract the mean (or assume
>> zero mean) but don't standardize.
>>
>> --
>>
>> Message: 2
>> From: Oliver Tomic
>> Subject: Re: [scikit-learn] 1. Re: unclear help file for
>> sklearn.decomposition.pca
>>
>> Dear Ismael,
>>
>>
>>
>> PCA should always involve at 

Re: [scikit-learn] Truncated svd not working for complex matrices

2017-08-11 Thread Raphael C
Although the first priority should be correctness (in implementation
and documentation), and it makes sense to explicitly test for inputs
for which the code would give a wrong answer, it would be great if we
could support complex data types, especially where doing so is very
little extra work.
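
As a point of reference, numpy's own SVD handles complex input, so a
correctness check for the randomized path could be as small as this
sketch:

import numpy as np

rng = np.random.RandomState(0)
A = rng.randn(20, 10) + 1j * rng.randn(20, 10)
U, s, Vh = np.linalg.svd(A, full_matrices=False)
# reconstruction only works if the conjugates are handled correctly
assert np.allclose(A, np.dot(U * s, Vh))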

Raphael

On 11 August 2017 at 05:41, Joel Nothman  wrote:
> Should we be more explicitly forbidding complex data in most estimators, and
> perhaps allow it in a few where it is tested (particularly decomposition)?
>
> On 11 August 2017 at 01:08, André Melo wrote:
>>
>> Actually, it makes more sense to change
>>
>> B = safe_sparse_dot(Q.T, M)
>>
>> To
>> B = safe_sparse_dot(Q.T.conj(), M)
>>
>> On 10 August 2017 at 16:56, André Melo wrote:
>> > Hi Olivier,
>> >
>> > Thank you very much for your reply. I was convinced it couldn't be a
>> > fundamental mathematical issue because the singular values were coming
>> > out exactly right, so it had to be a problem with the way complex
>> > values were being handled.
>> >
>> > I decided to look at the source code and it turns out the problem is
>> > when the following transformation is applied:
>> >
>> > U = np.dot(Q, Uhat)
>> >
>> > Replacing this by
>> >
>> > U = np.dot(Q.conj(), Uhat)
>> >
>> > solves the issue! Should I report this on github?
>> >
>> > On 10 August 2017 at 16:13, Olivier Grisel wrote:
>> >> I have no idea whether the randomized SVD method is supposed to work
>> >> for
>> >> complex data or not (from a mathematical point of view). I think that
>> >> all
>> >> scikit-learn estimators assume real data (or integer data for class
>> >> labels)
>> >> and our input validation utilities will cast numeric values to float64
>> >> by
>> >> default. This might be the cause of your problem. Have a look at the
>> >> source
>> >> code to confirm. The reference to the paper can also be found in the
>> >> docstring of those functions.
>> >>
>> >> --
>> >> Olivier
>> >>
>
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] decision trees

2017-03-29 Thread Raphael C
There is https://github.com/scikit-learn/scikit-learn/pull/4899 .

It looks like it is waiting for review?
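
On Olivier's cross-validation point below, the comparison could look
something like this sketch (toy data; OneHotEncoder call as in the
then-current API; both encodings go into the same forest):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder

rng = np.random.RandomState(0)
X_int = rng.randint(0, 20, size=(500, 3))  # 3 integer-coded categorical features
y = (X_int[:, 0] % 2 == 0).astype(int)     # label driven by one feature
X_hot = OneHotEncoder(sparse=False).fit_transform(X_int)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(rf, X_int, y).mean())  # integer coding
print(cross_val_score(rf, X_hot, y).mean())  # one-hot coding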

Raphael

On 29 March 2017 at 11:50, federico vaggi  wrote:
> That's a really good point.  Do you know of any systematic studies about the
> two different encodings?
>
> Finally: wasn't there a PR for RF to accept categorical variables as inputs?
>
> On Wed, 29 Mar 2017 at 11:57, Olivier Grisel wrote:
>>
>> Integer coding will indeed make the DT assume an arbitrary ordering
>> while one-hot encoding does not force the tree model to make that
>> assumption.
>>
>> However in practice, when the depth of the trees is not too limited (or
>> if you use a large enough ensemble of trees), the model will have
>> enough flexibility to introduce as many splits as necessary to isolate
>> individual categories in the integer coding, and therefore the arbitrary
>> ordering assumption is not a problem.
>>
>> On the other hand using one-hot encoding can introduce a detrimental
>> inductive bias on random forests: random forest uses uniform random
>> feature sampling when deciding which feature to split on (e.g. pick
>> the best split out of 25% of the features selected at random).
>>
>> Let's consider the following example: assume you have a
>> heterogeneously typed dataset with 99 numeric features and 1
>> categorical feature with categorical cardinality 1000 (1000 possible
>> values for that feature):
>>
>> - the RF will have one chance in 100 to pick each feature (categorical
>> or numerical) as a candidate for the next split when using integer
>> coding,
>> - the RF will have roughly a 0.1% chance (1/1099) of picking each
>> numerical feature and roughly a 91% chance (1000/1099) of selecting a
>> candidate split on a category of the unique categorical feature when
>> using one-hot encoding.
>>
>> The inductive bias of one-hot encoding on RFs can therefore completely
>> break the feature balancing. The feature encoding will also impact the
>> inductive bias with respect to the importance of the depth of the trees,
>> even when feature splits are selected fully deterministically.
>>
>> Finally, one-hot encoding features with large categorical cardinalities
>> will be much slower than using naive integer coding.
>>
>> TL;DR: naive theoretical analysis based only on the ordering
>> assumption can be misleading. The inductive biases of each feature
>> encoding are more complex to evaluate. Use cross-validation to decide
>> which is best on your problem. Don't ignore computational
>> considerations (CPU and memory usage).
>>
>> --
>> Olivier
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Missing data and decision trees

2016-10-13 Thread Raphael C
You can simply make a new binary feature (per feature that might have a
missing value) that is 1 if the value is missing and 0 otherwise.  The RF
can then work out what to do with this information.
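
In code, the idea is just this (a pandas sketch on a toy frame):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, 6.0]})
for col in ['a', 'b']:
    df[col + '_missing'] = df[col].isnull().astype(int)  # 1 where missing
    df[col] = df[col].fillna(0)  # any constant; the indicator keeps the signal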

I don't know how this compares in practice to more sophisticated approaches.

Raphael

On Thursday, October 13, 2016, Stuart Reynolds wrote:

> I'm looking for a decision tree and RF implementation that supports
> missing data (without imputation) -- ideally in Python, Java/Scala or C++.
>
> It seems that scikit's decision tree algorithm doesn't allow this --
> which is disappointing because it's one of the few methods that should be
> able to sensibly handle problems with high amounts of missingness.
>
> Are there plans to allow missing data in scikit's decision trees?
>
> Also, is there any particular reason why missing values weren't supported
> originally (e.g. does it integrate poorly with other features)?
>
> Regards
> - Stuart
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Using logistic regression with count proportions data

2016-10-10 Thread Raphael C
On 10 October 2016 at 12:22, Sean Violante <sean.viola...@gmail.com> wrote:
> no ( but please check !)
>
> sample weights should be the counts for the respective label (0/1)
>
> [I am actually puzzled about the glm help file - using proportions loses
> how often an input data 'row' was present relative to the others - though
> you could handle this by repeating the row 'n' times]

I think we might be talking at cross purposes.

I have a matrix X where each row is a feature vector. I also have an
array y where y[i] is a real number between 0 and 1. I would like to
build a regression model that predicts the y values given the X rows.

Now each y[i] value in fact comes from simply counting the number of
positive labelled elements in a particular set (set i) and dividing by
the number of elements in that set.  So I can easily fit this into the
model given by the R package glm by replacing each y[i] value by a
pair of "Number of positives" and "Number of negatives" (this is case
2 in the docs I quoted) or using case 3 which asks for the y[i] plus
the total number of elements in set i.

I don't see how a single integer for sample_weight[i] would cover this
information but I am sure I must have misunderstood.  At best it seems
it could cover the number of positive values but this is missing half
the information.
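
The closest encoding I can think of (a sketch, where pos and neg are
hypothetical arrays holding the success/failure counts behind each row
of X) would be to duplicate each row with both labels and weight
accordingly:

import numpy as np
from sklearn.linear_model import LogisticRegression

X2 = np.vstack([X, X])
y2 = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
w2 = np.concatenate([pos, neg])  # counts play the role of sample_weight
clf = LogisticRegression().fit(X2, y2, sample_weight=w2)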

Raphael

>
> On Mon, Oct 10, 2016 at 1:15 PM, Raphael C <drr...@gmail.com> wrote:
>>
>> How do I use sample_weight for my use case?
>>
>> In my case, is "y" an array of 0s and 1s, and sample_weight then an
>> array of real numbers between 0 and 1, where I should make sure to set
>> sample_weight[i] = 0 when y[i] = 0?
>>
>> Raphael
>>
>> On 10 October 2016 at 12:08, Sean Violante <sean.viola...@gmail.com>
>> wrote:
>> > should be the sample weight function in fit
>> >
>> >
>> > http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
>> >
>> > On Mon, Oct 10, 2016 at 1:03 PM, Raphael C <drr...@gmail.com> wrote:
>> >>
>> >> I just noticed this about the glm package in R.
>> >> http://stats.stackexchange.com/a/26779/53128
>> >>
>> >> "
>> >> The glm function in R allows 3 ways to specify the formula for a
>> >> logistic regression model.
>> >>
>> >> The most common is that each row of the data frame represents a single
>> >> observation and the response variable is either 0 or 1 (or a factor
>> >> with 2 levels, or another variable with only 2 unique values).
>> >>
>> >> Another option is to use a 2 column matrix as the response variable
>> >> with the first column being the counts of 'successes' and the second
>> >> column being the counts of 'failures'.
>> >>
>> >> You can also specify the response as a proportion between 0 and 1,
>> >> then specify another column as the 'weight' that gives the total
>> >> number that the proportion is from (so a response of 0.3 and a weight
>> >> of 10 is the same as 3 'successes' and 7 'failures')."
>> >>
>> >> Either of the last two options would do for me.  Does scikit-learn
>> >> support either of these last two options?
>> >>
>> >> Raphael
>> >>
>> >> On 10 October 2016 at 11:55, Raphael C <drr...@gmail.com> wrote:
>> >> > I am trying to perform regression where my dependent variable is
>> >> > constrained to be between 0 and 1. This constraint comes from the
>> >> > fact
>> >> > that it represents a count proportion. That is counts in some
>> >> > category
>> >> > divided by a total count.
>> >> >
>> >> > In the literature it seems that one common way to tackle this is to
>> >> > use logistic regression. However, it appears that in scikit learn
>> >> > logistic regression is only available as a classifier
>> >> >
>> >> >
>> >> > (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
>> >> > ) . Is that right?
>> >> >
>> >> > Is there another way to perform regression using scikit learn where
>> >> > the dependent variable is a count proportion?
>> >> >
>> >> > Thanks for any help.
>> >> >
>> >> > Raphael
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Using logistic regression with count proportions data

2016-10-10 Thread Raphael C
How do I use sample_weight for my use case?

In my case, is "y" an array of 0s and 1s, and sample_weight then an
array of real numbers between 0 and 1, where I should make sure to set
sample_weight[i] = 0 when y[i] = 0?

Raphael

On 10 October 2016 at 12:08, Sean Violante <sean.viola...@gmail.com> wrote:
> should be the sample weight function in fit
>
> http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
>
> On Mon, Oct 10, 2016 at 1:03 PM, Raphael C <drr...@gmail.com> wrote:
>>
>> I just noticed this about the glm package in R.
>> http://stats.stackexchange.com/a/26779/53128
>>
>> "
>> The glm function in R allows 3 ways to specify the formula for a
>> logistic regression model.
>>
>> The most common is that each row of the data frame represents a single
>> observation and the response variable is either 0 or 1 (or a factor
>> with 2 levels, or another variable with only 2 unique values).
>>
>> Another option is to use a 2 column matrix as the response variable
>> with the first column being the counts of 'successes' and the second
>> column being the counts of 'failures'.
>>
>> You can also specify the response as a proportion between 0 and 1,
>> then specify another column as the 'weight' that gives the total
>> number that the proportion is from (so a response of 0.3 and a weight
>> of 10 is the same as 3 'successes' and 7 'failures')."
>>
>> Either of the last two options would do for me.  Does scikit-learn
>> support either of these last two options?
>>
>> Raphael
>>
>> On 10 October 2016 at 11:55, Raphael C <drr...@gmail.com> wrote:
>> > I am trying to perform regression where my dependent variable is
>> > constrained to be between 0 and 1. This constraint comes from the fact
>> > that it represents a count proportion. That is counts in some category
>> > divided by a total count.
>> >
>> > In the literature it seems that one common way to tackle this is to
>> > use logistic regression. However, it appears that in scikit learn
>> > logistic regression is only available as a classifier
>> >
>> > (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
>> > ) . Is that right?
>> >
>> > Is there another way to perform regression using scikit learn where
>> > the dependent variable is a count proportion?
>> >
>> > Thanks for any help.
>> >
>> > Raphael
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Using logistic regression with count proportions data

2016-10-10 Thread Raphael C
I am trying to perform regression where my dependent variable is
constrained to be between 0 and 1. This constraint comes from the fact
that it represents a count proportion. That is counts in some category
divided by a total count.

In the literature it seems that one common way to tackle this is to
use logistic regression. However, it appears that in scikit-learn
logistic regression is only available as a classifier
(http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).
Is that right?

Is there another way to perform regression using scikit-learn where
the dependent variable is a count proportion?

Thanks for any help.

Raphael
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Github project management tools

2016-09-29 Thread Raphael C
My apologies - I see it is in the spreadsheet. It would be great to see
this work finished for 0.19, if at all possible, IMHO.

Raphael

On 29 September 2016 at 20:12, Raphael C <drr...@gmail.com> wrote:
> I hope this isn't out of place but I notice that
> https://github.com/scikit-learn/scikit-learn/pull/4899 is not in the
> list. It seems like a very worthwhile addition and the PR appears
> stalled at present.
>
> Raphael
>
> On 29 September 2016 at 15:05, Joel Nothman <joel.noth...@gmail.com> wrote:
>> I agree that being able to identify which PRs are stalled on the
>> contributor's part, which on reviewers' part, and since when, would be
>> great. I'm not sure we've come up with a way that'll work though.
>>
>> In terms of backlog, I've wondered if just getting things into a spreadsheet
>> would help:
>>
>> https://docs.google.com/spreadsheets/d/1LdzNxQbn7A0Ao8zlUBgnvT42929JpAe9958YxKCubjE/edit
>>
>> What other features of an Issue / PR would be useful to
>> sort/filter/pivottable on in a spreadsheet form like this?
>>
>> (It would be extra nice if we could modify titles and labels within the
>> spreadsheet and have them update via the GitHub API, but I'm not sure I'll
>> get around to making that feature :P)
>>
>>
>> On 29 September 2016 at 23:45, Andreas Mueller <t3k...@gmail.com> wrote:
>>>
>>> So I made a project for 0.19:
>>>
>>> https://github.com/scikit-learn/scikit-learn/projects/5
>>>
>>> The idea would be to drag and drop issues and PRs so that the important
>>> ones are at the top.
>>> We could also add an "important" column; currently the scrolling is pretty
>>> annoying.
>>> Thoughts?
>>>
>>>
>>>
>>>
>>> On 09/28/2016 03:29 PM, Nelle Varoquaux wrote:
>>>>
>>>> On 28 September 2016 at 12:24, Andreas Mueller <t3k...@gmail.com> wrote:
>>>>>
>>>>>
>>>>> On 09/28/2016 02:21 PM, Nelle Varoquaux wrote:
>>>>>>
>>>>>>
>>>>>> I think the only ones worth having are the ones that can be dealt with
>>>>>> automatically and the ones that will not be used frequently:
>>>>>>
>>>>>> - stalled after 30 days of inactivity [can be done automatically]
>>>>>> - in dispute [I don't expect it to be used often].
>>>>>
>>>>> I think "in dispute" is actually one of the most common statuses among
>>>>> PRs.
>>>>> Or maybe I have a skewed picture of things.
>>>>> Many PRs stalled because it is not clear whether the proposed solution
>>>>> is a
>>>>> good one.
>>>>
>>>> On the stalled ones, sure, but there are a lot of PRs being merged
>>>> fairly quickly. So overall, I think it is quite rare. No?
>>>>
>>>>> It would be great to have some way to get through the backlog of 400 PRs
>>>>> and
>>>>> I think tagging them might be useful.
>>>>> We rarely reject PRs; we could also revisit that policy.
>>>>>
>>>>> For the backlog, it's pretty unclear to me how many are waiting for
>>>>> reviews,
>>>>> how many are waiting for changes,
>>>>> and how many are disputed.
>>>>> Tagging these might help people who want to review to find things to
>>>>> review,
>>>>> and people who want to code to pick
>>>>> up stalled PRs.
>>>>
>>>> That sounds like a great use of labels, though all of these need to
>>>> be tagged manually.
>>>>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Gradient Boosting: Feature Importances do not sum to 1

2016-08-31 Thread Raphael C
Can you provide a reproducible example?
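
Something along these lines, with the regime you describe, would be the
kind of script to check (a sketch; the parameters are guesses):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=100, n_features=20, random_state=0)
clf = GradientBoostingClassifier(n_estimators=2000, max_depth=8,
                                 random_state=0).fit(X, y)
print(clf.feature_importances_.sum())  # ideally exactly 1.0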
Raphael

On Wednesday, August 31, 2016, Douglas Chan  wrote:

> Hello everyone,
>
> I notice conditions under which feature importance values do not add up to
> 1 in ensemble tree methods, like Gradient Boosting Trees or AdaBoost Trees.
> I wonder if there's a bug in the code.
>
> This error occurs when the ensemble has a large number of estimators.  The
> exact conditions vary: for example, the error shows up sooner with a smaller
> amount of training samples, or if the depth of the trees is large.
>
> When this error appears, the predicted value seems to have converged.  But
> it’s unclear if the error is causing the predicted value not to change with
> more estimators.  In fact, the feature importance sum goes lower and lower
> with more estimators thereafter.
>
> I wonder if we’re hitting some floating point calculation error.
>
> Looking forward to hear your thoughts on this.
>
> Thank you!
> -Doug
>
>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Does NMF optimise over observed values

2016-08-29 Thread Raphael C
On Monday, August 29, 2016, Andreas Mueller <t3k...@gmail.com> wrote:

>
>
> On 08/28/2016 01:16 PM, Raphael C wrote:
>
>
>
>> On Sunday, August 28, 2016, Andy <t3k...@gmail.com> wrote:
>
>>
>>
>> On 08/28/2016 12:29 PM, Raphael C wrote:
>>
>> To give a little context from the web, see e.g.
>> http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
>> where it explains:
>>
>> "
>> A question might have come to your mind by now: if we find two matrices P
>> and Q such that P × Q approximates R, isn't that our predictions of all
>> the unseen ratings will all be zeros? In fact, we are not really trying
>> to come up with P and Q such that we can reproduce R exactly. Instead, we
>> will only try to minimise the errors of the observed user-item pairs.
>> try to minimise the errors of the observed user-item pairs.
>> "
>>
>> Yes, the sklearn interface is not meant for matrix completion but
>> matrix-factorization.
>> There was a PR for some matrix completion for missing value imputation at
>> some point.
>>
>> In general, scikit-learn doesn't really implement anything for
>> recommendation algorithms as that requires a different interface.
>>
>
> Thanks Andy. I just looked up that PR.
>
> I was thinking simply producing a different factorisation optimised only
> over the observed values wouldn't need a new interface. That in itself
> would be hugely useful.
>
> Depends. Usually you don't want to complete all values, but only compute a
> factorization. What do you return? Only the factors?
>
> The PR implements completing everything, and that you can do with the
> transformer interface. I'm not sure what the status of the PR is,
> but doing that with NMF instead of SVD would certainly also be interesting.
>

I was thinking you would literally return W and H so that WH approx X.  The
user can then decide what to do with the factorisation just like when doing
SVD.
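
Concretely, the masked objective I have in mind, in the notation of the
NMF docstring (with Omega the set of observed entries of X), is:

0.5 * \sum_{(i,j) \in Omega} (X_{ij} - (WH)_{ij})^2
+ (the same regularisation terms on W and H as in the current docstring)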

Raphael
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Does NMF optimise over observed values

2016-08-28 Thread Raphael C
On Sunday, August 28, 2016, Andy <t3k...@gmail.com> wrote:

>
>
> On 08/28/2016 12:29 PM, Raphael C wrote:
>
> To give a little context from the web, see e.g.
> http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
> where it explains:
>
> "
> A question might have come to your mind by now: if we find two matrices P
> and Q such that P × Q approximates R, isn't that our predictions of all
> the unseen ratings will all be zeros? In fact, we are not really trying
> to come up with P and Q such that we can reproduce R exactly. Instead, we
> will only try to minimise the errors of the observed user-item pairs.
> "
>
> Yes, the sklearn interface is not meant for matrix completion but
> matrix-factorization.
> There was a PR for some matrix completion for missing value imputation at
> some point.
>
> In general, scikit-learn doesn't really implement anything for
> recommendation algorithms as that requires a different interface.
>

Thanks Andy. I just looked up that PR.

I was thinking simply producing a different factorisation optimised only
over the observed values wouldn't need a new interface. That in itself
would be hugely useful.

I can see that providing a full drop in recommender system would involve
more work.

Raphael
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] Does NMF optimise over observed values

2016-08-28 Thread Raphael C
To give a little context from the web, see e.g.
http://www.quuxlabs.com/blog/2010/09/matrix-factorization-a-simple-tutorial-and-implementation-in-python/
where it explains:

"
A question might have come to your mind by now: if we find two matrices P
and Q such that P × Q approximates R, isn't that our predictions of all the
unseen ratings will all be zeros? In fact, we are not really trying to come
up with P and Q such that we can reproduce R exactly. Instead, we will only
try to minimise the errors of the observed user-item pairs.
"

Raphael

On Sunday, August 28, 2016, Raphael C <drr...@gmail.com> wrote:

> Thank you for the quick reply.  Just to make sure I understand: if X is
> sparse and n by n, with X[0,0] = 1 and X[n-1, n-1] = 0 explicitly set
> (that is, only two values are set in X), then is this treated the same,
> for the purposes of the objective function, as the all-zeros n by n
> matrix with X[0,0] set to 1? That is, are all elements of X that are not
> specified explicitly assumed to be 0?
>
> It would be really useful if it were possible to have a version of NMF
> where contributions to the objective function are only counted where the
> value is explicitly set in X.  This is AFAIK the standard formulation for
> collaborative filtering. Would there be any interest in doing this? In
> theory it should be a simple modification of the optimisation code.
>
> Raphael
>
>
>
> On Sunday, August 28, 2016, Arthur Mensch <arthur.men...@inria.fr> wrote:
>
>> Zeros are considered as zeros in the objective function, not as missing
>> values -- i.e. there is no mask in the loss function.
>> On 28 August 2016 at 16:58, "Raphael C" <drr...@gmail.com> wrote:
>>
>> What I meant was, how is the objective function defined when X is sparse?
>>
>> Raphael
>>
>>
>> On Sunday, August 28, 2016, Raphael C <drr...@gmail.com> wrote:
>>
>>> Reading the docs for
>>> http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
>>> it says
>>>
>>> The objective function is:
>>>
>>> 0.5 * ||X - WH||_Fro^2
>>> + alpha * l1_ratio * ||vec(W)||_1
>>> + alpha * l1_ratio * ||vec(H)||_1
>>> + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2
>>> + 0.5 * alpha * (1 - l1_ratio) * ||H||_Fro^2
>>>
>>> Where:
>>>
>>> ||A||_Fro^2 = \sum_{i,j} A_{ij}^2 (Frobenius norm)
>>> ||vec(A)||_1 = \sum_{i,j} abs(A_{ij}) (Elementwise L1 norm)
>>>
>>> This seems to suggest that it is optimising over all values in X even if X 
>>> is sparse.   When using NMF for collaborative filtering we need the 
>>> objective function to be defined over only the defined elements of X. The 
>>> remaining elements should effectively be regarded as missing.
>>>
>>>
>>> What is the true objective function NMF is using?
>>>
>>>
>>> Raphael
>>>
>>>
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


[scikit-learn] Does NMF optimise over observed values

2016-08-28 Thread Raphael C
Reading the docs for
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.NMF.html
it says

The objective function is:

0.5 * ||X - WH||_Fro^2
+ alpha * l1_ratio * ||vec(W)||_1
+ alpha * l1_ratio * ||vec(H)||_1
+ 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2
+ 0.5 * alpha * (1 - l1_ratio) * ||H||_Fro^2

Where:

||A||_Fro^2 = \sum_{i,j} A_{ij}^2 (Frobenius norm)
||vec(A)||_1 = \sum_{i,j} abs(A_{ij}) (Elementwise L1 norm)

This seems to suggest that it is optimising over all values in X even
if X is sparse.  When using NMF for collaborative filtering we need
the objective function to be defined only over the observed elements of
X. The remaining elements should effectively be regarded as missing.


What is the true objective function NMF is using?


Raphael
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] How to get the most important features from a RF efficiently

2016-07-21 Thread Raphael C
The problem was that I had a loop like

for i in xrange(len(clf.feature_importances_)):
    print clf.feature_importances_[i]

which recomputes the feature importance array in every step.

Obvious in hindsight.
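
For the archive: caching the property fixes it, and argsort then gives
the most important features directly (sketch, same Python 2 style):

importances = clf.feature_importances_  # computed once
for i in xrange(len(importances)):
    print importances[i]
# indices of the 20 most important features, in decreasing order:
top = importances.argsort()[::-1][:20]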

Raphael


On 21 July 2016 at 16:22, Raphael C <drr...@gmail.com> wrote:
> I have a set of feature vectors, each with about 40,000 features,
> associated with binary class labels. I can train a random forest
> classifier in sklearn which works well. I would however like to see
> the most important features.
> the most important features.
>
> I tried simply printing out forest.feature_importances_ but this takes
> about 1 second per feature, making about 40,000 seconds overall. This
> is much, much longer than the time needed to train the classifier in
> the first place.
>
> Is there a more efficient way to find out which features are most important?
>
> Raphael
>
> On 21 July 2016 at 15:58, Nelson Liu <nf...@uw.edu> wrote:
>> Hi,
>> If I remember correctly, scikit-learn.org is hosted on GitHub Pages (so the
>> maintainers don't have control over downtime and issues like the one you're
>> having). Can you connect to GitHub, or any site on GitHub Pages?
>>
>> Thanks
>> Nelson
>>
>> On Thu, Jul 21, 2016, 07:52 Rahul Ahuja <rahul.ah...@live.com> wrote:
>>>
>>> Hi there,
>>>
>>>
>>> The sklearn website has been down for a couple of days. Please look into it.
>>>
>>>
>>> I reside in Karachi, Pakistan.
>>>
>>>
>>>
>>>
>>>
>>>
>>> Kind regards,
>>> Rahul Ahuja
___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn