[scikit-learn] memory efficient feature extraction

2016-06-06 Thread Roman Yurchak
Dear all,

I was wondering if somebody could advise on the best way for
generating/storing large sparse feature sets that do not fit in memory?
In particular, I have the following workflow,

Large text dataset -> HashingVectorizer -> Feature set in a sparse CSR
array on disk -> Training a classifier -> Predictions

where the generated feature set is too large to fit in RAM; however,
the classifier training can be done in one step (as it uses only certain
rows of the CSR array) and the prediction can be split into several steps,
all of which fit in memory. Since the training can be performed in one
step, I'm not looking for incremental out-of-core learning approaches,
and saving features to disk for later processing is definitely useful.

For instance, if it was possible to save the output of the
HashingVectorizer to a single file on disk (using e.g. joblib.dump) then
load this file as a memory map (using e.g. joblib.load(..,
mmap_mode='r')) everything would work great. Due to memory constraints
this cannot be done directly, and the best case scenario is applying
HashingVectorizer on chunks of the dataset, which produces a series of
sparse CSR arrays on disk. Then,
 - concatenation of these arrays into a single CSR array appears to be
non-trivial given the memory constraints (e.g. scipy.sparse.vstack
transforms all arrays to COO sparse representation internally).
 - I was not able to find an abstraction layer that would allow
representing these sparse arrays as a single array. For instance, dask
could do this for dense arrays (
http://dask.pydata.org/en/latest/array-stack.html ), however support for
sparse arrays is only planned at this point (
https://github.com/dask/dask/issues/174 ).
  Finally, it is not possible to pre-allocate the full array on disk in
advance (and access it as a memory map) because we don't know the number
of non-zero elements in the sparse array before running the feature
extraction.
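
The chunked extraction step described above could look roughly like the following. This is only a sketch: the helper name `vectorize_in_chunks`, the chunk size and the file naming are made up for illustration, and it assumes a SciPy version that provides `scipy.sparse.save_npz`.

```python
import os
import tempfile

import scipy.sparse as sp
from sklearn.feature_extraction.text import HashingVectorizer


def vectorize_in_chunks(documents, out_dir, chunk_size=2, n_features=2**18):
    """Hash documents chunk by chunk, saving one CSR file per chunk."""
    # HashingVectorizer is stateless, so no fit step is needed
    vect = HashingVectorizer(n_features=n_features)
    paths = []
    for start in range(0, len(documents), chunk_size):
        X_chunk = vect.transform(documents[start:start + chunk_size])
        # hypothetical file naming scheme, one .npz file per chunk
        path = os.path.join(out_dir, "chunk_{:05d}.npz".format(len(paths)))
        sp.save_npz(path, X_chunk)
        paths.append(path)
    return paths


docs = ["first document", "second document", "third one",
        "yet another text", "the last document"]
out_dir = tempfile.mkdtemp()
chunk_paths = vectorize_in_chunks(docs, out_dir)
```

Only one chunk is held in memory at a time; the remaining problem is how to view the resulting files as a single array.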

  Of course, it is possible to overcome all these difficulties by using
a machine with more memory, but my point is rather to have a memory
efficient workflow.

  I would really appreciate any advice on this and would be happy to
contribute to a project in the scikit-learn environment aiming to
address similar issues,

Thank you,
Best,
-- 
Roman



___
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn


Re: [scikit-learn] memory efficient feature extraction

2016-06-06 Thread Roman Yurchak
Hi Joel,

thanks for your response.

On 06/06/16 14:29, Joel Nothman wrote:
>  - concatenation of these arrays into a single CSR array appears to be
> non-trivial given the memory constraints (e.g. scipy.sparse.vstack
> transforms all arrays to COO sparse representation internally).
> 
> There is a fast path for stacking a series of CSR matrices. 
Could you elaborate a bit more? When the final array is larger than the
available memory?

Do you mean something along the lines of,

  1. Load all arrays of the series as memory maps, and calculate the
expected final array shape
  2. Allocate the `data`, `indices` and `indptr` arrays on disk using
either numpy memory map or HDF5
  3. Recalculate `indptr` for each array in the series and fill the 3
resulting arrays
  4. Make sure that we can open these files as a scipy CSR array with
the ability to load only a subset of rows to memory?
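
The four steps above can be sketched as follows. This is a minimal illustration under assumptions, not an existing scikit-learn or scipy facility: the function names `stack_csr_on_disk` and `load_rows` and the file names are hypothetical, and the chunks are passed as an in-memory list only to keep the example short (in practice they would be loaded one at a time).

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp


def stack_csr_on_disk(chunks, out_dir):
    """Row-stack CSR chunks into memory-mapped data/indices/indptr files."""
    # Step 1: compute the final shape and number of non-zeros
    n_rows = sum(c.shape[0] for c in chunks)
    n_cols = chunks[0].shape[1]
    nnz = sum(c.nnz for c in chunks)

    # Step 2: allocate the three CSR arrays on disk as .npy memory maps
    def alloc(name, dtype, size):
        path = os.path.join(out_dir, name + ".npy")
        return np.lib.format.open_memmap(path, mode="w+", dtype=dtype,
                                         shape=(size,))

    data = alloc("data", chunks[0].dtype, nnz)
    indices = alloc("indices", np.int64, nnz)
    indptr = alloc("indptr", np.int64, n_rows + 1)

    # Step 3: copy each chunk, shifting its indptr by the running offset
    indptr[0] = 0
    row, pos = 0, 0
    for c in chunks:
        data[pos:pos + c.nnz] = c.data
        indices[pos:pos + c.nnz] = c.indices
        shifted = c.indptr[1:].astype(np.int64) + pos
        indptr[row + 1:row + 1 + c.shape[0]] = shifted
        row += c.shape[0]
        pos += c.nnz
    for arr in (data, indices, indptr):
        arr.flush()
    return n_rows, n_cols


def load_rows(out_dir, n_cols, start, stop):
    """Step 4: load only rows [start, stop) into memory as a CSR matrix."""
    data, indices, indptr = (
        np.load(os.path.join(out_dir, name + ".npy"), mmap_mode="r")
        for name in ("data", "indices", "indptr"))
    lo, hi = int(indptr[start]), int(indptr[stop])
    sub_indptr = np.array(indptr[start:stop + 1]) - lo
    return sp.csr_matrix(
        (np.array(data[lo:hi]), np.array(indices[lo:hi]), sub_indptr),
        shape=(stop - start, n_cols))


chunks = [sp.random(4, 10, density=0.3, format="csr", random_state=seed)
          for seed in range(3)]
out_dir = tempfile.mkdtemp()
n_rows, n_cols = stack_csr_on_disk(chunks, out_dir)
subset = load_rows(out_dir, n_cols, 3, 9)
```

Only the requested rows are copied into RAM by `load_rows`; the full `data` and `indices` arrays stay on disk.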

I'm just wondering if there is a more standard storage solution in the
scikit-learn environment that could be used efficiently with a
stateless feature extractor (HashingVectorizer) ,

Cheers,
-- 
Roman


[scikit-learn] Latent Semantic Analysis (LSA) and TruncatedSVD

2016-08-26 Thread Roman Yurchak
Hi all,

I have a question about using the TruncatedSVD method for performing
Latent Semantic Analysis/Indexing (LSA/LSI). The docs imply that simply
applying TruncatedSVD to a tf-idf matrix is sufficient (cf.
http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html),
but I'm wondering about that.

As far as I understood for LSA one computes a truncated SVD
decomposition of the tf-idf matrix X (n_features x n_samples),
  X ≈ U @ Sigma @ V.T
and then for a document vector d, the projection is computed as,
  d_proj = d.T @ U @ Sigma⁻¹
(source: http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf)
However, TruncatedSVD.fit_transform only computes,
  d_proj = d.T @ U
and what's more does not store the singular values (Sigma) internally,
so it cannot be easily applied afterwards.
(the above notation is transposed with respect to that in the
scikit-learn docs).
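
To make the difference concrete, here is a small sketch on random data (not the example from the literature). It assumes the scikit-learn convention where X is (n_samples x n_features), so that `fit_transform` returns X @ V = U @ Sigma, and it recovers Sigma from the column norms of that output rather than relying on any stored attribute.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# toy "tf-idf" matrix, shape (n_samples, n_features)
X = sparse_random(50, 30, density=0.2, format="csr", random_state=0)

svd = TruncatedSVD(n_components=5, algorithm="arpack")
X_proj = svd.fit_transform(X)  # U @ Sigma in scikit-learn notation

# columns of U are orthonormal, so the column norms of U @ Sigma are
# the singular values
sigma = np.linalg.norm(X_proj, axis=0)

# additional normalization by Sigma, as in the LSA projection formula
X_lsa = X_proj / sigma
```

With this normalization the columns of `X_lsa` are orthonormal, which is not the case for the raw `fit_transform` output.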

For instance, I have tried reproducing LSA decomposition from literature
and I'm not getting the expected results unless I perform an additional
normalization by the Sigma matrix:
https://gist.github.com/rth/3af30c60bece7db4207821a6dddc5e8d

I was wondering if I am missing something here?
Thank you,
-- 
Roman


Re: [scikit-learn] Latent Semantic Analysis (LSA) and TruncatedSVD

2016-08-29 Thread Roman Yurchak
Thank you for all your responses!

In LSA, I think the following are equivalent:
   - apply an L2 normalization (not the StandardScaler) after the LSA
and then compute the cosine similarity between document vectors simply
as a dot product.
   - do not apply the L2 normalization and call the `cosine_similarity`
function instead.
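
A quick check of this equivalence on random vectors (a sketch; the array `docs` merely stands in for LSA-transformed documents):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize

rng = np.random.RandomState(0)
docs = rng.rand(6, 4)  # stand-in for document vectors after LSA

# option 1: L2-normalize each row, then use a plain dot product
docs_l2 = normalize(docs, norm="l2")
sim_dot = docs_l2 @ docs_l2.T

# option 2: no normalization, call cosine_similarity directly
sim_cos = cosine_similarity(docs)
```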

I have applied this normalization to the previous example, and it
does indeed produce equivalent results (i.e. does not solve the problem).
I am opening an issue on this for further discussion:
   https://github.com/scikit-learn/scikit-learn/issues/7283

Thanks for your feedback!
-- 
Roman

On 28/08/16 18:20, Andy wrote:
> If you do "with_mean=False" it should be the same, right?
> 
> On 08/27/2016 12:20 PM, Olivier Grisel wrote:
>> I am not sure this is exactly the same because we do not center the
>> data in the TruncatedSVD case (as opposed to the real PCA case where
>> whitening is the same as calling StandardScaler).
>>
>> Having an option to normalize the transformed data by sigma seems like
>> a good idea but we should probably not call that whitening.
>>
> 


Re: [scikit-learn] Confidence Estimation for Regressor Predictions

2016-09-01 Thread Roman Yurchak
I'm also interested to know if there are any projects similar to
scikit-learn-contrib/forest-confidence-interval for linear_model or SVM
regressors.

In the general case, I think you could get a quick first-order
approximation of the confidence interval for your regressor, if you take
the standard deviation of predictions obtained by fitting different
subsets of your data using,
 cross_validation.cross_val_score( ).std()
with a fixed set of estimator parameters? Or some multiple of it (e.g.
2*std). Though this will probably not match exactly the mathematical
definition of a confidence interval.
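
Note that `cross_val_score` returns one score per fold rather than per-sample predictions, so a per-prediction spread needs something slightly different. Below is a rough sketch using bootstrap resampling instead; the Ridge estimator, the synthetic data and the number of resamples are arbitrary choices for illustration, and the resulting spread is only a heuristic, not a proper confidence interval.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.utils import resample

rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.randn(200)
X_new = rng.rand(5, 3)  # points for which we want an uncertainty estimate

preds = []
for seed in range(50):
    # refit the same estimator (fixed parameters) on a bootstrap resample
    X_b, y_b = resample(X, y, random_state=seed)
    preds.append(Ridge(alpha=1.0).fit(X_b, y_b).predict(X_new))
preds = np.array(preds)  # shape (n_resamples, n_new_points)

pred_mean = preds.mean(axis=0)
pred_std = preds.std(axis=0)  # e.g. report pred_mean +/- 2 * pred_std
```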
-- 
Roman


On 01/09/16 20:32, Dale T Smith wrote:
> There is a scikit-learn-contrib project with confidence intervals for random 
> forests.
> 
> https://github.com/scikit-learn-contrib/forest-confidence-interval
> 
> 
> __
> Dale Smith | Macy's Systems and Technology | IFS eCommerce | Data Science and 
> Capacity Planning
>  | 5985 State Bridge Road, Johns Creek, GA 30097 | dale.t.sm...@macys.com
> 
> -Original Message-
> From: scikit-learn 
> [mailto:scikit-learn-bounces+dale.t.smith=macys@python.org] On Behalf Of 
> Daniel Seeliger via scikit-learn
> Sent: Thursday, September 1, 2016 2:28 PM
> To: scikit-learn@python.org
> Cc: Daniel Seeliger
> Subject: [scikit-learn] Confidence Estimation for Regressor Predictions
> 
> ⚠ EXT MSG:
> 
> Dear all,
> 
> For classifiers I make use of the predict_proba method to compute a Gini 
> coefficient or entropy to get an estimate of how "sure" the model is about an 
> individual prediction.
> 
> Is there anything similar I could use for regression models? I guess for a 
> RandomForest I could simply use the individual predictions of each tree in 
> clf.estimators_ and compute a standard deviation but I guess this is not a 
> generic approach I can use for other regressors like the 
> GradientBoostingRegressor or a SVR.
> 
> Thanks a lot for your help,
> Daniel


Re: [scikit-learn] Confidence Estimation for Regressor Predictions

2016-09-01 Thread Roman Yurchak
Dale, I meant for all the methods in sklearn.linear_model. Linear
regression is well known, but for, say, ridge regression it does not
look that simple: http://stats.stackexchange.com/a/15417 .
Thanks for mentioning the bootstrap method!

-- 
Roman

On 01/09/16 21:55, Dale T Smith wrote:
> Confidence intervals for linear models are well known - see any statistics 
> book or look it up on Wikipedia. You should be able to calculate everything 
> you need for a linear model just from the information the estimator provides. 
> Note the Rsquared provided by linear_model appears to be what statisticians 
> call the adjusted-Rsquared.


Re: [scikit-learn] Issue with sklearn.neural_network

2016-09-09 Thread Roman Yurchak
Ibrahim, I believe the sklearn.neural_network.MLPClassifier was added in
the not yet released v0.18 (current dev version),
http://scikit-learn.org/dev/modules/neural_networks_supervised.html
-- 
Roman
On 09/09/16 10:19, Ibrahim Dalal via scikit-learn wrote:
> Dear Developers,
> 
> I am using sklearn version 0.17.1 on Ubuntu 14.04.
> 
> I was checking out neural network examples and one such example used
> sklearn.neural_network.MLPClassifier. When I tried this, I get the
> following error:
> 
> >>> from sklearn import neural_network
> >>> clf = neural_network.MLPClassifier()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AttributeError: 'module' object has no attribute 'MLPClassifier'
> 
> Thanks
> 
> 


Re: [scikit-learn] Why does sci-kit learn's hashingvectorizer give negative values?

2016-10-01 Thread Roman Yurchak
On 01/10/16 15:34, Moyi Dang wrote:
> However, I don't understand why the negatives are there in the first
> place, or what they mean. I'm not sure if the absolute values are
> corresponding to the token counts.
> 
> Can someone please help explain what the HashingVectorizer is doing? How
> do I get the HashingVectorizer to return token counts?

Hi Moyi,

it's a mechanism to compensate for hash collisions, see
https://github.com/scikit-learn/scikit-learn/issues/7513 The absolute
values are token counts for most practical applications (if you don't
have too many collisions).  There will be a PR shortly to make this more
consistent.
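
For illustration, here is a sketch of how plain token counts can be obtained. It assumes a scikit-learn release where HashingVectorizer accepts the `alternate_sign` parameter, which is not available in the version discussed in this thread.

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["the cat sat on the mat", "the dog"]

# default behaviour: the sign of each feature is decided by the hash,
# so some entries can be negative
signed = HashingVectorizer(n_features=2**10, norm=None).transform(docs)

# assumption: alternate_sign=False disables the sign flipping, so the
# entries are raw counts (colliding tokens are summed together)
counts = HashingVectorizer(n_features=2**10, norm=None,
                           alternate_sign=False).transform(docs)
```

Setting `norm=None` matters here as well, since the default L2 normalization would otherwise rescale the counts.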




Re: [scikit-learn] hierarchical clustering

2016-11-04 Thread Roman Yurchak
Hi Jaime,

Alternatively, in scikit learn I think, you could use
   hac = AgglomerativeClustering(n_clusters, linkage="ward")
   hac.fit(data)
   clusters = hac.labels_
There is an example of how to plot a dendrogram from this in
   https://github.com/scikit-learn/scikit-learn/pull/3464

AgglomerativeClustering internally calls scikit learn's version of
cut_tree. I would be curious to know whether this is equivalent to
scipy's fcluster.
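
One way to check this empirically, at least on toy data, is to compare the two labelings with the adjusted Rand index. The sketch below uses well-separated synthetic blobs and the `maxclust` criterion of fcluster, both chosen for illustration; agreement there says nothing definitive about harder datasets.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

data, _ = make_blobs(n_samples=60, centers=[[0, 0], [10, 10], [-10, 10]],
                     cluster_std=0.5, random_state=0)

# scikit-learn: Ward linkage cut at 3 clusters
hac = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(data)

# scipy: Ward linkage matrix, then cut into at most 3 flat clusters
Z = linkage(data, "ward")
scipy_labels = fcluster(Z, t=3, criterion="maxclust")

# label numbering differs between the two, so compare partitions instead
agreement = adjusted_rand_score(hac.labels_, scipy_labels)
```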

Roman

On 03/11/16 23:12, Jaime Lopez Carvajal wrote:
> Hi Juan,
> 
> The fcluster function was what I needed. I can now proceed from here to
> classify images. 
> Thank you very much, 
> 
> Jaime
> 
> On Thu, Nov 3, 2016 at 5:00 PM, Juan Nunez-Iglesias wrote:
> 
> Hi Jaime,
> 
> From /Elegant SciPy/:
> 
> """
> The *fcluster* function takes a linkage matrix, as returned by
> linkage, and a threshold, and returns cluster identities. It's
> difficult to know a-priori what the threshold should be, but we can
> obtain the appropriate threshold for a fixed number of clusters by
> checking the distances in the linkage matrix.
> 
> from scipy.cluster.hierarchy import fcluster
> n_clusters = 3
> threshold_distance = (Z[-n_clusters, 2] + Z[-n_clusters+1, 2]) / 2
> clusters = fcluster(Z, threshold_distance, 'distance')
> 
> """
> 
> As an aside, I imagine this question is better placed in the SciPy
> mailing list than scikit-learn (which has its own hierarchical
> clustering API).
> 
> Juan.
> 
> On Fri, Nov 4, 2016 at 2:16 AM, Jaime Lopez Carvajal wrote:
> 
> Hi there,
> 
> I am trying to do image classification using hierarchical
> clustering.
> So, I have my data, and apply these steps:
> 
> from scipy.cluster.hierarchy import dendrogram, linkage
> 
> data1 = np.array(data) 
> Z = linkage(data, 'ward')
> dendrogram(Z, truncate_mode='lastp',  p=12,
> show_leaf_counts=False, leaf_rotation=90.,
> leaf_font_size=12.,show_contracted=True)
> plt.show()
> 
> So, I can see the dendrogram with 12 clusters as I want, but I
> don't know how to use this to classify the image.
> Also, I understand that the function cluster.hierarchy.cut_tree(Z,
> n_clusters) cuts the tree at that number of clusters, but
> again I don't know how to proceed from there. I would like to
> have something like: cluster = predict(Z, instance) 
> 
> Any advice or direction would be really appreciated, 
> 
> Thanks in advance, Jaime
> 


Re: [scikit-learn] Specifying exceptions to ParameterGrid

2016-11-23 Thread Roman Yurchak
Hi Jaidev,

well, `param_grid` in GridSearchCV can also be a list of dictionaries,
so you could directly specify the cases you are interested in (instead
of the full grid - exceptions), which might be simpler?
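
For the MLP grid from the original post, that could look like the sketch below (the parameter values are copied from the question; restricting `learning_rate` to the sgd solver is the only change).

```python
from sklearn.model_selection import ParameterGrid

# one dictionary per "sub-grid"; learning_rate is only varied for sgd
param_grid = [
    {"solver": ["sgd"],
     "learning_rate": ["constant", "invscaling", "adaptive"]},
    {"solver": ["adam"]},
]

# GridSearchCV(estimator, param_grid=param_grid) would iterate over these
candidates = list(ParameterGrid(param_grid))
```

This gives four candidates instead of six: three for sgd and a single one for adam.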

On 23/11/16 11:15, Jaidev Deshpande wrote:
> Hi,
> 
> Sometimes when using GridSearchCV, I realize that in the grid there are
> certain combinations of hyperparameters that are either incompatible or
> redundant. For example, when using an MLP, if I specify the following grid:
> 
> grid = {'solver': ['sgd', 'adam'], 'learning_rate': ['constant',
> 'invscaling', 'adaptive']}
> 
> then it yields the following ParameterGrid:
> 
> [{'learning_rate': 'constant', 'solver': 'sgd'},
>  {'learning_rate': 'constant', 'solver': 'adam'},
>  {'learning_rate': 'invscaling', 'solver': 'sgd'},
>  {'learning_rate': 'invscaling', 'solver': 'adam'},
>  {'learning_rate': 'adaptive', 'solver': 'sgd'},
>  {'learning_rate': 'adaptive', 'solver': 'adam'}]
> 
> Now, three of these are redundant, since learning_rate is used only for
> the sgd solver. Ideally I'd like to specify these cases upfront, and for
> that I have a simple hack
> (https://github.com/jaidevd/jarvis/blob/master/jarvis/cross_validation.py#L38).
> Using that yields a ParameterGrid as follows:
> 
> [{'learning_rate': 'constant', 'solver': 'adam'},
>  {'learning_rate': 'invscaling', 'solver': 'adam'},
>  {'learning_rate': 'adaptive', 'solver': 'adam'}]
> 
> which is then simply removed from the original ParameterGrid.
> 
> I wonder if there's a simpler way of doing this. Would it help if we had
> an additional parameter (something like "grid_exceptions") in
> GridSearchCV, which would remove these dicts from the list of parameters?
> 
> Thanks
> 
> 


Re: [scikit-learn] Specifying exceptions to ParameterGrid

2016-11-25 Thread Roman Yurchak
On 24/11/16 09:00, Jaidev Deshpande wrote:
> 
> well, `param_grid` in GridSearchCV can also be a list of dictionaries,
> so you could directly specify the cases you are interested in (instead
> of the full grid - exceptions), which might be simpler?
> 
> 
> Actually now that I think of it, I don't know if it will be necessarily
> simpler. What if I have a massive grid and only few exceptions?
> Enumerating the complement of that small subset would be much more
> expensive than specifying the exceptions.
The solution indicated by Raghav is most concise if that works for you.

Otherwise, in general, if you want to define the parameters as the full
grid with a few exceptions, without changing the GridSearchCV API, you
could always try something like,

```
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.neural_network import MLPClassifier

grid_full = {'solver': ['sgd', 'adam'],
             'learning_rate': ['constant', 'invscaling', 'adaptive']}

def exception_handler(args):
    # custom function shaping the domain of valid parameters
    if args['solver'] == 'adam' and args['learning_rate'] != 'constant':
        return False
    else:
        return True

def wrap_strings(args):
    # all values of dicts provided to GridSearchCV must be lists
    return {key: [val] for key, val in args.items()}

grid_tmp = filter(exception_handler, ParameterGrid(grid_full))
grid = [wrap_strings(el) for el in grid_tmp]

gs = GridSearchCV(MLPClassifier(random_state=42),
                  param_grid=grid)
```
That's quite similar to what you were suggesting in the original post.


Re: [scikit-learn] Query Regarding Model Scoring using scikit learn's joblib library

2016-12-27 Thread Roman Yurchak
Hi Debu,

On 27/12/16 08:18, Andrew Howe wrote:
>  5. I got a prediction result with True Positive Rate (TPR) as 10-12
> % on probability thresholds above 0.5

Getting a high True Positive Rate (recall) is not a sufficient condition
for a well behaved model. Though 0.1 recall is still pretty bad. You
could look at the precision at the same time (or consider, for instance,
the F1 score).
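
As a toy illustration of why recall alone is not enough (the labels below are made up, not taken from your data):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# 4 positives out of 10 samples; the model only finds one of them
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

recall = recall_score(y_true, y_pred)        # fraction of positives found
precision = precision_score(y_true, y_pred)  # fraction of hits that are right
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two
```

Here the precision is perfect while the recall is poor, and the F1 score reflects both.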

>  7. I reloaded the model in a different python instance from the
> pickle file mentioned above and did my scoring , i.e., used
> joblib library load method and then instantiated prediction
> (predict_proba method) on the entire set of my original 600 K
> records 
>   Another question – is there an alternate model scoring
> library (apart from joblib, the one I am using) ?

Joblib is not a scoring library; once you load a model from disk with
joblib you should get ~ the same RandomForestClassifier estimator object
as before saving it.

>  8. Now when I am running (scoring) my model using
> joblib.predict_proba on the entire set of original data (600 K),
> I am getting a True Positive rate of around 80%. 

That sounds normal, considering what you are doing. Your entire set
consists of 80% of training set (for which the recall, I imagine, would
be close to 1.0) and 20 %  test set (with a recall of 0.1), so on
average you would get a recall close to 0.8 for the complete set. Unless
I missed something.
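
The arithmetic behind that estimate, as a small sketch. It assumes that positives are spread between the training and test sets in the same 80/20 proportion as the samples, and that the training-set recall is exactly 1.0.

```python
train_fraction, test_fraction = 0.8, 0.2
recall_train, recall_test = 1.0, 0.1  # assumed values, see above

# recall on the full set is the positive-weighted average of the two
recall_all = train_fraction * recall_train + test_fraction * recall_test
print(round(recall_all, 2))  # 0.82
```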


>  9. I did some  further analysis and figured out that during the
> training process, when the model was predicting on the test
> sample of 120K it could only predict 10-12% of 120K data beyond
> a probability threshold of 0.5. When I am now trying to score my
> model on the entire set of 600 K records, it appears that the
> model is remembering some of it’s past behavior and data and
> accordingly throwing 80% True positive rate

It feels like your RandomForestClassifier is not properly tuned. A
recall of 0.1 on the test set is quite low. It could be worth trying to
tune it better (cf. https://stackoverflow.com/a/36109706 ), using some
other metric than the recall to evaluate the performance.


Roman


Re: [scikit-learn] Roc curve from multilabel classification has slope

2017-01-08 Thread Roman Yurchak
José, I might be misunderstanding something, but wouldn't it make more
sense to plot one ROC curve for every class in your result (using all
samples at once), as opposed to plotting it for every test sample as
you are doing now? Cf. the example below,

http://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html
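
In code, the per-class approach would be roughly as follows. This is a sketch with synthetic scores standing in for the Keras model output; `y_true` is the binarized label matrix of shape (n_samples, n_classes) and `y_score` the predicted probabilities with the same shape.

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

rng = np.random.RandomState(0)
n_samples, n_classes = 200, 3
y_true = rng.randint(0, 2, size=(n_samples, n_classes))
# noisy synthetic scores, loosely correlated with the true labels
y_score = 0.3 * y_true + 0.7 * rng.rand(n_samples, n_classes)

roc_auc = {}
for k in range(n_classes):
    # one curve per class (column), computed over all samples at once
    fpr, tpr, _ = roc_curve(y_true[:, k], y_score[:, k])
    roc_auc[k] = auc(fpr, tpr)
```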

Roman

On 08/01/17 01:42, Jacob Schreiber wrote:
> Slope usually means there are ties in your predictions. Check your
> dataset to see if you have repeated predicted values (possibly 1 or 0).
> 
> On Sat, Jan 7, 2017 at 4:32 PM, José Ismael Fernández Martínez wrote:
> 
> But is not a scikit-learn classifier, is a keras classifier which,
> in the functional API, predict returns probabilities.
> What I don't understand is why my plot of the roc curve has a slope,
> since I call roc_curve passing the actual label as y_true and the
> output of the classifier (score probabilities) as y_score for every
> element tested.
> 
> 
> 
> On Jan 7, 2017, at 4:04 PM, Joel Nothman wrote:
> 
>> predict method should not return probabilities in scikit-learn
>> classifiers. predict_proba should.
>>
>> On 8 January 2017 at 07:52, José Ismael Fernández Martínez wrote:
>>
>> Hi, I have a multilabel classifier written in Keras from which
>> I want to compute AUC and plot a ROC curve for every element
>> classified from my test set.
>>
>> 
>>
>> Everything seems fine, except that some elements have a ROC
>> curve that has a slope, as follows:
>>
>> [image: ROC curve]
>>
>> I don't know how to interpret the slope in such cases.
>>
>> Basically my workflow goes as follows, I have a
>> pre-trained |model|, instance of Keras, and I have the
>> features |X| and the binarized labels |y|, every element
>> in |y| is an array of length 1000, as it is a multilabel
>> classification problem each element in |y| might contain many
>> 1s, indicating that the element belongs to multiples classes,
>> so I used the built-in loss of |binary_crossentropy| and my
>> outputs of the model prediction are score probabilities. Then I
>> plot the roc curve as follows.
>>
>>
>> The predict method returns probabilities, as I'm using the
>> functional api of keras.
>>
>> Does anyone know why my roc curves look like this?
>>
>>
>> Ismael


Re: [scikit-learn] Error while using GridSearchCV.

2017-03-07 Thread Roman Yurchak
Shubham,

the definition of ShuffleSplit.__init__ is
ShuffleSplit(n_splits=10, test_size=0.1, train_size=None,
random_state=None)
You are passing the n_splits parameter twice (once by name and once as the
first positional argument), as the exception you are getting says.
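
With the model_selection API, the number of samples is no longer passed to the constructor; the data is given to `split` instead. A minimal sketch with a toy array in place of your X_train:

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(40).reshape(20, 2)  # toy data: 20 samples, 2 features

# no X.shape[0] as first argument; all parameters are passed by name
cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)

# the data is only needed when the splits are generated
splits = list(cv_sets.split(X))
```

The `cv_sets` object can then be passed directly as the `cv` argument of GridSearchCV.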

-- 
Roman

On 07/03/17 14:24, Shubham Singh Tomar wrote:
> Hi,
> 
> I'm trying to use GridSearchCV to tune the parameters for
> DecisionTreeRegressor. I'm using sklearn 0.18.1
> 
> I'm getting the following error:
> 
> ---
> TypeError                                 Traceback (most recent call last)
> in ()
>       1 # Fit the training data to the model using grid search
> ----> 2 reg = fit_model(X_train, y_train)
>       3
>       4 # Produce the value for 'max_depth'
>       5 print "Parameter 'max_depth' is {} for the optimal model.".format(reg.get_params()['max_depth'])
>
> in fit_model(X, y)
>      11
>      12 # Create cross-validation sets from the training data
> ---> 13 cv_sets = ShuffleSplit(X.shape[0], n_splits = 10, test_size = 0.20, random_state = 0)
>      14
>      15 # TODO: Create a decision tree regressor object
>
> TypeError: __init__() got multiple values for keyword argument 'n_splits'


Re: [scikit-learn] best way to scale on the random forest for text w bag of words ...

2017-03-16 Thread Roman Yurchak
If you run out of memory at the prediction step, splitting the test 
dataset in batches, then concatenating the results should work fine. Why 
would it "skew" the results?
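
A small sketch of what I mean, on synthetic data (the helper name `predict_in_batches` and the batch size are arbitrary): each row is predicted independently of the others, so the batched and the one-shot predictions should be identical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
clf = RandomForestClassifier(n_estimators=10, random_state=0, n_jobs=1)
clf.fit(X[:200], y[:200])


def predict_in_batches(model, X_test, batch_size=32):
    """Predict batch by batch, then concatenate the results."""
    parts = [model.predict(X_test[i:i + batch_size])
             for i in range(0, X_test.shape[0], batch_size)]
    return np.concatenate(parts)


y_batched = predict_in_batches(clf, X[200:])
y_full = clf.predict(X[200:])
```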


70GB RAM seems huge: for comparison, here are some categorization 
benchmarks on a 700k text dataset that use on the order of 5-10 GB 
RAM,

https://github.com/FreeDiscovery/FreeDiscovery/issues/58
though with fairly short documents, for other algorithms and with a 
smaller training set.


You could also try reducing the size of your dictionary with hashing.
If you really want to use random forest and have memory constraints, you 
might want to use n_jobs=1 to avoid memory copies,


https://www.quora.com/Why-is-scikit-learns-random-forest-using-so-much-memory

But as Joel was saying, random forest might not be the best choice for huge 
sparse arrays; NaiveBayes, LogisticRegression or SVM could be better 
suited, or gradient boosting if you want to go that way...



On 16/03/17 02:44, Joel Nothman wrote:

Trees are not a traditional choice for bag of words models, but you
should make sure you are at least using the parameters of the random
forest to limit the size (depth, branching) of the trees.

On 16 March 2017 at 12:20, Sasha Kacanski wrote:

Hi,
As soon as number of trees and features goes higher, 70Gb of ram is
gone and i am getting out of memory errors.
file size is 700Mb. Dataframe quickly shrinks from 14 to 2 columns
but there is ton of text ...
with 10 estimators and 100 features per word I can't tackle ~900 k
of records ...
Training set, about 15% of data does perfectly fine but when test
come that is it.

i can split stuff and multiprocess it but I believe that will simply
skew results...

Any ideas?


--
Aleksandar Kacanski



Re: [scikit-learn] How to dump a model to txt file?

2017-04-14 Thread Roman Yurchak
Also, there is an effort on converting trained scikit-learn models to 
other languages (e.g. C) in https://github.com/nok/sklearn-porter

but it does not support GradientBoostingRegressor (yet).

On 13/04/17 23:27, federico vaggi wrote:

If you want to use the model from C++ code, the easiest way is to
probably use Boost/Python
(http://www.boost.org/doc/libs/1_62_0/libs/python/doc/html/index.html).
Alternatively, use another gradient boosting library that has a C++ API
(like XGBoost).

Keep in mind, if you want to call Python code from C++ you will have to
bundle a Python interpreter as well as all the dependencies.

On Thu, 13 Apr 2017 at 14:23 Sebastian Raschka wrote:

Hi,

not sure how this could generally work. However, you could at least
dump the model parameters for e.g., linear models and compute the
prediction via

w_1 * x_1 + w_2 * x_2 + … + w_n * x_n + bias

over the n features.

To write various model attributes to text files, you could use json,
e.g., see https://cmry.github.io/notes/serialize
However, I don’t think that this approach will solve the problem of
loading the model into C++.
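
A minimal sketch of the linear-model case described above, assuming a LinearRegression fitted on made-up data (it does not cover GradientBoostingRegressor): the parameters are dumped as JSON text and the prediction is recomputed by hand.

```python
import json
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data (assumption).
rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0

model = LinearRegression().fit(X, y)

# Plain-text (JSON) dump of the fitted parameters.
payload = json.dumps(
    {"weights": model.coef_.tolist(), "bias": float(model.intercept_)}
)

# Any language with a JSON parser can reproduce the prediction:
# w_1 * x_1 + ... + w_n * x_n + bias
params = json.loads(payload)
x_new = [0.2, 0.4, 0.6]
manual = sum(w * x for w, x in zip(params["weights"], x_new)) + params["bias"]
```

The same weighted sum is straightforward to write in C++ once the weights are parsed from the text file.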

Best,
Sebastian

> On Apr 13, 2017, at 4:58 PM, 老陈 <26743...@qq.com> wrote:
>
> Hi,
>
> I am working on GradientBoostingRegressor these days and I am
wondering if there is a way to dump the model into txt file, or any
other format that can be processed by c++
>
> My production system is in c++, so I want use the python-trained
tree model in c++ for production.
>
> Has anyone ever done this before?
>
> thanks


Re: [scikit-learn] Machine learning for PU data

2017-06-30 Thread Roman Yurchak

Hello Ruchika,

I don't think that scikit-learn currently has algorithms that can train 
with positive and unlabeled class labels only. However, you could try 
one of the following compatible wrappers,
  - 
http://nktmemo.github.io/jekyll/update/2015/11/07/pu_classification.html

  - https://github.com/scikit-learn/scikit-learn/pull/371

(haven't tried them myself).

Also, you could try one class SVM as suggested here 
https://stackoverflow.com/questions/25700724/binary-semi-supervised-classification-with-positive-only-and-unlabeled-data-set
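
A rough sketch of the one-class SVM idea, on synthetic data and with arbitrary hyperparameters (both are assumptions): the model is fitted on the labeled positives only and then applied to the unlabeled samples.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_pos = rng.normal(loc=0.0, scale=1.0, size=(200, 2))        # labeled positives
X_unlabeled = rng.normal(loc=0.0, scale=3.0, size=(100, 2))  # mixed, unlabeled

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.1).fit(X_pos)

# +1: looks like the positives, -1: does not
pred = ocsvm.predict(X_unlabeled)
```

Whether this is good enough compared to a dedicated PU-learning method would need to be checked on the actual data.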


--
Roman



On 30/06/17 16:06, Ruchika Nayyar wrote:

Hi All,

I am a scikit-learn user and have a question for the community, if
anyone has applied any available machine learning algorithms in the
scikit-learn package for data with positive and unlabeled class only? If
so would you share some insight with me. I understand this could be a
broader topic but I am new to analyzing PU data and hence can use some
help.

Thanks,
Ruchika





Re: [scikit-learn] Construct the microclusters using a CF-Tree

2017-06-30 Thread Roman Yurchak

Hello Sema,

On 30/06/17 17:14, Sema Atasever wrote:

I want to cluster them using Birch clustering algorithm.
Does this method have 'precomputed' option.


No it doesn't, see 
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html 
so you would need to provide it with the original features matrix (not 
the precomputed distance matrix). Since your dataset is fairly small, 
there is no reason in precomputing it anyway.



I needed train an SVM on the centroids of the microclusters so
*How can i get the centroids of the microclusters?*


By "microclusters" do you mean sub-clusters? If you are interested in 
the leaf subclusters, see the Birch.subcluster_centers_ attribute.


Otherwise if you want all the centroids in the hierarchy of subclusters, 
you can browse the hierarchical tree via the  Birch.root_ attribute then 
look at _CFSubcluster.centroid_ for each subcluster.
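
A possible way to browse the hierarchy is sketched below. It relies on private attributes of the Birch implementation (subclusters_, child_, centroid_), which may change between scikit-learn versions; the data and the helper name iter_subclusters are made up for illustration.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(0)
X = rng.rand(300, 2)
brc = Birch(threshold=0.1, branching_factor=10, n_clusters=None).fit(X)

def iter_subclusters(node, depth=0):
    """Yield (depth, subcluster) for every subcluster below `node`."""
    for sc in node.subclusters_:
        yield depth, sc
        if sc.child_ is not None:
            yield from iter_subclusters(sc.child_, depth + 1)

# Centroids of all subclusters in the hierarchy, with their depth.
all_centroids = [(depth, sc.centroid_) for depth, sc in iter_subclusters(brc.root_)]

# Centroids of the leaf subclusters only (those without a child node).
leaf_centroids = [
    sc.centroid_ for _, sc in iter_subclusters(brc.root_) if sc.child_ is None
]
```

The leaf centroids collected this way should be the same set as Birch.subcluster_centers_, possibly in a different order.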


Hope this helps,
--
Roman


Re: [scikit-learn] Construct the microclusters using a CF-Tree

2017-07-03 Thread Roman Yurchak

Hello Sema,

as far as I can tell, in your dataset you have n_samples=65909, 
n_features=539. Clustering high dimensional data is problematic for a 
number of reasons, 
https://en.wikipedia.org/wiki/Clustering_high-dimensional_data#Problems


Besides, the BIRCH implementation doesn't scale well for n_features >> 50 
(see for instance the discussion in the second part of 
https://github.com/scikit-learn/scikit-learn/pull/8808#issuecomment-300776216).


As a workaround for the memory error, you could try using the 
out-of-core version of Birch (using `partial_fit` on chunks of the 
dataset, instead of `fit`) but in any case it might also be better to 
reduce dimensionality beforehand (e.g. with PCA), if that's acceptable. 
Also the threshold parameter may need to be increased: since in your 
dataset it looks like the Euclidean distances are more in the 1-10 range?
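
A minimal sketch of this workaround, on a synthetic array standing in for the real data (the shapes, threshold and chunk size are assumptions). In a truly out-of-core setting the chunks would be read from disk, and the dimensionality reduction would also need to be incremental.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(1000, 50)  # synthetic stand-in for the dataset

# Reduce dimensionality before clustering.
X_reduced = PCA(n_components=5, random_state=0).fit_transform(X)

# Build the CF-tree chunk by chunk instead of calling fit on everything.
brc = Birch(threshold=0.5, n_clusters=None)
chunk_size = 250
for start in range(0, X_reduced.shape[0], chunk_size):
    brc.partial_fit(X_reduced[start:start + chunk_size])

labels = brc.predict(X_reduced)
```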


--
Roman


On 03/07/17 17:09, Sema Atasever wrote:

Dear Roman,

When I try the code with the original data (*data.dat*) as you
suggested, I get the following error : *Memory Error* --> (*error.png*),
how can I overcome this problem? Thank you so much in advance.

data.dat
<https://drive.google.com/file/d/0B4rY6f4kvHeCYlpZOURKNnR0Q1k/view?usp=drive_web>

On Fri, Jun 30, 2017 at 5:42 PM, Roman Yurchak <rth.yurc...@gmail.com> wrote:

Hello Sema,

On 30/06/17 17:14, Sema Atasever wrote:

I want to cluster them using Birch clustering algorithm.
Does this method have 'precomputed' option.


No it doesn't, see
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html
so you would need to provide it with the original features matrix
(not the precomputed distance matrix). Since your dataset is fairly
small, there is no reason in precomputing it anyway.

I needed train an SVM on the centroids of the microclusters so
*How can i get the centroids of the microclusters?*


By "microclusters" do you mean sub-clusters? If you are interested
in the leaf subclusters, see the Birch.subcluster_centers_ attribute.

Otherwise if you want all the centroids in the hierarchy of
subclusters, you can browse the hierarchical tree via the
Birch.root_ attribute then look at _CFSubcluster.centroid_ for each
subcluster.

Hope this helps,
--
Roman


Re: [scikit-learn] Construct the microclusters using a CF-Tree

2017-07-06 Thread Roman Yurchak

Hello Sema,

On 05/07/17 13:27, Sema Atasever wrote:

How can i know which cluster member represents best each cluster?


You could try to pick the one that's closest to the cluster centroid..


In the birch code i use this code line: *centroids =
brc.subcluster_centers_*
How do I interpret this line of code output?


It is supposed to give you the centroid of each leaf node (computed in 
https://github.com/scikit-learn/scikit-learn/blob/ab93d65/sklearn/cluster/birch.py#L472). 



I would just recompute the centroids from the labels, though, with
  [X[brc.labels_ == k, :].mean(axis=0) for k in np.unique(brc.labels_)]
to be sure of the results...
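
Both suggestions in this message can be sketched together as follows, assuming synthetic data and n_clusters=None (so that labels_ refer to the leaf subclusters): the centroid is recomputed per label, and the member closest to it is taken as the representative.

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(0)
X = rng.rand(200, 3)  # synthetic data (assumption)
brc = Birch(threshold=0.3, n_clusters=None).fit(X)

centroids = {}
representatives = {}
for k in np.unique(brc.labels_):
    members = np.flatnonzero(brc.labels_ == k)
    centroid = X[members].mean(axis=0)  # axis=0: one mean per feature
    dist = np.linalg.norm(X[members] - centroid, axis=1)
    centroids[k] = centroid
    # Index (into X) of the member closest to the centroid.
    representatives[k] = members[np.argmin(dist)]
```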

--
Roman



Re: [scikit-learn] How can i write the birch prediction results to the file

2017-08-22 Thread Roman Yurchak

Hello Sema,

On 22/08/17 11:24, Sema Atasever wrote:
> "joblib.dump" produces a file format with npy extension so I can not 
open the file with the notepad editor. I can not see the predictions 
results inside the file.


Is there another way to save the prediction results in text format?


Prediction results are just an array: you could use numpy.savetxt to 
save them in an ascii text format.
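
For example, a short sketch with a made-up array of labels standing in for the output of brc.predict(X), written to a temporary file:

```python
import os
import tempfile
import numpy as np

labels = np.array([0, 2, 1, 1, 0])  # stand-in for brc.predict(X) (assumption)

path = os.path.join(tempfile.mkdtemp(), "predictions.txt")
np.savetxt(path, labels, fmt="%d")  # one integer label per line

# The file can be opened in any text editor, or read back with NumPy.
reloaded = np.loadtxt(path, dtype=int)
```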


--
Roman



Re: [scikit-learn] Accessing Clustering Feature Tree in Birch

2017-08-23 Thread Roman Yurchak

> what are the data samples in this cluster

Mehmet's response below works for exploring the hierarchical tree. 
However, Birch currently doesn't store the data samples that belong to a 
given subcluster. If you need that, as far as I know, a reasonable 
approximation can be obtained by computing the data samples that are 
closest to the centroid of the considered subcluster (accessible via 
_CFNode.centroids_) as compared to all other subcluster centroids at 
this hierarchical tree depth.


Alternatively, the modifications in PR 
https://github.com/scikit-learn/scikit-learn/pull/8808 aimed to make 
this process easier..
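
A sketch of this approximation at the root level of the tree, using synthetic data. It relies on the private _CFNode.centroids_ attribute, so it may break with other scikit-learn versions.

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.metrics import pairwise_distances_argmin

rng = np.random.RandomState(0)
X = rng.rand(300, 2)  # synthetic data (assumption)
brc = Birch(threshold=0.1, branching_factor=10, n_clusters=None).fit(X)

# Centroids of the subclusters directly under the root node (depth 0).
root_centroids = brc.root_.centroids_

# For each sample, index of the closest subcluster centroid at this depth.
nearest = pairwise_distances_argmin(X, root_centroids)

samples_per_subcluster = {
    idx: np.flatnonzero(nearest == idx) for idx in range(len(root_centroids))
}
```

The same can be repeated for the centroids of any deeper node reached through the child_ attribute of a subcluster.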

--
Roman

On 23/08/17 13:44, Suzen, Mehmet wrote:

Hi Sema,

You can access CFNode from the fit output, assign fit output, so you
can have the object.

brc_fit = brc.fit(X)
brc_fit_cfnode = brc_fit.root_


Then you can access CFNode, see here
https://kite.com/docs/python/sklearn.cluster.birch._CFNode

Also, this example comparing mini batch kmeans.
http://scikit-learn.org/stable/auto_examples/cluster/plot_birch_vs_minibatchkmeans.html

Hope this was what you are after.

Best,
Mehmet

On 23 August 2017 at 10:55, Sema Atasever wrote:

Dear scikit-learn members,

Considering the "CF-tree" data structure :

- How can i access Clustering Feature Tree in Birch?

- For example, how many clusters are there in the hierarchy under the root
node and what are the data samples in this cluster?

- Can I get them separately for 3 trees?

Best.



Re: [scikit-learn] TF-IDF

2017-10-02 Thread Roman Yurchak

Hi Apurva,

if you consider the operations done by the augmented frequency and the 
cosine normalization independently from everything else, they are 
somewhat similar. The normalization by max is a p-norm with p→+∞. So 
apart from the 0.5 offset, both can be seen as document length 
normalization with a different p value.


However, in TF-IDF you would typically have an IDF document 
weighting operation between the term frequency weighting and the 
normalization, in which case the effect of both will be quite different. 
Generally I find that the SMART IR notation is very useful to represent 
different phases of the TF-IDF transformation.
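
To make the comparison between the max normalization and the cosine normalization concrete, here is a small NumPy sketch on made-up term counts for a single document. It assumes the augmented frequency is only applied to terms present in the document.

```python
import numpy as np

tf = np.array([3.0, 1.0, 0.0, 2.0])  # raw term counts of one document (made up)

max_norm = np.linalg.norm(tf, ord=np.inf)  # p -> +inf: equals tf.max()
l2_norm = np.linalg.norm(tf, ord=2)        # p = 2: cosine normalization

tf_max = tf / max_norm
tf_l2 = tf / l2_norm

# Augmented frequency: 0.5 offset plus the max-normalized counts.
augmented = np.where(tf > 0, 0.5 + 0.5 * tf_max, 0.0)
```

Both tf_max and tf_l2 divide the counts by a norm of the same vector; only the value of p differs.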


The default parameters of TfidfTransformer are a good choice that will 
work well in most cases. Also, depending on the algorithm that you use 
afterwards, not having your data normalized by an actual norm (e.g. 
cosine) may be sub-optimal. Still, if you want to fine tune your 
document normalization have a look at the "Pivoted Document Length 
Normalization" paper by Singhal et al. There is a compatible 
implementation of this and a few other TF-IDF schemes in 
http://freediscovery.io/doc/stable/python/generated/freediscovery.feature_weighting.SmartTfidfTransformer.html


In the end, it's probably easier to try different options on your 
dataset to see what works and what doesn't. You could just determine it 
by cross-validating..
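
A minimal sketch of such a cross-validation, on a tiny made-up corpus and with LogisticRegression as an arbitrary downstream classifier (both are assumptions):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy corpus with two classes (made up for illustration).
docs = [
    "cheap flights to paris", "book cheap hotel paris",
    "flights and hotel deals", "paris hotel booking deals",
    "python sparse matrix bug", "numpy array indexing bug",
    "python numpy matrix error", "sparse array indexing error",
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])

# TF-IDF options to compare.
param_grid = {
    "tfidf__norm": ["l1", "l2"],
    "tfidf__sublinear_tf": [False, True],
    "tfidf__use_idf": [False, True],
}
search = GridSearchCV(pipe, param_grid, cv=2).fit(docs, y)
```

After fitting, search.best_params_ holds the best combination found and search.cv_results_ the scores for all of them.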


--
Roman

On 27/09/17 13:53, Apurva Nandan wrote:

Hello,

Could anybody tell me the difference between using augmented frequency
(which is used for weighting term frequencies to eliminate the bias
towards larger documents) and cosine normalization (l2 norm which
scikit-learn uses for TfidfTransformer).
Augmented frequency is given by the following equation. It tries to
divide the natural term frequency by the maximum frequency of any term
in the document.

Inline image 1

Do they both do the same thing when it comes to eliminating bias towards
larger documents? I suppose scikit-learn uses the natural term freq, and
using cosine normalization is enabled with using norm=l2

Any help would be appreciated!

- Apurva




Re: [scikit-learn] Accessing Clustering Feature Tree in Birch

2017-10-02 Thread Roman Yurchak

Hello,

sklearn.cluster.Birch follows the original BIRCH paper, that appears to 
be mostly focused on efficiently building the hierarchical clustering 
tree (and not so much on making the later analysis user friendly). The 
attributes exposed by Birch are those that could be reasonably exposed 
given the scikit-learn API constraints. Though, one does have access to 
the full cluster hierarchy via the Birch.root_.


As Joel said, traversing the tree is a standard CS problem, and there is 
also probably a number of operations that could be done with it, 
depending on the application. For instance, for my use case, I found 
that re-constructing the Birch hierarchy using a custom container class 
for each subcluster was the easiest to run subsequent analysis with. A 
detailed example can be found here,

http://freediscovery.io/doc/stable/python/examples/birch_cluster_hierarchy.html
Alternatively, I wonder if converting the tree to a format readable by 
some tree/graph specialized library (e.g. networkx) could be useful for 
analysis.


Generally there is a number of places in scikit-learn where trees are 
used (Birch, AgglomerativeClustering, tree-based classifiers, etc.) but 
for now there is no way to export the constructed tree to some standard 
format (apart from sklearn.tree.export_graphviz). Not sure if this is 
realistically achievable though..


--
Roman

On 20/09/17 13:40, Sema Atasever wrote:

I need this information to use it in a scientific study and
I think that a function interface would make this easier.

Thank you for your answer.

On Sat, Sep 16, 2017 at 1:53 PM, Joel Nothman <joel.noth...@gmail.com> wrote:

There is no such thing as "the data samples in this cluster". The
point of Birch being online is that it loses any reference to the
individual samples that contributed to each node, but stores some
statistics on their basis. Roman Yurchak has, however, offered a PR
where, for the non-online case, storage of the indices contributing
to each node can be optionally turned on:
https://github.com/scikit-learn/scikit-learn/pull/8808

As for finding what is contained under any particular node,
traversing the tree is a fairly basic task from a computer science
perspective. Before we were to support something to make this much
easier, I think we'd need to be clear on what kinds of use case we
were supporting. What do you hope to do with this information, and
what would a function interface look like that would make this much
easier?

Decimals aren't a practical option as the branching factor may be
greater than 10, it is a hard structure to inspect, and susceptible
to computational imprecision. Better off with a list of tuples, but
what for that is not easy enough to do now?





Re: [scikit-learn] unclear help file for sklearn.decomposition.pca

2017-10-16 Thread Roman Yurchak

Ismael,

as far as I saw the sklearn.decomposition.PCA doesn't mention scaling at 
all (except for the whiten parameter which is post-transformation scaling).


So since it doesn't mention it, it makes sense that it doesn't do any 
scaling of the input. Same as np.linalg.svd.


You can verify that PCA and np.linalg.svd yield the same results, with

```
>>> import numpy as np
>>> from sklearn.decomposition import PCA
>>> import numpy.linalg
>>> X = np.random.RandomState(42).rand(10, 4)
>>> n_components = 2
>>> PCA(n_components, svd_solver='full').fit_transform(X)
```

and

```
>>> U, s, V = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
>>> (X - X.mean(axis=0)).dot(V[:n_components].T)
```
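
The comparison can also be done programmatically. The sketch below reuses the same X and checks that the two projections agree up to the sign of each component, since the sign of singular vectors is not uniquely defined:

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.random.RandomState(42).rand(10, 4)
n_components = 2

X_pca = PCA(n_components=n_components, svd_solver="full").fit_transform(X)

X_centered = X - X.mean(axis=0)  # centering only, no scaling
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_svd = X_centered.dot(Vt[:n_components].T)

# Compare absolute values to ignore per-component sign flips.
same_up_to_sign = np.allclose(np.abs(X_pca), np.abs(X_svd))
```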

--
Roman

On 16/10/17 03:42, Ismael Lemhadri wrote:

Dear all,
The help file for the PCA class is unclear about the preprocessing
performed to the data.
You can check on line 410 here:
https://github.com/scikit-learn/scikit-learn/blob/ef5cb84a/sklearn/decomposition/pca.py#L410

that the matrix is centered but NOT scaled, before performing the
singular value decomposition.
However, the help files do not make any mention of it.
This is unclear for someone who, like me, just wanted to compare that
the PCA and np.linalg.svd give the same results. In academic settings,
students are often asked to compare different methods and to check that
they yield the same results. I expect that many students have confronted
this problem before...
Best,
Ismael Lemhadri




Re: [scikit-learn] unclear help file for sklearn.decomposition.pca

2017-10-16 Thread Roman Yurchak

On 16/10/17 17:16, Ismael Lemhadri wrote:

My concern is actually not about not mentioning the scaling but about
not mentioning the centering.
That is, the sklearn PCA removes the mean but it does not mention it in
the help file.


I think it's currently assumed given the definition of the PCA, but you 
are right, the subtraction of the mean and the relationship to the SVD 
decomposition (i.e. TruncatedSVD) could be more clearly stated in the 
docstring and in the user manual,


http://scikit-learn.org/stable/modules/decomposition.html#principal-component-analysis-pca

Feel free to open an issue on Github about it or to submit a pull 
request to improve the documentation,


--
Roman


Re: [scikit-learn] 1. Re: unclear help file for sklearn.decomposition.pca

2017-10-16 Thread Roman Yurchak

It might be useful to have some of these comments in the docs.

Currently the PCA docstring only states that PCA is computed with SVD and 
then goes on discussing randomized SVD solvers. The user guide is not 
more helpful on this subject either,


Ismael opened a documentation PR on it in 
https://github.com/scikit-learn/scikit-learn/pull/9934


--
Roman

On 16/10/17 21:29, Sebastian Raschka wrote:

Oh, never mind my previous email, because while the components should be
the same, the projection of the data points onto those components would
still be affected by centering vs non-centering I guess.

Best,
Sebastian


On Oct 16, 2017, at 3:25 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:

Hi,

if you compute the principal components (i.e., eigendecomposition)
from the covariance matrix, it shouldn't matter whether the data is
centered or not, since the covariance matrix is computed as

CovMat = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}) (x_i - \bar{x})^T

where \bar{x} = vector of feature means

So, if you center the data prior to computing the covariance matrix,
\bar{x} is simply 0.
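
A small NumPy check of this statement on synthetic data (note that np.cov divides by n - 1 rather than n, which does not affect the comparison). It also shows that, although the covariance matrix is unchanged by centering, the projections of centered and non-centered data differ by a constant offset.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(50, 3) + 5.0  # data with a clearly non-zero mean
X_centered = X - X.mean(axis=0)

cov_raw = np.cov(X, rowvar=False)
cov_centered = np.cov(X_centered, rowvar=False)

# Principal axes from the covariance matrix (eigenvectors as columns).
_, eigvecs = np.linalg.eigh(cov_raw)

proj_raw = X @ eigvecs
proj_centered = X_centered @ eigvecs
offset = proj_raw - proj_centered  # the same row for every sample
```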

Best,
Sebastian


On Oct 16, 2017, at 2:27 PM, Ismael Lemhadri <lemha...@stanford.edu> wrote:

@Andreas Muller:
My references do not assume centering,
e.g. http://ufldl.stanford.edu/wiki/index.php/PCA
any reference?



On Mon, Oct 16, 2017 at 10:20 AM, <scikit-learn-requ...@python.org> wrote:


Message: 1
Date: Mon, 16 Oct 2017 13:19:57 -0400
From: Andreas Mueller <t3k...@gmail.com>
To: scikit-learn@python.org
Subject: Re: [scikit-learn] unclear help file for
sklearn.decomposition.pca
Message-ID: <04fc445c-d8f3-a3a9-4ab2-0535826a2...@gmail.com>
Content-Type: text/plain; charset="utf-8"; Format="flowed"

The definition of PCA has a centering step, but no scaling step.

On 10/16/2017 11:16 AM, Ismael Lemhadri wrote:
> Dear Roman,
> My concern is actually not about not mentioning the scaling but about
> not mentioning the centering.
> That is, the sklearn PCA removes the mean but it does not mention it
> in the help file.
> This was quite messy for me to debug as I expected it to either: 1/
> center and scale simultaneously or 2/ not scale and not center either.
> It would be beneficial to make the behavior explicit in the help file
> in my opinion.
> Ismael
>

Re: [scikit-learn] Text classification of large dataet

2017-12-20 Thread Roman Yurchak

Ranjana,

have a look at this example 
http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html


Since you have a lot of RAM, you may not need to make the whole 
classification pipeline out-of-core. A start with your current code 
could be to write a generator that loads and pre-processes the text in 
chunks, then feeds it one document at a time to CountVectorizer.fit (it 
accepts an iterable). To reduce the memory usage, filtering too frequent 
tokens (instead of the infrequent ones) could help too. Make sure you L2 
normalize your data before the classifier. You could use 
SGDClassifier(loss='log') or LogisticRegression with a sag or saga 
solver. The multi_class="multinomial" parameter might also be worth 
trying, particularly since you have so many classes.
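
A rough sketch of this workflow on a toy corpus. The chunks, labels, max_df value and the helper name iter_documents are all assumptions for illustration; in practice the chunks would be read from disk (e.g. with pandas.read_csv(..., chunksize=...)).

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import Normalizer

# Stand-in for chunks of text read from disk (made up).
chunks = [
    ["red cotton shirt", "blue cotton shirt", "steel kitchen knife"],
    ["ceramic kitchen plate", "green wool shirt", "steel kitchen fork"],
]
labels = ["clothing", "clothing", "kitchen", "kitchen", "clothing", "kitchen"]

def iter_documents(chunks):
    """Yield one pre-processed document at a time."""
    for chunk in chunks:
        for text in chunk:
            yield text.lower()

# Drop tokens present in more than 90% of the documents.
vectorizer = CountVectorizer(max_df=0.9)
X = vectorizer.fit_transform(iter_documents(chunks))

# L2-normalize each row before the classifier.
X = Normalizer(norm="l2").fit_transform(X)

clf = LogisticRegression(solver="saga", max_iter=1000).fit(X, labels)
```

The document-term matrix stays sparse throughout, which is what keeps the memory usage manageable on a large corpus.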


--
Roman

On 19/12/17 15:38, Ranjana Girish wrote:

Hai all,

I am doing text classification. I have around 10 million records to be
classified into around 7k categories.

Below is the code I am using

# Importing the libraries
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
import re
from sklearn.feature_extraction.text import CountVectorizer
import random
from sklearn.naive_bayes import MultinomialNB,GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
from sklearn import feature_selection
from scipy.sparse import csr_matrix
from scipy import sparse
import sys
from sklearn import preprocessing
import numpy as np
import pickle

sys.setrecursionlimit(2)

random.seed(2)

trainset1=pd.read_csv("trainsetgrt500sample10.csv",encoding = "ISO-8859-1")
trainset2=pd.read_csv("trainsetlessequal500.csv",encoding = "ISO-8859-1")

dataset=pd.concat([trainset1,trainset2])

dataset=dataset.dropna()

dataset['ProductDescription']=dataset['ProductDescription'].str.replace('[^a-zA-Z]', ' ')
dataset['ProductDescription']=dataset['ProductDescription'].str.replace('[\d]', ' ')
dataset['ProductDescription']=dataset['ProductDescription'].str.lower()

del trainset1
del trainset2

stop = stopwords.words('english')
lemmatizer = WordNetLemmatizer()

dataset['ProductDescription']=dataset['ProductDescription'].str.replace(r'\b(' + r'|'.join(stop) + r')\b\s*', ' ')
dataset['ProductDescription']=dataset['ProductDescription'].str.replace('\s\s+',' ')
dataset['ProductDescription'] =dataset['ProductDescription'].apply(word_tokenize)
ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
POS_LIST = [NOUN, VERB, ADJ, ADV]
for tag in POS_LIST:
    dataset['ProductDescription'] = dataset['ProductDescription'].apply(lambda x: list(set([lemmatizer.lemmatize(item,tag) for item in x])))
dataset['ProductDescription']=dataset['ProductDescription'].apply(lambda x : " ".join(x))

countvec = CountVectorizer(min_df=0.8)
documenttermmatrix=countvec.fit_transform(dataset['ProductDescription'])
documenttermmatrix.shape
column=countvec.get_feature_names()
filename1 = 'columnnamessample10mastermerge.sav'
pickle.dump(column, open(filename1, 'wb'))

y_train=dataset['classpath']
y_train=dataset['classpath'].tolist()
labels_train= preprocessing.LabelEncoder()
labels_train.fit(y_train)
y1_train=labels_train.transform(y_train)

del dataset
del countvec
del column

clf = MultinomialNB()
model=clf.fit(documenttermmatrix,y_train)

filename2 = 'modelnaivebayessample10withfs.sav'
pickle.dump(model, open(filename2, 'wb'))

I am using a system with 128 GB RAM.

As I was unable to train on all 10 million records, I did stratified
sampling and the training set was reduced to 2.3 million.

Still, I was unable to train on 2.3 million records.

I got a memory error when I used random forest (n_estimators=30), Naive
Bayes and SVM.

I am stuck.

Can anyone please tell me whether there is any memory leak in my code and
how to use a system with 128 GB RAM effectively?


Thanks
Ranjana





Re: [scikit-learn] CountVectorizer: Additional Feature Suggestion

2018-01-30 Thread Roman Yurchak

Hi Yacine,

On 29/01/18 16:39, Yacine MAZARI wrote:
 >> I wouldn't hate if length normalisation was added to  if it was shown
 >> that normalising before IDF multiplication was more effective than (or
 >> complementary to) norming afterwards.

I think this is one of the most important points here.
Though not a formal proof, I can for example refer to:

  * NLTK, which is using document-length-normalized term frequencies.

  * Manning and Schütze's Introduction to Information Retrieval:
"The same considerations that led us to prefer weighted
representations, in particular length-normalized tf-idf
representations, in Chapters 6 and 7 also apply here."


I believe the conclusion of Manning's Chapter 6 is the following 
table with TF-IDF weighting schemes 
https://nlp.stanford.edu/IR-book/html/htmledition/document-and-query-weighting-schemes-1.html 
in which the document length normalization is applied _after_ the IDF. 
So "length-normalized tf-idf" is just TfidfVectorizer with norm='l1' as 
previously mentioned (at least, if you measure the document length as 
the number of words it contains).
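
As a quick illustration on a made-up corpus: with norm='l1' each row of the output sums to 1, and with use_idf=False the values reduce to the term counts divided by the number of tokens in the document.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the cat sat on the mat with the other cat", "a dog"]

# TF-IDF followed by L1 normalization of each document.
X = TfidfVectorizer(norm="l1").fit_transform(docs)
row_sums = np.asarray(X.sum(axis=1)).ravel()

# Without IDF: plain term frequencies divided by the document length.
tf_only = TfidfVectorizer(norm="l1", use_idf=False).fit_transform(docs)
first_doc = tf_only[0].toarray().ravel()
```

Here the first document has three distinct tokens, so each non-zero entry of first_doc is 1/3.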
More generally a weighting & normalization transformer for some of the 
other configurations in that table is implemented in


http://freediscovery.io/doc/stable/python/generated/freediscovery.feature_weighting.SmartTfidfTransformer.html

With respect to the NLTK implementation, see 
https://github.com/nltk/nltk/pull/979#issuecomment-102296527


So I don't think there is a need to change anything in TfidfTransformer...

--
Roman


Re: [scikit-learn] Multi learn error.

2018-05-04 Thread Roman Yurchak

Hi Aijaz,

On 05/05/18 07:31, aijaz qazi wrote:
> Dear developers of Scikit ,

Scikit is short for SciPy Toolkits (https://www.scipy.org/scikits.html); 
there is a number of those. Scikit-learn started as one (and this is the 
scikit-learn mailing list).


The package you are referring to is based on scikit-learn but is a separate 
project (with a somewhat confusing home page URL). The right place to 
ask for support would be its Github issue tracker or other specific 
communication channels if it has any.


--
Roman


Re: [scikit-learn] DBScan freezes my computer !!!

2018-05-13 Thread Roman Yurchak
Could you please check memory usage while running DBSCAN to make sure 
freezing is due to running out of memory and not to something else?
Which parameters do you run DBSCAN with? Changing algorithm, leaf_size 
parameters and ensuring n_jobs=1 could help.


Assuming eps is reasonable, I think it shouldn't be an issue to run 
DBSCAN on L2 normalized data: using the default euclidean metric, this 
should produce somewhat similar results to clustering not normalized 
data with metric='cosine'.
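
The link between the two settings can be sketched as follows, on synthetic data and with an arbitrary eps (both assumptions). For unit-length vectors the squared Euclidean distance is twice the cosine distance, so an eps for the cosine metric corresponds to sqrt(2 * eps) for the Euclidean one.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances
from sklearn.preprocessing import normalize

rng = np.random.RandomState(0)
X = rng.rand(200, 5)  # synthetic data (assumption)
X_unit = normalize(X, norm="l2")

# For unit vectors: ||a - b||^2 = 2 * (1 - cos(a, b)) = 2 * cosine_distance
d_euclid = euclidean_distances(X_unit)
d_cosine = cosine_distances(X)

eps_cosine = 0.02
eps_euclid = np.sqrt(2 * eps_cosine)

labels_euclid = DBSCAN(eps=eps_euclid, min_samples=5).fit_predict(X_unit)
labels_cosine = DBSCAN(eps=eps_cosine, min_samples=5, metric="cosine").fit_predict(X)
```

With matching eps values the two runs should give very similar clusterings; with an eps tuned for the non-normalized data they will not.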


On 13/05/18 00:20, Andrew Nystrom wrote:
If you’re l2 norming your data, you’re making it live on the surface of 
a hypersphere. That surface will have a high density of points and may 
not have areas of low density, in which case the entire surface could be 
recognized as a single cluster if epsilon is high enough and min 
neighbors is low enough. I’d suggest not using l2 norm with DBSCAN.

On Sat, May 12, 2018 at 7:27 AM Mauricio Reis wrote:


The DBScan "fit" method (in scikit-learn v0.19.1) is freezing my
computer without any warning message!

I am using WinPython 3.6.5 64 bit.

The method works normally with the original data, but freezes when I
use the normalized data (between 0 and 1).

What should I do?

Att.,
Mauricio Reis


Re: [scikit-learn] Error

2018-05-21 Thread Roman Yurchak
Try opening an issue at their Github issue tracker 
https://github.com/scikit-multilearn/scikit-multilearn/issues ; 
providing a detailed description of the issue takes some time but would 
also make it more likely to get an answer there (see 
https://stackoverflow.com/help/mcve).


--
Roman

On 21/05/18 11:33, aijaz qazi wrote:

Dev of scikit multilearn is not responding at all.



/*Regards,*/
/*Aijaz A.Qazi */

On Mon, May 21, 2018 at 2:47 PM, Guillaume Lemaître 
<g.lemaitr...@gmail.com> wrote:


check with the dev of scikit multilearn directly.

Sent from my phone - sorry to be brief and potential misspell.

*From:* aqsdm...@gmail.com 
*Sent:* 21 May 2018 11:12 am
*To:* scikit-learn@python.org 
*Reply to:* scikit-learn@python.org 
*Subject:* [scikit-learn] Error


Scikit Multilearn  does not work.




Regards,
Aijaz A.Qazi



Re: [scikit-learn] Update or downgrade PCA

2018-07-03 Thread Roman Yurchak

Hi Pamphile,

On 03/07/18 10:41, Pamphile Roy wrote:
I have some code that allows to upgrade (or downgrade) a PCA with a new 
sample.
The update part is handy when you are doing live observations for 
instance and you want a quick way to update your PCA without having to 
recompute the whole thing from scratch.

> [..]
> [1] M. Brand: Fast low-rank modifications of the thin singular value 
decomposition.


Do you know how this would compare with
sklearn.decomposition.IncrementalPCA?
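
For reference, a minimal sketch of how IncrementalPCA can be fed batches of new observations through partial_fit. The data is synthetic and the batch size arbitrary. As far as I know it only supports adding samples, not removing them, so it would cover the "update" part rather than the "downgrade" part.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.RandomState(0)
X = rng.randn(100, 5)  # synthetic data, for illustration only

ipca = IncrementalPCA(n_components=2)

# Feed the data in batches, as if new observations arrived over time.
# Each batch must contain at least n_components samples.
for batch in np.array_split(X, 4):
    ipca.partial_fit(batch)

X_reduced = ipca.transform(X)
print(X_reduced.shape, ipca.n_samples_seen_)
```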


--
Roman


Re: [scikit-learn] scikit-learn website and documentation

2019-09-02 Thread Roman Yurchak

Hello Chiara,

as far as I understood scikit-learn#14849 started as an incremental 
improvement of the scikit-learn website and ended up as a more in depth 
rewrite of the sphinx theme.


If you have any comments or suggestions don't hesitate to comment on
that issue. For instance, that PR went with Bootstrap and I'm wondering
about the advantages/limitations with respect to using something like
PureCSS.


Reviews of that PR would also be very much appreciated.

--
Roman

On 30/08/2019 18:58, Chiara Marmo wrote:

Hello,

Should I consider this PR [1] as an answer? ;)

Cheers,
Chiara

[1] https://github.com/scikit-learn/scikit-learn/pull/14849


On Sat, Aug 24, 2019 at 1:53 PM Chiara Marmo wrote:


Hi Nicolas,

Working on the visuals and contents of the docs is within my skills and
I'm happy to finish the job.
But I'm not a web designer and I don't like to impose myself... :)

Maybe you can check at the Monday meeting if everybody is ok with
that and write down comments in the minutes? For the next meeting I
will be available for collecting specifications, if any.

Gaël, I will check purecss.io: how much
customization the basic theme needs has to be considered too.

CiaoCiao

Chiara




Re: [scikit-learn] Vote on SLEP009: keyword only arguments

2019-09-16 Thread Roman Yurchak
+1 assuming we are careful about continuing to allow some frequently 
used positional arguments, even in __init__.


For instance,

n_components = 10
pca = PCA(n_components)

is still more readable, I think, than,

pca = PCA(n_components=n_components)
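
A minimal sketch of the Python 3 syntax under discussion, using a hypothetical ToyPCA class (not scikit-learn's PCA, and not necessarily the signature that will be chosen): the first argument stays positional, while everything after the bare `*` must be passed by keyword.

```python
class ToyPCA:
    """Hypothetical stand-in used only to illustrate the syntax."""

    def __init__(self, n_components=None, *, copy=True, whiten=False):
        self.n_components = n_components
        self.copy = copy
        self.whiten = whiten


# The first argument can still be passed positionally.
pca = ToyPCA(10)

# Everything after the bare "*" must be passed by keyword.
pca_whitened = ToyPCA(10, whiten=True)

# Passing a keyword-only argument positionally raises a TypeError.
try:
    ToyPCA(10, False)
    positional_rejected = False
except TypeError:
    positional_rejected = True

print(positional_rejected)
```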


--
Roman

On 15/09/2019 00:21, Thomas J Fan wrote:

+1 from me

On Sat, Sep 14, 2019 at 8:12 AM Joel Nothman wrote:


I am +1 for this change.

I agree that users will accommodate the syntax sooner or later.

On Fri., 13 Sep. 2019, 7:54 pm Jeremie du Boisberranger wrote:

I don't know what is the policy about a sklearn 1.0 w.r.t api
changes.

If it's meant to be a special release with possible api changes
without deprecation cycles, I think this change is a good
candidate for 1.0


Otherwise I'm +1 and agree with Guillaume, people will get used
to it by using it.

Jérémie



On 12/09/2019 10:06, Guillaume Lemaître wrote:

To the question: do we want to utilise Python 3's
force-keyword-argument syntax
and to change existing APIs which support arguments
positionally to use this
syntax, via a deprecation period?

I am +1.

IMO, even if the syntax might be unknown, it will remain
unknown as long as projects from the ecosystem are not using it.

To the question: which methods should be impacted?

I think we should be as gentle as possible at first. I am a
little concerned about breaking some code which was working fine before.

On Thu, 12 Sep 2019 at 04:43, Joel Nothman wrote:

There are details of specific API changes to be decided.

The question being put, as per the SLEP, is:
do we want to utilise Python 3's force-keyword-argument syntax
and to change existing APIs which support arguments
positionally to use this syntax, via a deprecation period?



-- 
Guillaume Lemaitre

INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/



Re: [scikit-learn] logistic regression results are not stable between solvers

2019-10-09 Thread Roman Yurchak

Ben,

I can confirm your results with penalty='none' and C=1e9. In both cases,
you are running a mostly unpenalized logistic regression. Usually
that's less numerically stable than with a small regularization,
depending on the data collinearity.


Running that same code with
 - larger penalty (smaller C values)
 - or larger number of samples
yields for me the same coefficients (up to some tolerance).

You can also see that SAGA convergence is not good by the fact that it 
needs 196000 epochs/iterations to converge.


Actually, I have often seen convergence issues with SAG on small 
datasets (in unit tests), not fully sure why.
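
A rough sketch of the kind of check described above, on synthetic data rather than Ben's dataset: with a moderate penalty (C=1.0 here, an arbitrary choice), standardized features and tight tolerances, lbfgs and saga should land on nearly the same coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic, non-separable data; not the data from the original report.
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    random_state=0,
)
X = StandardScaler().fit_transform(X)

# Same (default L2) penalty and tolerances for both solvers.
params = dict(C=1.0, tol=1e-8, max_iter=10000)
lbfgs = LogisticRegression(solver="lbfgs", **params).fit(X, y)
saga = LogisticRegression(solver="saga", random_state=0, **params).fit(X, y)

# Largest absolute difference between the two coefficient vectors.
max_coef_diff = np.abs(lbfgs.coef_ - saga.coef_).max()
print(max_coef_diff)
```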


--
Roman

On 09/10/2019 22:10, serafim loukas wrote:

The predictions across solver are exactly the same when I run the code.
I am using 0.21.3 version. What is yours?


In [13]: import sklearn

In [14]: sklearn.__version__
Out[14]: '0.21.3'


Serafeim



On 9 Oct 2019, at 21:44, Benoît Presles wrote:


(y_pred_lbfgs==y_pred_saga).all() == False





Re: [scikit-learn] scikit-learn twitter account

2019-11-05 Thread Roman Yurchak
Maybe re-purposing? I'm not sure if people find the current
approach of a tweet per PR useful.

It would make things less confusing to have 1 account.

Looking at how other OSS projects do this would also be interesting.

On 05/11/2019 06:14, Andreas Mueller wrote:

Should we re-purpose the existing twitter account or make a new one?
https://twitter.com/scikit_learn

We do have 6k followers already!

On 11/4/19 3:08 PM, Nelle Varoquaux wrote:

I think that's a good idea as well!

On Mon, 4 Nov 2019 at 15:06, Chiara Marmo wrote:


Be reassured Gael... no support via twitter... :)
Just a way to centralize messages and reach people that ping to
show that scikit-learn cares.

On Mon, Nov 4, 2019 at 2:04 PM Gael Varoquaux wrote:

On Mon, Nov 04, 2019 at 05:41:31PM +0530, Siddharth Gupta wrote:
> Would be good for the users to have a social media account
to reach out to.

I do not think that the point is to do support, but outreach.

Gaël




Re: [scikit-learn] Monthly meetings

2019-11-13 Thread Roman Yurchak

Thanks for the reminder!

Is there a way to put these periodic meetings in a calendar (either in 
some shared calendar or as calendar invitations for people who are 
likely to participate/were there last time) ?


Cheers,

Roman
On 13/11/2019 23:14, Nicolas Hug wrote:

Hey everyone,

The next monthly meeting is on Monday!

As usual, please be nice to the NYC people and *update your project
notes before Friday*. It'll be 7am for us :)



Cheers,

Nicolas


https://github.com/scikit-learn/scikit-learn/projects/15

https://appear.in/amueller

https://www.timeanddate.com/worldclock/meetingdetails.html?year=2019&month=11&day=18&hour=12&min=0&sec=0&p1=240&p2=33&p3=37&p4=179&p5=195






Re: [scikit-learn] Vote on SLEP010: n_features_in_ attribute

2019-12-06 Thread Roman Yurchak

On 04/12/2019 20:44, Joel Nothman wrote:
I am +1 for this, but I think we should look at how to make these new 
validation methods usable by external developers


+1 for the SLEP and for finding a way to make this method usable by
external developers, maybe as part of the developer API.



Re: [scikit-learn] Update of 'Upcoming events' on the scikit-learn wiki

2019-12-06 Thread Roman Yurchak

Thank you, Chiara!

I think announcing some of the main planned sprints on the mailing list
and twitter would be helpful. At the last sprint (in London), contributors
were interested in knowing how they could find out when the next sprints
would happen, and we didn't have a clear answer then (short of following
all discussions on the mailing list).


+1 also for linking from the wiki to scikit-learn sprints organized by
other organizations.


--
Roman

On 05/12/2019 10:56, Chiara Marmo wrote:

Dear core-devs,

I would like to advertise about our Paris sprint (end of January) on the 
scikit-learn wiki.
If there are no objections, my goal is to make a list of events in the 
'Upcoming events' page [1] and link the event pages from there.
This will allow to link also events organised by other entities (like 
WiMLDS) even if pages are not hosted there.


Please, let me know if you all are ok with that.

Thanks for listening,
best

Chiara

[1] https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events



Re: [scikit-learn] transfer learning doubt

2020-03-19 Thread Roman Yurchak

On 19/03/2020 14:19, Farzana Anowar wrote:
> Another option is to use deep learning and store the weights for the
> first model and initialize the second model with that weight and keep
> doing it for the rest of the models.


This can also be done in scikit-learn with models that support the
warm_start=True init parameter (including SGDClassifier).
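
A minimal sketch of that pattern with SGDClassifier on synthetic batches (the make_batch helper and the data are made up for illustration): with warm_start=True, a second call to fit starts from the coefficients of the previous fit instead of re-initializing them.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)


def make_batch(n_samples):
    """Hypothetical helper: a linearly separable toy batch."""
    X = rng.randn(n_samples, 5)
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
    return X, y


X_first, y_first = make_batch(200)
X_second, y_second = make_batch(200)

clf = SGDClassifier(warm_start=True, random_state=0)

# First model: trained from scratch on the first batch.
clf.fit(X_first, y_first)
coef_after_first = clf.coef_.copy()

# Second call to fit: initialized with the coefficients learned above.
clf.fit(X_second, y_second)
coef_after_second = clf.coef_.copy()

print(clf.score(X_second, y_second))
```

Estimators that implement partial_fit (SGDClassifier among them) offer another way to keep training on new batches.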


Roman


Re: [scikit-learn] Analysis of sklearn and other python libraries on github by MS team

2020-03-27 Thread Roman Yurchak

Very interesting! A few comments,

> From GH17, we managed to extract only 10.5k pipelines.  The 
relatively low frequency (with respect to the number of notebooks using 
SCIKIT-LEARN [..]) indicates a non-wide adoption of this specification. 
However, the number of pipelines in the GH19 corpus is 132k pipelines 
(i.e., an increase of 13× [..] since 2017).


It's nice to see that pipelines are indeed widely used.

> Top-5 transformers [from imports] in GH19 are StandardScaler, 
CountVectorizer, TfidfTransformer, PolynomialFeatures, TfidfVectorizer 
(in this order).  Same are the results for GH17 with the difference that 
PCA is instead of TfidfVectorizer.


Hmm, I would have expected OneHotEncoder somewhere at the top and much
less text processing. If there is real usage of CountVectorizer and
TfidfTransformer separately, then maybe deprecating TfidfVectorizer
could be done (see https://github.com/scikit-learn/scikit-learn/issues/14951).
Though this ranking looks quite unexpected. I wonder if they have the
full list and not just the top 5.


> Regarding learners, Top-5 in both GH17 and GH19 are 
LogisticRegression, MultinomialNB, SVC, LinearRegression, and 
RandomForestClassifier (in this order).


Maybe the LinearRegression docstring should more strongly suggest using
Ridge with small regularization in practice.


--
Roman

On 27/03/2020 17:32, Andreas Mueller wrote:

Hey all.
There's a pretty cool paper by a team at MS that analyses public github 
repos for their use of the sklearn and related libraries:

https://arxiv.org/abs/1912.09536

Thought it might be of interest.

Cheers,
Andy


Re: [scikit-learn] Vote: Add Adrin Jalali to the scikit-learn technical committee

2020-04-27 Thread Roman Yurchak

+1

On 27/04/2020 15:20, Jeremie du Boisberranger wrote:

+1

On 27/04/2020 15:18, Nicolas Hug wrote:

+1

On 4/27/20 9:16 AM, Gael Varoquaux wrote:

+1

And thank you very much Adrin!

On Mon, Apr 27, 2020 at 09:12:02AM -0400, Andreas Mueller wrote:

Hi All.
Given all his recent contributions, I want to nominate Adrin Jalali to the
Technical Committee:
https://scikit-learn.org/stable/governance.html#technical-committee
According to the governance document, this will require a discussion and
vote.
I think we can move to the vote immediately unless someone objects.
Thanks for all your work Adrin!
Cheers,
Andy


Re: [scikit-learn] Voting software

2020-04-27 Thread Roman Yurchak
BTW, could we use some online voting software for votes? Just to avoid
filling public email threads with +1s. For instance CPython uses
https://www.python.org/dev/peps/pep-8001/ but it is anonymous. Does
anyone know a simple non-anonymous one, preferably linked to GitHub
authentication?


On 27/04/2020 15:18, Nicolas Hug wrote:

+1

On 4/27/20 9:16 AM, Gael Varoquaux wrote:

+1

And thank you very much Adrin!

On Mon, Apr 27, 2020 at 09:12:02AM -0400, Andreas Mueller wrote:

Hi All.
Given all his recent contributions, I want to nominate Adrin Jalali to the
Technical Committee:
https://scikit-learn.org/stable/governance.html#technical-committee
According to the governance document, this will require a discussion and
vote.
I think we can move to the vote immediately unless someone objects.
Thanks for all your work Adrin!
Cheers,
Andy


Re: [scikit-learn] Voting software

2020-04-28 Thread Roman Yurchak
True, but ideally it would need to be something more voting oriented 
that cannot be modified later on and archives a history of past decisions.


On 27/04/2020 17:09, Hermes Morales wrote:

https://doodle.com/es/ is not bad



*From:* scikit-learn on behalf of Roman Yurchak
*Sent:* Monday, April 27, 2020 10:30:49 AM
*To:* Scikit-learn user and developer mailing list 
*Subject:* Re: [scikit-learn] Voting software
BTW, could we use some online voting software for votes? Just to avoid
filling public email threads with +1s. For instance CPython uses
https://www.python.org/dev/peps/pep-8001/ but it is anonymous. Does
anyone know a simple non anonymous one preferably linked to Github
authentication?

On 27/04/2020 15:18, Nicolas Hug wrote:

+1

On 4/27/20 9:16 AM, Gael Varoquaux wrote:

+1

And thank you very much Adrin!

On Mon, Apr 27, 2020 at 09:12:02AM -0400, Andreas Mueller wrote:

Hi All.
Given all his recent contributions, I want to nominate Adrin Jalali to the
Technical Committee:
https://scikit-learn.org/stable/governance.html#technical-committee
According to the governance document, this will require a discussion and
vote.
I think we can move to the vote immediately unless someone objects.
Thanks for all your work Adrin!
Cheers,
Andy


Re: [scikit-learn] imbalanced datasets return uncalibrated predictions - why?

2020-11-17 Thread Roman Yurchak

On 17/11/2020 09:57, Sole Galli via scikit-learn wrote:
And I understand that it has to do with the cost function, because if we
re-balance the dataset with, say, class_weight='balanced', then the
probabilities seem to be calibrated as a result.


As far as I know, logistic regression will have well calibrated
probabilities even in the imbalanced case. However, with the default 
decision threshold at 0.5, some of the infrequent categories may never 
be predicted since their probability is too low.


If you use class_weight='balanced', the probabilities will no longer
be well calibrated, however you would predict some of those infrequent 
categories.
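
A small sketch of this effect on a synthetic imbalanced dataset (about 5% positives, an arbitrary choice): without class weights the mean predicted probability stays close to the observed positive rate, while with class_weight='balanced' it is pushed well above it, and more samples get predicted as the rare class.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic imbalanced data, for illustration only.
X, y = make_classification(
    n_samples=5000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.95, 0.05], random_state=0,
)

plain = LogisticRegression(max_iter=1000).fit(X, y)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

# Observed frequency of the rare class vs. mean predicted probability.
positive_rate = y.mean()
mean_proba_plain = plain.predict_proba(X)[:, 1].mean()
mean_proba_balanced = balanced.predict_proba(X)[:, 1].mean()

# Number of samples predicted as the rare class at the 0.5 threshold.
n_pred_plain = int(plain.predict(X).sum())
n_pred_balanced = int(balanced.predict(X).sum())

print(positive_rate, mean_proba_plain, mean_proba_balanced)
print(n_pred_plain, n_pred_balanced)
```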


See discussions in 
https://github.com/scikit-learn/scikit-learn/issues/10613 and linked issues.


--
Roman


Re: [scikit-learn] Issue in BIRCH clustering algo

2021-02-11 Thread Roman Yurchak
It's a known issue, see 
https://github.com/scikit-learn/scikit-learn/issues/17966

Someone would need to investigate more to find a fix though.
If you have a minimal reproducible example that's different from the one 
in that issue, and could post it there it would help.


Roman

On 11/02/2021 17:29, Farzana Anowar wrote:

Hello everyone,

I was trying to run the BIRCH clustering algorithm. However, after 
fitting the model I am facing the following error:


AttributeError: '_CFSubcluster' object has no attribute 'sq_norm_'

This error occurs only after fitting the model and I couldn't find any 
proper explanation of this. Could anyone give me any suggestions on 
that? It would be really helpful.


Here is my code:

from sklearn.cluster import Birch

# Creating the BIRCH clustering model
model = Birch(n_clusters = None)

# Fit the data (Training)
model.fit(df)

# Predict the same data
pred = model.predict(df)




Re: [scikit-learn] [ANN] scikit-learn 0.24.2 is online!

2021-04-28 Thread Roman Yurchak

Thanks for making this bug fix release happen!

--
Roman

On 28/04/2021 19:23, Guillaume Lemaître wrote:

scikit-learn 0.24.2 is out on pypi.org and conda-forge!

This is a small maintenance release that fixes a couple of regressions:

https://scikit-learn.org/stable/whats_new/v0.24.html#version-0-24-2 



You can upgrade with pip as usual:

pip install -U scikit-learn

The conda-forge builds will be available shortly, which you can then 
install using:


conda install -c conda-forge scikit-learn


Thanks again to all the contributors and let's work towards version 1.0!
On behalf of the scikit-learn maintainer team.
--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/ 



Re: [scikit-learn] sklearn-porter support

2021-05-04 Thread Roman Yurchak

Hi Joe,

sklearn-porter is a nice project, however people on this mailing list 
are not really involved with its development.


You would likely get more relevant answers to your questions by asking 
the author directly, for instance in a Github issue. I'm sure they would 
appreciate an offer to help with the maintenance.


Roman

On 04/05/2021 23:00, Joe Geisbauer wrote:

Hello,

I was wondering if the sklearn-porter repo was still being maintained. I went 
today to look when a known issue might be fixed and noticed no work had been 
performed in over a year on the repo? Is the work waiting on dev resources or 
are there other concerns keeping work from entering the repo? I would be happy 
to look into helping with the repo if the concerns are merely developer time.

Thank you,
Joe Geisbauer


Re: [scikit-learn] New core dev: Julien Jerphanion

2021-10-30 Thread Roman Yurchak

Congratulations, Julien, and thank you for all your work!

Roman

On 30/10/2021 11:18, Guillaume Lemaître wrote:
The scikit-learn core development team has welcomed a new member, Julien 
Jerphanion, who has contributed code, reviews, and documentation since 
this March (aside from occasional contributions in the past).


Congratulations and welcome Julien!

On behalf of the scikit-learn team
--
Guillaume Lemaitre
Scikit-learn @ Inria Foundation
https://glemaitre.github.io/ 




Re: [scikit-learn] What is the ECCN (Export Control Classification Number) number & COO (Country of Origin) of scikit-learn

2022-01-10 Thread Roman Yurchak

Hi Anup,

as far as I know scikit-learn is not export controlled. It has no
components that belong to that classification.
You can check this yourself through the lists provided, for instance, in
the links of this blog post https://www.magicsplat.com/blog/ear/index.html
to determine it.


Though I'm not a lawyer, and it's indeed an interesting legal question 
with scikit-learn (similarly to other open-source projects) being hosted 
at Github which does enforce those to some extent 
https://docs.github.com/en/github/site-policy/github-and-trade-controls


You could also contact Tidelift which should be able to provide this
information for the projects they sponsor, though maybe only for
companies who subscribe to it. For instance, see the discussion in
https://forum.tidelift.com/t/export-control-classification-number/312



Roman

On 10/01/2022 16:12, Anup Arun Yadav via scikit-learn wrote:

Hi Adrin,

We have a qualification process for all software / libraries we use,
in which we have to mention the ECCN and COO. This process is called Trade
and Customs Compliance (“TCC”), and in it we have to ensure that there is
no policy violation in the use of any library.


*Thanks & Regards,*

Anup Yadav.

   Fullstack developer – II -- PITC

   ===

   Schlumberger

   Schlumberger Tech. Centre Pvt. Ltd.

   4^th Floor – Building No 8.

   Survey No. 144/145, Samrat Ashok Path

   Commerzone, Yerawada, Pune

   Maharashtra, India – 411 006

*From:*Adrin [mailto:adrin.jal...@gmail.com]
*Sent:* 10 January 2022 20:35
*To:* Scikit-learn mailing list 
*Cc:* Anup Arun Yadav 
*Subject:* [Ext] Re: [scikit-learn] What is the ECCN (Export Control 
Classification Number) number & COO (Country of Origin) of scikit-learn


Hi,

What's exactly the reason you need this information?

Best,

Adrin

On Mon., Jan. 10, 2022, 14:51 Anup Arun Yadav via scikit-learn wrote:


Hey Team,

I’ve subscribed but don’t know from where to post, please send URL
also Please let me know what is the ECCN (Export Control
Classification Number) number & COO (Country of Origin) of scikit-learn.

*Thanks & Regards,*

Anup Yadav.

   Fullstack developer – II -- PITC

   ===

   Schlumberger

   Schlumberger Tech. Centre Pvt. Ltd.

   4^th Floor – Building No 8.

   Survey No. 144/145, Samrat Ashok Path

   Commerzone, Yerawada, Pune

   Maharashtra, India – 411 006



Re: [scikit-learn] Recurrent questions about speed for TfidfVectorizer

2018-11-26 Thread Roman Yurchak via scikit-learn
Hi Matthieu,

if you are interested in general questions regarding improving
scikit-learn performance, you might want to have a look at the draft
roadmap
https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 --
there are a lot of topics where suggestions / PRs on improving performance
would be very welcome.

For the particular case of TfidfVectorizer, it is a bit different from 
the rest of the scikit-learn code base in the sense that it's not 
limited by the performance of numerical calculation but rather that of 
string processing and counting. TfidfVectorizer is equivalent to 
CountVectorizer + TfidfTransformer and the latter has only a marginal
computational cost. As to CountVectorizer, last time I checked, its 
profiling was something along the lines of,
  - part regexp for tokenization (see token_pattern.findall)
  - part token counting (see CountVectorizer._count_vocab)
  - and a comparable part for all the rest
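
The equivalence mentioned above can be checked with a few lines. This is a minimal sketch with default parameters and a toy corpus made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# One step: raw text to tf-idf features.
X_direct = TfidfVectorizer().fit_transform(docs)

# Two steps: token counts first, then tf-idf weighting.
counts = CountVectorizer().fit_transform(docs)
X_two_steps = TfidfTransformer().fit_transform(counts)

same = np.allclose(X_direct.toarray(), X_two_steps.toarray())
print(same)
```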

Because of that, porting it to Cython is not that straightforward, as one is
still going to use CPython regexp and token counting in a dict. For
instance, HashingVectorizer implements token counting in Cython -- it's
faster but not that much faster. Using C++ maps or some less common
structures has been discussed in
https://github.com/scikit-learn/scikit-learn/issues/2639
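
To make the two hot spots concrete, here is a simplified pure-Python sketch of regexp tokenization followed by token counting in a dict. It only loosely mirrors what CountVectorizer does (the regexp is its default token_pattern); the count_tokens function is a hypothetical stand-in, not the actual implementation.

```python
import re
from collections import defaultdict

# Default token_pattern of CountVectorizer: words of 2+ characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def count_tokens(docs):
    """Return (vocabulary, rows); rows map feature index to count."""
    vocabulary = {}
    rows = []
    for doc in docs:
        counts = defaultdict(int)
        # Step 1: regexp tokenization.
        for token in TOKEN_PATTERN.findall(doc.lower()):
            # Step 2: vocabulary lookup and counting in Python dicts.
            index = vocabulary.setdefault(token, len(vocabulary))
            counts[index] += 1
        rows.append(dict(counts))
    return vocabulary, rows


vocabulary, rows = count_tokens(["The cat sat", "The cat and the dog"])
print(vocabulary)
print(rows)
```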

Currently, I think, there are ~3 main ways performance could be improved,
  1. Optimize the current implementation while remaining in Python. 
Possible but IMO would require some effort, because there is not much
low-hanging fruit left there. Though a new look would definitely be good.

  2. Parallelize computations. There was some earlier discussion about 
this in scikit-learn issues, but at present, the better way would 
probably be to add it in dask-ml (see 
https://github.com/dask/dask-ml/issues/5). HashingVectorizer is already 
supported. Someone would need to implement CountVectorizer.

  3. Rewrite part of the implementation in a lower level language (e.g. 
Cython). The question is how maintainable that would be, and whether the 
performance gains would be worth it.  Now that Python 2 will be dropped, 
at least not having to deal with Py2/3 compatibility for strings in 
Cython might make things a bit easier. Though, if the processing is in 
Cython it might also make using custom tokenizers/analyzers more difficult.

On a related topic, I have been experimenting with implementing part 
of this processing in Rust lately: 
https://github.com/rth/text-vectorize. So far it looks promising. 
Though, of course, it will remain a separate project because of language 
constraints in scikit-learn.

In general if you have thoughts on things that can be improved, don't
hesitate to open issues.
-- 
Roman


On 25/11/2018 10:59, Matthieu Brucher wrote:
> Hi all,
> 
> I've noticed a few questions online (mainly SO) on TfidfVectorizer 
> speed, and I was wondering about the global effort on speeding up sklearn.
> Is there something I can help on this topic (Cython?), as well as a 
> discussion on this tough subject?
> 
> Cheers,
> 
> Matthieu
> -- 
> Quantitative analyst, Ph.D.
> Blog: http://blog.audio-tk.com/
> LinkedIn: http://www.linkedin.com/in/matthieubrucher




Re: [scikit-learn] Recurrent questions about speed for TfidfVectorizer

2018-11-26 Thread Roman Yurchak via scikit-learn
Tries are interesting, but it appears that while they use less memory
than dicts/maps they are generally slower than dicts for a large number
of elements. See e.g. 
https://github.com/pytries/marisa-trie/blob/master/docs/benchmarks.rst. 
This is also consistent with the results in the below linked 
CountVectorizer PR that aimed to use tries, I think.

Though maybe e.g. MARISA-Trie (and generally trie libraries available in 
python) did improve significantly in 5 years since 
https://github.com/scikit-learn/scikit-learn/issues/2639 was done.

The thing is also that even HashingVectorizer, which doesn't need to handle
the vocabulary, is only moderately faster, so using a better data
structure for the vocabulary might give us its performance at best.
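
For context, a minimal sketch of the hashing trick that lets HashingVectorizer skip the vocabulary: each token is mapped directly to a column index by a hash function. Here zlib.crc32 is only a stand-in (scikit-learn uses MurmurHash3, plus a sign trick omitted here), and hashed_counts is a hypothetical helper.

```python
import re
import zlib
from collections import Counter

# Same default token pattern as CountVectorizer / HashingVectorizer.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def hashed_counts(doc, n_features=2 ** 20):
    """Count tokens per hashed column index, with no vocabulary dict."""
    counts = Counter()
    for token in TOKEN_PATTERN.findall(doc.lower()):
        # crc32 is a stand-in hash, chosen because it is in the stdlib.
        index = zlib.crc32(token.encode("utf-8")) % n_features
        counts[index] += 1
    return counts


row = hashed_counts("The cat and the dog")
print(row)
```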

-- 
Roman

On 26/11/2018 16:28, Andreas Mueller wrote:
> I think tries might be an interesting datastructure, but it really
> depends on where the bottleneck is.
> I'm really surprised they are not used more, but maybe that's just
> because implementations are missing?
> 
> On 11/26/18 8:39 AM, Roman Yurchak via scikit-learn wrote:
>> Hi Matthieu,
>>
>> if you are interested in general questions regarding improving
>> scikit-learn performance, you might be want to have a look at the draft
>> roadmap
>> https://github.com/scikit-learn/scikit-learn/wiki/Draft-Roadmap-2018 --
>> there is a lot topics where suggestions / PRs on improving performance
>> would be very welcome.
>>
>> For the particular case of TfidfVectorizer, it is a bit different from
>> the rest of the scikit-learn code base in the sense that it's not
>> limited by the performance of numerical calculation but rather that of
>> string processing and counting. TfidfVectorizer is equivalent to
>> CountVectorizer + TfidfTransformer and the later  has only a marginal
>> computational cost. As to CountVectorizer, last time I checked, its
>> profiling was something along the lines of,
>> - part regexp for tokenization (see token_pattern.findall)
>> - part token counting (see CountVectorizer._count_vocab)
>> - and a comparable part for all the rest
>>
>> Because of that, porting it to Cython is not that immediate, as one is
>> still going to use CPython regexp and token counting in a dict. For
>> instance, HashingVectorizer implements token counting in Cython -- it's
>> faster but not that much faster. Using C++ maps or some less common
>> structures has been discussed in
>> https://github.com/scikit-learn/scikit-learn/issues/2639
>>
>> Currently, I think, there are ~3 main ways performance could be improved,
>> 1. Optimize the current implementation while remaining in Python.
>> Possible but IMO would require some effort, because there is not much
>> low-hanging fruit left there. Though a fresh look would definitely be good.
>>
>> 2. Parallelize computations. There was some earlier discussion about
>> this in scikit-learn issues, but at present, the better way would
>> probably be to add it in dask-ml (see
>> https://github.com/dask/dask-ml/issues/5). HashingVectorizer is already
>> supported. Someone would need to implement CountVectorizer.
>>
>> 3. Rewrite part of the implementation in a lower level language (e.g.
>> Cython). The question is how maintainable that would be, and whether the
>> performance gains would be worth it.  Now that Python 2 will be dropped,
>> at least not having to deal with Py2/3 compatibility for strings in
>> Cython might make things a bit easier. Though, if the processing is in
>> Cython it might also make using custom tokenizers/analyzers more difficult.
>>
>>   On a related topic, I have been experimenting with implementing part
>> of this processing in Rust lately:
>> https://github.com/rth/text-vectorize. So far it looks promising.
>> Though, of course, it will remain a separate project because of language
>> constraints in scikit-learn.
>>
>> In general if you have thoughts on things that can be improved, don't
>> hesitate to open issues,
> 
> ___
> scikit-learn mailing list
> scikit-learn@python.org
> https://mail.python.org/mailman/listinfo/scikit-learn
> 
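
The equivalence mentioned in the quoted message (TfidfVectorizer being 
CountVectorizer followed by TfidfTransformer) can be checked with a short 
sketch. The corpus is made up for illustration and default parameters are 
assumed on both sides; with non-default options the same arguments would 
have to be passed to the matching step.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import make_pipeline

# Toy corpus, purely illustrative.
docs = [
    "string processing and counting",
    "counting tokens in a dict",
    "regexp tokenization of a string",
]

# One step: tokenize, count and apply tf-idf weighting.
X_direct = TfidfVectorizer().fit_transform(docs)

# Two steps: raw counts first, then tf-idf weighting of the count matrix.
two_step = make_pipeline(CountVectorizer(), TfidfTransformer())
X_two_step = two_step.fit_transform(docs)

# Both routes give the same sparse matrix (compared densely here).
same = np.allclose(X_direct.toarray(), X_two_step.toarray())
print(same)
```

Since the second step only reweights an existing sparse matrix, the 
string processing and counting all happens in the CountVectorizer step.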




Re: [scikit-learn] Next Sprint

2018-12-22 Thread Roman Yurchak via scikit-learn
That works for me as well.

On 21/12/2018 16:00, Olivier Grisel wrote:
> Ok for me. The last 3 weeks of February are fine for me.
> 
> On Thu, 20 Dec 2018 at 21:21, Alexandre Gramfort
> <alexandre.gramf...@inria.fr> wrote:
> 
> ok for me
> 
> Alex
> 
> On Thu, Dec 20, 2018 at 8:35 PM Adrin wrote:
>  >
>  > It'll be the least favourable week of February for me, but I can
> make do.
>  >
>  > On Thu, 20 Dec 2018 at 18:45 Andreas Mueller wrote:
>  >>
>  >> Works for me!
>  >>
>  >> On 12/19/18 5:33 PM, Gael Varoquaux wrote:
>  >> > I would propose  the week of Feb 25th, as I heard people say
> that they
>  >> > might be available at this time. It is good for many people,
> or should we
>  >> > organize a doodle?
>  >> >
>  >> > G
>  >> >
>  >> > On Wed, Dec 19, 2018 at 05:27:21PM -0500, Andreas Mueller wrote:
>  >> >> Can we please nail down dates for a sprint?
>  >> >> On 11/20/18 2:25 PM, Gael Varoquaux wrote:
>  >> >>> On Tue, Nov 20, 2018 at 08:15:07PM +0100, Olivier Grisel wrote:
>  >>  We can also do Paris in April / May or June if that's ok
> with Joel and better
>  >>  for Andreas.
>  >> >>> Absolutely.
>  >> >>> My thoughts here are that I want to minimize transportation,
> partly
>  >> >>> because flying has a large carbon footprint. Also, for
> personal reasons,
>  >> >>> I am not sure that I will be able to make it to Austin in
> July, but I
>  >> >>> realize that this is a pretty bad argument.
>  >> >>> We're happy to try to host in Paris whenever it's most
> convenient and to
>  >> >>> try to help with travel for those not in Paris.
>  >> >>> Gaël
> 




Re: [scikit-learn] VOTE: scikit-learn governance document

2019-02-11 Thread Roman Yurchak via scikit-learn
+1 as well

Roman

On 11/02/2019 09:47, Gael Varoquaux wrote:
> +1 on my side too.
> 
> Thanks a lot Andy for moving this forward.
> 
> Gaël
> 
> On Mon, Feb 11, 2019 at 07:53:51AM +, Vlad Niculae wrote:
>> +1
> 
>> Thank you for the effort to formalize this!
> 
>> Best,
>> Vlad
> 
>> On Mon, Feb 11, 2019, 02:47 Noel Dawe  
>>  Hi Andy,
> 
>>  +1 from me as well :)
> 
>>  On Sun, Feb 10, 2019 at 8:54 PM Jacob Schreiber 
>> 
>>  wrote:
> 
>>  +1 from me as well. Thanks for putting in the time to write this all
>>  out.
> 
>>  On Sun, Feb 10, 2019 at 4:54 PM Hanmin Qin 
>>  wrote:
> 
>>  +1 (personally I still think it's better to keep the flow 
>> chart, it
>>  seems useful for beginners)
> 
>>  Hanmin Qin
> 
>>  - Original Message -
>>  From: Alexandre Gramfort 
>>  To: Scikit-learn mailing list 
>>  Subject: Re: [scikit-learn] VOTE: scikit-learn governance 
>> document
>>  Date: 2019-02-11 01:29
> 
>>  +1 for me too
> 
>>  Alex
> 
> 
>>  On Sat, Feb 9, 2019 at 10:06 PM Gilles Louppe 
>> 
>>  wrote:
> 
>>  Hi Andy,
> 
>>  I read through the document. Even though I have not been 
>> really
>>  active
>>  these past months/years, I think it summarizes well our
>>  governance
>>  model.
> 
>>  +1.
> 
>>  Gilles
> 
>>  On Sat, 9 Feb 2019 at 12:01, Adrin 
>>  wrote:
> 
>>  > +1
> 
>>  > Thanks for the work you've put in it!
> 
>>  > On Sat, Feb 9, 2019, 03:00 Andreas Mueller 
>> >  wrote:
> 
>>  >> Hey all.
> 
>>  >> I want to call a vote on the final version on the
>>  scikit-learn
>>  >> governance document, which can be found in this PR:
> 
>>  >> https://github.com/scikit-learn/scikit-learn/pull/12878
> 
>>  >> It underwent some significant changes in the last couple 
>> of
>>  weeks.
> 
>>  >> The two-sentence summary is: conflicts are resolved by 
>> vote
>>  among core
>>  >> devs, with a technical committee resolving anything that 
>> can
>>  not be
>>  >> decided by at least a 2/3 majority. The initial technical
>>  committee is
>>  >> Alexandre Gramfort, Olivier Grisel, Joel Nothman, Hanmin
>>  Qin, Gaël
>>  >> Varoquaux and myself (Andreas Müller).
> 
>>  >> I would ask all of the *core developers* to either vote 
>> +1
>>  for the
>>  >> governance doc, -1 against it, or to explicitly abstain 
>> here
>>  on the
>>  >> public mailing list (which is the way any vote will be
>>  conducted
>>  >> according to the new governance document).
> 
>>  >> I suggest we leave the vote open for two weeks, so that 
>> the
>>  decision is
>>  >> made before the sprint and we can take actions.
> 
>>  >> Anyone can still comment on the PR or here, though I 
>> would
>>  rather not
>>  >> make more changes as this has already been discussed to 
>> some
>>  length.
> 
>>  >> Thank you for participating,
> 
>>  >> Andy
> 

Re: [scikit-learn] Sprint discussion points?

2019-02-19 Thread Roman Yurchak via scikit-learn
Thanks for putting the draft schedule together!

Personally I will be there 3 days out of 5 and wouldn't want to miss the 
discussion on euclidean distance issues. Maybe we could adjust the 
schedule during the sprint (say on Tuesday) based on people's interest 
and availability? That might be easier than trying to figure it out for 
29 participants over email..

Also IMO it would make sense to have some discussions (that are not 
that controversial or about high level API but still useful) earlier 
during the week to be able to work on them during the sprint.

-- 
Roman

On 20/02/2019 02:30, Joel Nothman wrote:
> I don't think I'll be able to stay for the Friday 10am discussion, but 
> have a PR open on "efficient grid search" so should probably be involved.
> 
> Perhaps the fit_transform discussion can happen without you, Andy?
> 
> On Wed, 20 Feb 2019 at 10:17, Andreas Mueller wrote:
> 
> I put a draft schedule here:
> 
> https://github.com/scikit-learn/scikit-learn/wiki/Upcoming-events#technical-discussions-schedule
> 
> it's obviously somewhat opinionated ;)
> Happy to reprioritize.
> Basically I wouldn't like to miss any of the big API discussions
> because of coming late to the party.
> 
> The two things on (grid?) searches are somewhat related: one is
> about specifying search-spaces, the other about executing a given
> search space efficiently. They probably warrant separate discussions.
> 
> I haven't added plotting or sample props on it, which are maybe two
> other discussion points.
> I tried to cover most controversial things from the roadmap.
> 
> Not sure if discussing the schedule via the mailing list is the best
> way? Don't want to create even more traffic than I already am ;)
> 
> On 2/19/19 5:48 PM, Guillaume Lemaître wrote:
>> > Not sure if Guillaume had ideas about the schedule, given that
>> he seems to be running the show?
>>
>> Mostly running behind the show ...
>>
>> For the moment, we only have a 30-minute introductory
>> presentation planned on Monday.
>> For the rest of the week, this is pretty much open and I think
>> that we can propose a schedule such that we can be efficient.
>> IMO, two one-hour meetings per day look good to me.
>>
>> Shall we prioritize the list of issues? Maybe some issues
>> could be grouped together.
>> I would not be against having a rough schedule on the wiki
>> directly and I think that having it before Monday could be better.
>>
>> Let me know how I can help.
>>
>> On Tue, 19 Feb 2019 at 22:23, Andreas Mueller wrote:
>>
>> Yeah, sounds good.
>> I didn't want to unilaterally post a schedule, but doing some
>> google form or similar seems a bit heavy-handed?
>> Not sure if Guillaume had ideas about the schedule, given that
>> he seems to be running the show?
>>
>> On 2/19/19 4:17 PM, Joel Nothman wrote:
>>> I don't think optics requires a large meeting, just a few
>>> people.
>>>
>>> I'm happy with your proposal generally, Andy. Do we schedule
>>> specific topics at this point?
>>>
>>
>>
>>
>> -- 
>> Guillaume Lemaitre
>> INRIA Saclay - Parietal team
>> Center for Data Science Paris-Saclay
>> https://glemaitre.github.io/
>>
> 




Re: [scikit-learn] API Discussion: Where shall we put the plotting functions?

2019-04-03 Thread Roman Yurchak via scikit-learn
+1 for option 1 and +0.5 for 3. Do we anticipate that many plotting 
functions will be added? If it's just a dozen or less, putting them all 
into a single namespace sklearn.plot might be easier.

This also would avoid discussion about where to put some generic 
plotting functions (e.g. 
https://github.com/scikit-learn/scikit-learn/issues/13448#issuecomment-478341479).

Roman

On 03/04/2019 12:06, Trevor Stephens wrote:
> I think #1 if any of these... Plotting functions should hopefully be as 
> general as possible, so tagging with a specific type of estimator will, 
> in some scikit-learn utopia, be unnecessary.
> 
> If a general plotter is built, where does it live in other 
> estimator-specific namespace options? Feels awkward to put it under 
> every estimator's namespace.
> 
> Then again, there might be a #4 where there is no plot module and 
> plotting classes live under groups of utilities like introspection, 
> cross-validation or something?...
> 
> On Wed, Apr 3, 2019 at 8:54 PM Andrew Howe wrote:
> 
> My preference would be for (1). I don't think the sub-namespace in
> (2) is necessary, and don't like (3), as I would prefer the plotting
> functions to be all in the same namespace sklearn.plot.
> 
> Andrew
> 
> <~~~>
> J. Andrew Howe, PhD
> I live to learn, so I can learn to live. - me
> <~~~>
> 
> 
> On Tue, Apr 2, 2019 at 3:40 PM Hanmin Qin wrote:
> 
> See https://github.com/scikit-learn/scikit-learn/issues/13448
> 
> We've introduced several plotting functions (e.g., plot_tree and
> plot_partial_dependence) and will introduce more (e.g.,
> plot_decision_boundary) in the future. Consequently, we need to
> decide where to put these functions. Currently, there're 3
> proposals:
> 
> (1) sklearn.plot.plot_YYY (e.g., sklearn.plot.plot_tree)
> 
> (2) sklearn.plot.XXX.plot_YYY (e.g., sklearn.plot.tree.plot_tree)
> 
> (3) sklearn.XXX.plot.plot_YYY (e.g.,
> sklearn.tree.plot.plot_tree, note that we won't support from
> sklearn.XXX import plot_YYY)
> 
> Joel Nothman, Gael Varoquaux and I decided to post it on the
> mailing list to invite opinions.
> 
> Thanks
> 
> Hanmin Qin
> 




Re: [scikit-learn] Starting to contribute

2019-04-07 Thread Roman Yurchak via scikit-learn
Hello Heitor,

yes, you can choose an issue, comment there that you plan to work on it 
(to avoid redundant work by other contributors) and if no one objects 
make a PR. If you have any questions you can ask them by commenting on 
that issue (for specific questions) or on the scikit-learn Gitter 
https://gitter.im/scikit-learn/scikit-learn (for general questions about 
how to contribute).

See https://scikit-learn.org/stable/developers/contributing.html for 
more information.

Roman

On 06/04/2019 19:07, Heitor Boschirolli wrote:
> Hello!
> 
> First of all, I apologize if this email is not for such questions, but 
> I never contributed to open source code before and I'm not sure how to 
> proceed, could someone help me with that?
> 
> Should I just pick an issue, solve it following the guidelines described 
> in the website and open a PR?
> If I have any trouble, can I post it on the mailing list?
> 
> Att, Heitor Boschirolli




Re: [scikit-learn] Normalization in ridge regression when there is no intercept

2019-06-07 Thread Roman Yurchak via scikit-learn
On 06/06/2019 14:56, ahmetcik wrote:
> I have just recognized that when using ridge regression without an
> intercept no normalization is performed even if the argument "normalize"
> is set to True.

It's a known, longstanding issue: 
https://github.com/scikit-learn/scikit-learn/issues/3020. It would 
indeed be good to find a solution.
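
In the meantime, the scaling can be done explicitly before fitting. The 
following is only a sketch on made-up data, not the fix discussed in that 
issue: it assumes that rescaling each column to unit l2 norm (without 
centering, since there is no intercept) is the behaviour one wants, and 
the alpha value is arbitrary.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import normalize

# Synthetic data with features on very different scales (illustrative only).
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3)) * [1.0, 10.0, 100.0]
y = X @ [1.0, 0.1, 0.01] + rng.normal(scale=0.1, size=50)

# axis=0 rescales each column (feature) to unit l2 norm;
# return_norm=True also returns the norms that were divided out.
X_scaled, col_norms = normalize(X, norm="l2", axis=0, return_norm=True)

# Ridge without intercept, fitted on the explicitly scaled features.
model = Ridge(alpha=1.0, fit_intercept=False).fit(X_scaled, y)

# Map the coefficients back to the units of the original features.
coef_original_units = model.coef_ / col_norms
print(coef_original_units)
```

Doing the scaling by hand makes the penalty act on comparable columns 
regardless of what the estimator does with its own options.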

-- 
Roman



Re: [scikit-learn] titanic dataset, use for book

2019-06-27 Thread Roman Yurchak via scikit-learn
Meanwhile, loading the CSV from OpenML (https://www.openml.org/d/40945) 
would also work:

import pandas as pd
df = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')

-- 
Roman

On 25/06/2019 17:04, Andreas Mueller wrote:
> By the time your book comes out, it's likely to be merged, but might not 
> be released, depending on your timeline.
> It might be easier for you to upload the CSV file to a repository you 
> control yourself.
