Re: [Scikit-learn-general] Pickling custom Transformers in a Pipeline

2016-04-05 Thread Fred Mailhot
dreas Mueller wrote: > What's the type of self.custom? > > Also, you can step into the debugger to see which function it is that can > not be pickled. > > > > > On 04/05/2016 04:14 PM, Fred Mailhot wrote: > > Hi all, > > I've got a pipeline with some

[Scikit-learn-general] Pickling custom Transformers in a Pipeline

2016-04-05 Thread Fred Mailhot
Hi all, I've got a pipeline with some custom transformers that's not pickling, and I'm not sure why. I've had this previously when using custom preprocessors & tokenizers with CountVectorizers. I dealt with it then by defining the custom bits at the module level. I assumed I could avoid that by c
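
A minimal sketch of the kind of pickling failure described above, using only the standard library. It assumes the custom piece is a plain callable; `lowercase_lambda` is a hypothetical name, and `str.lower` merely stands in for a preprocessor defined at module level. The thread's actual transformer code is not shown in this snippet, so this is an illustration of the general mechanism, not a diagnosis.

```python
import pickle

# A lambda has no importable name, so pickle cannot store a reference to it.
lowercase_lambda = lambda text: text.lower()  # hypothetical custom preprocessor

try:
    pickle.dumps(lowercase_lambda)
    lambda_pickles = True
except (pickle.PicklingError, AttributeError, TypeError):
    lambda_pickles = False

# A callable that pickle can look up by name round-trips without trouble.
# str.lower stands in here for a function defined at module level.
restored = pickle.loads(pickle.dumps(str.lower))
```

Pickle stores functions by reference (module plus name), which is likely why moving custom callables to module level makes them picklable.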

Re: [Scikit-learn-general] Announcing lightning v0.1

2016-03-25 Thread Fred Mailhot
I imagine a lot of people might be interested in this, but be in a position where they need to justify bringing in a new package that mimics sklearn, rather than just using the linear models that are already available there. Could you say a bit more about how/why this is better? Thanks! Fred. On M

Re: [Scikit-learn-general] Subclassing vectorizers

2016-03-23 Thread Fred Mailhot
e > estimator (with any fitted model discarded) in constructing ensembles, > cross validation, etc. While none of the scikit-learn library of estimators > do this, in practice you can overload get_params to define your own > parameter listing. See > http://scikit-learn.org/stable/devel

[Scikit-learn-general] Subclassing vectorizers

2016-03-22 Thread Fred Mailhot
Hello list, Firstly, thanks for this incredible package; I use it daily at work. Now on to the meat: I'm trying to subclass TfidfVectorizer and running into issues. I want to specify an extra param for __init__() that points to a file that gets used in build_analyzer(). Skipping irrelevant bits, I

Re: [Scikit-learn-general] [TfidfVectorizer problem]

2015-11-19 Thread Fred Mailhot
t, only space between >> terms. >> >> Best, >> Ehsan >> >> >> On Thu, Nov 19, 2015 at 11:13 AM, Fred Mailhot >> wrote: >> >>> Have you checked that your other program tokenizes the same way as the >>> default sklearn tokeniza

Re: [Scikit-learn-general] [TfidfVectorizer problem]

2015-11-19 Thread Fred Mailhot
Have you checked that your other program tokenizes the same way as the default sklearn tokenization? On 19 November 2015 at 11:09, Ehsan Asgari wrote: > Hi, > > Thank you, but it didn't work. > I checked len(tf.vocabulary_) and it is also 1900 instead of 1914. > I have another program that cou
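
One possible source of such a vocabulary mismatch, sketched with the standard library only. The regular expression below is the default `token_pattern` of scikit-learn's text vectorizers, which keeps only tokens of two or more word characters; the sample `text` is a made-up input, and whether this explains the 1900 vs. 1914 difference in the thread is an assumption.

```python
import re

# Default token_pattern of scikit-learn's text vectorizers:
# tokens of two or more word characters.
DEFAULT_TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

text = "a b12 c x-ray it is"  # hypothetical input

# Tokenizing on whitespace keeps single characters and hyphenated terms.
whitespace_tokens = text.split()

# The default pattern drops one-character tokens and splits on the hyphen.
pattern_tokens = DEFAULT_TOKEN_PATTERN.findall(text)

whitespace_vocab = set(whitespace_tokens)
pattern_vocab = set(pattern_tokens)
```

Comparing the two vocabularies on a small sample like this is a quick way to check whether two programs tokenize the same way.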

Re: [Scikit-learn-general] Library of pre-trained models

2015-07-01 Thread Fred Mailhot
://www.google.com/patents/US9037464 > Filed on 15 March 2013 > > On Thu, Jul 2, 2015 at 4:03 AM, Matthieu Brucher < > matthieu.bruc...@gmail.com> wrote: > >> 2015-07-01 19:43 GMT+01:00 Andreas Mueller : >> > >> > >> > On 07/01/2015 02:42 PM, Lars B

Re: [Scikit-learn-general] Library of pre-trained models

2015-07-01 Thread Fred Mailhot
actually an answer to my question. FM. On 1 July 2015 at 11:42, Lars Buitinck wrote: > 2015-07-01 16:27 GMT+02:00 Fred Mailhot : > > 2) The gensim implementation predates the patenting > > Does that matter? > > >

Re: [Scikit-learn-general] Library of pre-trained models

2015-07-01 Thread Fred Mailhot
1) The upshot seems to be that it's a defensive patent, and in any case the code was released under Apache 2.0, so it's fine to use. https://code.google.com/p/word2vec/ https://groups.google.com/forum/#!topic/word2vec-toolkit/1hID9F74_Ho 2) The gensim implementation predates the patenting (thanks

Re: [Scikit-learn-general] Library of pre-trained models

2015-06-30 Thread Fred Mailhot
Tangent: Are we even allowed to use word2vec anymore, now that Goog has patented it? (in any case, I'll be looking a bit more closely at GloVe) F. On 30 June 2015 at 19:26, Mathieu Blondel wrote: > For unsupervised models that take a long time to train, such as deep > learning or word2vec based

Re: [Scikit-learn-general] issue with custom regressor in the pipeline

2015-05-19 Thread Fred Mailhot
Parenthesis error in the estimators list? estimators = [('my_regressor', myRegressor(blahblah)), ...] On 19 May 2015 at 15:47, Pagliari, Roberto wrote: > I'm trying to add a custom regressor to a pipeline. > For debugging purposes I commented everything out. > > class m

[Scikit-learn-general] Grid searching over FeatureUnion.transformer_weights

2015-05-19 Thread Fred Mailhot
Hi all, It appears that FeatureUnion.transformer_weights isn't exposed by the get_params() method, which in turn means that it isn't grid-searchable, which seems unfortunate to me (I've had cause to do so manually recently, and wished it could be automated). Is this something that other people ar

Re: [Scikit-learn-general] Integrating HashingVectorizer into Pipeline

2015-05-07 Thread Fred Mailhot
I think possibly you want the TfidfTransformer, *before* the HashingVectorizer...BUT...the documentation for the HashingVectorizer appears to discount the possibility of IDF-weighting: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html On 7 Ma

Re: [Scikit-learn-general] Re : Pull Request : Renyi entropy and Cauchy-Schwartz mutual information

2015-02-23 Thread Fred Mailhot
A good MI-based feature selector would be welcome, I think. Well, by me, anyway. On 23 February 2015 at 09:37, Andy wrote: > Hi Cecilia. > An MI estimate currently seems a bit out of scope of sklearn. > What context would a user apply it in? > Sklearn currently contains more out-of-the-box meth

Re: [Scikit-learn-general] NIPS

2014-11-18 Thread Fred Mailhot
I'm going to be at the ML+NLP workshop. On 18 November 2014 07:32, Mathieu Blondel wrote: > Hi, > > Anyone from the mailing-list going to NIPS this year? > > See you there, > Mathieu

Re: [Scikit-learn-general] Sensitivity analysis

2014-01-23 Thread Fred Mailhot
Is your aim to use this information for feature selection, or do you actually want to see which features are being maximally weighted? There's a SO question that addresses the latter use: http://stackoverflow.com/questions/6697/how-to-get-most-informative-features-for-scikit-learn-classifiers

Re: [Scikit-learn-general] K Nearest Neighbour with 3d array and custom distance metric

2014-01-10 Thread Fred Mailhot
There are a few implementations of DTW in Cython floating around...I think mblondel has one. Maybe you could tweak one of these and see whether it yields a useful speed-up? https://github.com/SnippyHolloW/DTW_Cython http://www.mblondel.org/journal/2009/08/31/dynamic-time-warping-theory/ https://gi

Re: [Scikit-learn-general] Save trained classifier

2013-12-19 Thread Fred Mailhot
On 19 December 2013 15:16, Olivier Grisel wrote: > [...] > But on the other hand that makes it possible to [...] to memory map the > large parameter > arrays by passing mmap_mode='r' to joblib.load for instance. > > Memory mapping can be useful to share the memory of models loaded in > several py

Re: [Scikit-learn-general] Feature Filtering

2013-10-15 Thread Fred Mailhot
Use the same DictVectorizer that you called fit_transform() on with the training data, but just call transform() for the test data...

    dv = DictVectorizer()
    train_feats = dv.fit_transform(train_feature_dict)
    test_feats = dv.transform(test_feature_dict)

On 15 October 2013 03:52, Lars Buitinck w

Re: [Scikit-learn-general] HMM with von Mises Emmissions

2013-10-14 Thread Fred Mailhot
On 14 October 2013 20:48, Robert McGibbon wrote: [...] > > p.s. core devs: pretty please don't remove the HMM code from the scikit :) > +1E6

[Scikit-learn-general] EMNLP?

2013-09-25 Thread Fred Mailhot
Hi list, Just wondering whether anyone on here is planning on attending EMNLP. I'll be there, and as a heavy user (and hopeful eventual contributor), I'd love to meet with some of you. Fred.

Re: [Scikit-learn-general] Representing classifiers outside of Python

2013-09-23 Thread Fred Mailhot
FYI, I've used sklearn's LogisticRegression in an online/real-time text classification app without having to dig into the internals and gotten ~2.5ms response time (including vectorizing; vocab size ~200k). On 23 September 2013 06:37, Peter Prettenhofer wrote: > We don't have a PMML interface y

[Scikit-learn-general] Vectorization/tokenization question...

2013-07-19 Thread Fred Mailhot
Hello list... I'm a huge fan of sklearn and use it daily at work. I was confused by the results of some recent text classification experiments and started looking more closely at the vectorization code. I'm wondering about the logic behind: 1) not doing stopword removal for the char_wb analyzer

Re: [Scikit-learn-general] Vectorization/tokenization question...

2013-07-19 Thread Fred Mailhot
Oh, right (duh)...I wasn't thinking clearly about the padding for char_wb. I'll do some tests with stopword removal for char_wb and submit a PR if it looks worthwhile. Cheers, Fred. On 19 July 2013 13:27, Olivier Grisel wrote: > 2013/7/19 Fred Mailhot : > > Hello
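
A simplified sketch of the padding mentioned above, assuming the `char_wb` analyzer pads each whitespace-separated word with a space on both sides before extracting character n-grams. `char_wb_ngrams` is a hypothetical stand-in written for illustration, not scikit-learn's implementation, and it handles a single n-gram size only.

```python
def char_wb_ngrams(text, n):
    """Character n-grams taken inside word boundaries, each word padded with spaces."""
    ngrams = []
    for word in text.split():
        padded = " " + word + " "
        if len(padded) < n:
            # Word too short for a full n-gram: emit the padded word once.
            ngrams.append(padded)
            continue
        # Slide a window of width n across the padded word.
        for start in range(len(padded) - n + 1):
            ngrams.append(padded[start:start + n])
    return ngrams


grams = char_wb_ngrams("to be", 3)
```

Under this assumption the features are fragments such as `" to"` and `"to "` rather than whole words, so a word-level stop list would have to be applied before the n-grams are built.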

Re: [Scikit-learn-general] Text processing using nltk, sklearn and pandas

2013-07-12 Thread Fred Mailhot
On 12 July 2013 09:48, Lars Buitinck wrote: > 2013/7/11 Tom Fawcett : > [...] > > I guess because it's terribly slow. I recently tried to cluster a > sample of Wikipedia text at the word level. What kind of results did you get? I did some work recently clustering short-form text and was general

Re: [Scikit-learn-general] Sklearn book?

2013-02-11 Thread Fred Mailhot
riting a book would >> probably mean quitting jobs >> for a couple of month, stalling research and basically not making any >> money (From what I read, writing an O'Reilly book >> pays less than any research position). >> >> So I don't see that happeni

[Scikit-learn-general] Sklearn book?

2013-02-11 Thread Fred Mailhot
Hi list, Is anyone working on a book showcasing scikit-learn? I'm thinking something along the lines of "Mahout In Action", that would showcase each of the parts of scikit-learn and provide a dead-tree reference with a lot of worked-out examples. I suppose it would make sense to wait for a 1.0 rel

Re: [Scikit-learn-general] Error when chosing large number of clusters

2013-02-01 Thread Fred Mailhot
I just had the same issue recently. It's been fixed in the dev (0.14) branch. If you pull/build/install that, everything should be fine. F. On 1 February 2013 13:40, Vinay B, wrote: > From the scikit script at > http://scikit-learn.org/dev/_downloads/document_clustering.py , it > appears as t

Re: [Scikit-learn-general] Text document clustering: How can I access the actual clustered documents

2013-01-31 Thread Fred Mailhot
Given a fitted KMeans named "km", and a numpy array of documents, to get a list of documents associated with cluster i: documents[np.where(km.labels_ == i)] Not sure what you mean by "a list of cluster terms", though (a list of all terms from all docs associated with a given cluster?)... On 31
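
The indexing expression above relies on `documents` being a numpy array. If the documents are held in a plain Python list, the same selection can be sketched as follows; `documents_in_cluster` is a hypothetical helper, and `labels` stands in for `km.labels_` from a fitted KMeans.

```python
# Hypothetical data: `labels` stands in for km.labels_ from a fitted KMeans.
documents = ["doc a", "doc b", "doc c", "doc d"]
labels = [0, 1, 0, 2]


def documents_in_cluster(documents, labels, i):
    """Return the documents whose cluster label equals i."""
    return [doc for doc, label in zip(documents, labels) if label == i]


cluster_0 = documents_in_cluster(documents, labels, 0)
```

This assumes the labels are in the same order as the documents that were passed to the clusterer.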

Re: [Scikit-learn-general] GridSearch example

2012-11-16 Thread Fred Mailhot
On 15 November 2012 23:20, Andreas Mueller wrote: > [...] > You can give GridSearchCV not only a grid but also a list of grids. > I would go with that. > (is that sufficiently documented?) > This doesn't appear to be documented (at least not at http://scikit-learn.org/dev/modules/generated/sklearn
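
A sketch of what "a list of grids" amounts to: each dict is expanded on its own, and the resulting candidates are concatenated, so parameters that only make sense together never get crossed with unrelated ones. `expand_grids` is a hypothetical helper written with the standard library for illustration, not scikit-learn's code, and the parameter names and values are made-up examples.

```python
from itertools import product


def expand_grids(param_grids):
    """Expand a list of parameter grids into a flat list of candidate settings."""
    candidates = []
    for grid in param_grids:
        names = sorted(grid)
        # Cartesian product within one grid only; grids are never mixed.
        for values in product(*(grid[name] for name in names)):
            candidates.append(dict(zip(names, values)))
    return candidates


# Hypothetical grids: gamma is only searched together with the rbf kernel.
param_grids = [
    {"kernel": ["linear"], "C": [1, 10]},
    {"kernel": ["rbf"], "C": [1, 10], "gamma": [0.001, 0.0001]},
]
candidates = expand_grids(param_grids)
```

A list in this shape is what would be passed as `param_grid` to GridSearchCV; here it yields 2 + 4 = 6 candidates instead of the 8 that a single merged grid would produce.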

Re: [Scikit-learn-general] GridSearch example

2012-11-16 Thread Fred Mailhot
arning with Scikit? I have a data set that is > > 20gb that I want to train on I don't think I can do that easily, so > what should I do? > > Thanks, > Shomiron Ghose > > > On 15 November 2012 15:45, Fred Mailhot wrote: > >> Dear list, >> >&

Re: [Scikit-learn-general] GridSearch example

2012-11-16 Thread Fred Mailhot
Thanks to all for the tips on GridSearch with FeatureUnion, I'll be trying those out today. And @amueller I've been following the development of your PR for the random sampling of param space with great interest. But back to the initial problem...it seems that an empty input is the cause. My raw d

Re: [Scikit-learn-general] GridSearch example

2012-11-15 Thread Fred Mailhot
the error is related to n_jobs, not a specific classifier? > Could you run with n_jobs=1 and a very small training set (like 100 > examples or something) > and see if it runs through? > (Actually I'm totally clueless but that doesn't look like a > multiprocessing error to me

Re: [Scikit-learn-general] GridSearch example

2012-11-15 Thread Fred Mailhot
sr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 159.1.0) Thanks, Fred. On 15 November 2012 12:56, Andreas Mueller wrote: > Hi Fred. > The link is dead for me. > Do you link against Accelerate (not sure if this is relevant)? > > Cheers, > Andy > &

[Scikit-learn-general] GridSearch example

2012-11-15 Thread Fred Mailhot
Dear list, I'm using GridSearchCV to do some simple model selection for a text classification task. I've got it working (see below for caveat), but I'm not convinced that I'm making the best use of this tool. If someone has the time/inclination, I'd love a set of eyes to check the following gist t

Re: [Scikit-learn-general] random forest question

2012-10-26 Thread Fred Mailhot
On 26 October 2012 16:58, Richard T. Guy wrote: > Hey Scikit-Learn, > > I've been working on some changes to the RandomForest code and I had a > few questions. > > First, it looks like the function > def _partition_features(forest, n_total_features): > partitions features evenly across cores. Am
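
For readers unfamiliar with the idea, an even partition of features across jobs can be sketched as below. `partition_evenly` is a hypothetical function written for illustration; it is not the `_partition_features` code referred to in the thread, and it assumes at least one item and one job.

```python
def partition_evenly(n_items, n_jobs):
    """Split n_items into n_jobs contiguous chunks whose sizes differ by at most one."""
    n_jobs = min(n_jobs, n_items)  # never create empty chunks
    base, extra = divmod(n_items, n_jobs)
    # The first `extra` jobs each take one additional item.
    counts = [base + 1 if job < extra else base for job in range(n_jobs)]
    # Running total gives the start offset of each chunk, plus the end offset.
    starts = [0]
    for count in counts:
        starts.append(starts[-1] + count)
    return counts, starts


counts, starts = partition_evenly(10, 3)
```

Each job `j` would then work on the index range `starts[j]` to `starts[j + 1]`.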

[Scikit-learn-general] LogisticRegression to initiate SGDClassifier

2012-07-25 Thread Fred Mailhot
Hi all, I've got a text classification problem on which LogisticRegression consistently outperforms SGDClassifier(loss="log") by a few percentage points on the smallish [O(10^5) points] datasets I've been using for initial development/testing. The data set I'll ultimately be using for training is

Re: [Scikit-learn-general] Online learning

2012-07-14 Thread Fred Mailhot
On 14 July 2012 04:22, Olivier Grisel wrote: > 2012/7/13 Abhi : > > Hello, > >My problem is to classify a set of 200k+ emails into approx. 2800 > categories. > > Currently the method I am using is calculating tfidf and using > LinearSVC() > > [with a good accuracy of 98%] for classification

[Scikit-learn-general] SGDClassifier(loss="log")...

2012-06-17 Thread Fred Mailhot
Dear all, Just *bump*ing my last two questions. Apologies if this is considered poor etiquette... Thanks! -- Forwarded message -- From: Fred Mailhot Date: 15 June 2012 17:22 [...] 1) I'd like to compute the class probs; are the probs for the individual OvR classifiers (e

Re: [Scikit-learn-general] LogisticRegression versus SGDClassifier(loss="log")?

2012-06-15 Thread Fred Mailhot
e than 50% of your RAM) you'll run > into troubles. > > best, > Peter > > > 2012/6/15 Fred Mailhot : > > Dear all, > > > > What are the advantages of choosing one of the Subject line classifiers > over > > the other? At a quick gl

[Scikit-learn-general] LogisticRegression versus SGDClassifier(loss="log")?

2012-06-15 Thread Fred Mailhot
Dear all, What are the advantages of choosing one of the Subject line classifiers over the other? At a quick glance, I see the following: - LogisticRegression implements predict_proba for the multiclass case, while SGDClassifier doesn't - SGDClassifier(loss="log") lets you specify multiple CPUs f