Andreas Mueller wrote:
> What's the type of self.custom?
>
> Also, you can step into the debugger to see which function it is that can
> not be pickled.
>
>
>
>
> On 04/05/2016 04:14 PM, Fred Mailhot wrote:
>
> Hi all,
>
> I've got a pipeline with some
Hi all,
I've got a pipeline with some custom transformers that's not pickling, and
I'm not sure why. I've had this previously when using custom preprocessors
& tokenizers with CountVectorizers. I dealt with it then by defining the
custom bits at the module level.
I assumed I could avoid that by c
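The module-level workaround works because pickle stores a function by its importable name, not by value. A stdlib-only sketch of the difference (the class and names here are made up for illustration, not from the original pipeline):

```python
import pickle

# Module-level function: pickle records it by qualified name, so it round-trips.
def tokenize(text):
    return text.split()

class CustomTransformer:
    """Stand-in for a pipeline step that holds a user-supplied callable."""
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

ok = pickle.loads(pickle.dumps(CustomTransformer(tokenize)))
print(ok.tokenizer("a b"))  # ['a', 'b']

# A lambda (or a def nested inside another function) has no importable
# name, so pickling the transformer that holds it fails.
try:
    pickle.dumps(CustomTransformer(lambda text: text.split()))
except Exception as exc:
    print(type(exc).__name__)
```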
I imagine a lot of people might be interested in this, but be in a position
where they need to justify bringing in a new package that mimics sklearn,
rather than just using the linear models that are already available there.
Could you say a bit more about how/why this is better?
Thanks!
Fred.
On M
e
> estimator (with any fitted model discarded) in constructing ensembles,
> cross validation, etc. While none of the scikit-learn library of estimators
> do this, in practice you can overload get_params to define your own
> parameter listing. See
> http://scikit-learn.org/stable/devel
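A minimal stand-in for that get_params overload, without sklearn itself (class and parameter names are hypothetical):

```python
class MyEstimator:
    """Minimal estimator-style class with an explicit get_params listing."""
    def __init__(self, alpha=1.0, fit_intercept=True):
        self.alpha = alpha
        self.fit_intercept = fit_intercept

    def get_params(self, deep=True):
        # Overloaded to enumerate parameters explicitly, instead of relying
        # on scikit-learn's introspection of the __init__ signature.
        return {"alpha": self.alpha, "fit_intercept": self.fit_intercept}

    def set_params(self, **params):
        for name, value in params.items():
            setattr(self, name, value)
        return self

est = MyEstimator(alpha=0.5)
print(est.get_params())  # {'alpha': 0.5, 'fit_intercept': True}
```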
Hello list,
Firstly, thanks for this incredible package; I use it daily at work. Now on
to the meat: I'm trying to subclass TfidfVectorizer and running into
issues. I want to specify an extra param for __init__() that points to a
file that gets used in build_analyzer(). Skipping irrelevant bits, I
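One way such a subclass can look, assuming a recent scikit-learn; `FileFilteredTfidf` and `wordlist_path` are made-up names, and the minimal explicit `__init__` signature is deliberate, since get_params()/clone() introspect it:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

class FileFilteredTfidf(TfidfVectorizer):
    """Hypothetical subclass: keep only tokens listed in wordlist_path."""
    def __init__(self, wordlist_path=None):
        # Store the argument under the same attribute name it has in the
        # signature; scikit-learn's get_params()/clone() depend on that.
        super().__init__()
        self.wordlist_path = wordlist_path

    def build_analyzer(self):
        base = super().build_analyzer()
        if self.wordlist_path is None:
            return base
        with open(self.wordlist_path) as fh:
            keep = {line.strip() for line in fh if line.strip()}
        return lambda doc: [tok for tok in base(doc) if tok in keep]
```

Caveat: with this minimal signature, the inherited TfidfVectorizer parameters can no longer be set at construction; list them explicitly in `__init__` if you need them grid-searchable too.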
t, only space between
>> terms.
>>
>> Best,
>> Ehsan
>>
>>
>> On Thu, Nov 19, 2015 at 11:13 AM, Fred Mailhot
>> wrote:
>>
>>> Have you checked that your other program tokenizes the same way as the
>>> default sklearn tokeniza
Have you checked that your other program tokenizes the same way as the
default sklearn tokenization?
On 19 November 2015 at 11:09, Ehsan Asgari wrote:
> Hi,
>
> Thank you, but it didn't work.
> I checked len(tf.vocabulary_) and it is also 1900 instead of 1914.
> I have another program that cou
://www.google.com/patents/US9037464
> Filed on 15 March 2013
>
> On Thu, Jul 2, 2015 at 4:03 AM, Matthieu Brucher <
> matthieu.bruc...@gmail.com> wrote:
>
>> 2015-07-01 19:43 GMT+01:00 Andreas Mueller :
>> >
>> >
>> > On 07/01/2015 02:42 PM, Lars B
actually an answer to my
question.
FM.
On 1 July 2015 at 11:42, Lars Buitinck wrote:
> 2015-07-01 16:27 GMT+02:00 Fred Mailhot :
> > 2) The gensim implementation predates the patenting
>
> Does that matter?
>
>
>
1) The upshot seems to be that it's a defensive patent, and in any case the
code was released under Apache 2.0, so it's fine to use.
https://code.google.com/p/word2vec/
https://groups.google.com/forum/#!topic/word2vec-toolkit/1hID9F74_Ho
2) The gensim implementation predates the patenting
(thanks
Tangent: Are we even allowed to use word2vec anymore, now that Goog has
patented it? (in any case, I'll be looking a bit more closely at GloVe)
F.
On 30 June 2015 at 19:26, Mathieu Blondel wrote:
> For unsupervised models that take a long time to train, such as deep
> learning or word2vec based
Parenthesis error in the estimators list?
estimators = [('my_regressor', myRegressor(blahblah)),
...]
On 19 May 2015 at 15:47, Pagliari, Roberto wrote:
> I'm trying to add a custom regressor to a pipeline.
> For debugging purposes I commented everything out.
>
> class m
Hi all,
It appears that FeatureUnion.transformer_weights isn't exposed by the
get_params() method, which in turn means that it isn't grid-searchable,
which seems unfortunate to me (I've had cause to do so manually recently,
and wished it could be automated).
Is this something that other people ar
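For what it's worth, in later scikit-learn releases transformer_weights is a constructor argument, so whole weight dicts can be listed as grid candidates (the step names and data here are made up):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline

pipe = Pipeline([
    ("union", FeatureUnion([("pca", PCA(n_components=2)),
                            ("kbest", SelectKBest(k=1))])),
    ("clf", LogisticRegression()),
])

# Each grid candidate is a complete weight dict for the union's transformers.
param_grid = {
    "union__transformer_weights": [
        {"pca": 1.0, "kbest": 1.0},
        {"pca": 0.5, "kbest": 1.0},
    ],
}

X, y = make_classification(n_samples=40, n_features=5, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=2).fit(X, y)
print(search.best_params_)
```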
I think possibly you want the TfidfTransformer, *before* the
HashingVectorizer...BUT...the documentation for the HashingVectorizer
appears to discount the possibility of IDF-weighting:
http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
On 7 Ma
A good MI-based feature selector would be welcome, I think. Well, by me,
anyway.
On 23 February 2015 at 09:37, Andy wrote:
> Hi Cecilia.
> An MI estimate currently seems a bit out of scope of sklearn.
> What context would a user apply it in?
> Sklearn currently contains more out-of-the-box meth
I'm going to be at the ML+NLP workshop.
On 18 November 2014 07:32, Mathieu Blondel wrote:
> Hi,
>
> Anyone from the mailing-list going to NIPS this year?
>
> See you there,
> Mathieu
>
>
Is your aim to use this information for feature selection, or do you
actually want to see which features are being maximally weighted? There's a
SO question that addresses the latter use:
http://stackoverflow.com/questions/6697/how-to-get-most-informative-features-for-scikit-learn-classifiers
There are a few implementations of DTW in Cython floating around...I think
mblondel has one. Maybe you could tweak one of these and see whether it
yields a useful speed-up?
https://github.com/SnippyHolloW/DTW_Cython
http://www.mblondel.org/journal/2009/08/31/dynamic-time-warping-theory/
https://gi
On 19 December 2013 15:16, Olivier Grisel wrote:
> [...]
> But on the other hand that makes it possible to [...] to memory map the
> large parameter
> arrays by passing mmap_mode='r' to joblib.load for instance.
>
> Memory mapping can be useful to share the memory of models loaded in
> several py
Use the same DictVectorizer that you called fit_transform() on with the
training data, but just call transform() for the test data...
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer()
train_feats = dv.fit_transform(train_feature_dict)
test_feats = dv.transform(test_feature_dict)
On 15 October 2013 03:52, Lars Buitinck w
On 14 October 2013 20:48, Robert McGibbon wrote:
[...]
>
> p.s. core devs: pretty please don't remove the HMM code from the scikit :)
>
+1E6
Hi list,
Just wondering whether anyone on here in planning on attending EMNLP. I'll
be there, and as a heavy user (and hopeful eventual contributor), I'd love
to meet with some of you.
Fred.
FYI, I've used sklearn's LogisticRegression in an online/real-time text
classification app without having to dig into the internals and gotten
~2.5ms response time (including vectorizing; vocab size ~200k).
On 23 September 2013 06:37, Peter Prettenhofer wrote:
> We don't have a PMML interface y
Hello list...
I'm a huge fan of sklearn and use it daily at work. I was confused by the
results of some recent text classification experiments and started looking
more closely at the vectorization code.
I'm wondering about the logic behind:
1) not doing stopword removal for the char_wb analyzer
Oh, right (duh)...I wasn't thinking clearly about the padding for char_wb.
I'll do some tests with stopword removal for char_wb and submit a PR if it
looks worthwhile.
Cheers,
Fred.
On 19 July 2013 13:27, Olivier Grisel wrote:
> 2013/7/19 Fred Mailhot :
> > Hello
On 12 July 2013 09:48, Lars Buitinck wrote:
> 2013/7/11 Tom Fawcett :
> [...]
>
> I guess because it's terribly slow. I recently tried to cluster a
> sample of Wikipedia text at the word level.
What kind of results did you get? I did some work recently clustering
short-form text and was general
riting a book would
>> probably mean quitting jobs
>> for a couple of month, stalling research and basically not making any
>> money (From what I read, writing an O'Reilly book
>> pays less than any research position).
>>
>> So I don't see that happeni
Hi list,
Is anyone working on a book showcasing scikit-learn? I'm thinking something
along the lines of "Mahout In Action", that would showcase each of the
parts of scikit-learn and provide a dead-tree reference with a lot of
worked-out examples. I suppose it would make sense to wait for a 1.0
rel
I just had the same issue recently. It's been fixed in the dev (0.14)
branch. If you pull/build/install that, everything should be fine.
F.
On 1 February 2013 13:40, Vinay B, wrote:
> From the scikit script at
> http://scikit-learn.org/dev/_downloads/document_clustering.py , it
> appears as t
Given a fitted KMeans named "km", and a numpy array of documents, to get a
list of documents associated with cluster i:
documents[np.where(km.labels_ == i)]
Not sure what you mean by "a list of cluster terms", though (a list of all
terms from all docs associated with a given cluster?)...
On 31
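A runnable version of the indexing above, with toy stand-in features in place of vectorized documents (numpy and sklearn assumed):

```python
import numpy as np
from sklearn.cluster import KMeans

documents = np.array(["apple pie", "banana bread", "cat video", "dog video"])
features = np.array([[0.0], [0.2], [5.0], [5.2]])  # stand-in for vectorized docs

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
i = km.labels_[2]  # cluster containing "cat video"
print(documents[np.where(km.labels_ == i)])
```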
On 15 November 2012 23:20, Andreas Mueller wrote:
> [...]
> You can give GridSearchCV not only a grid but also a list of grids.
> I would go with that.
> (is that sufficiently documented?)
>
This doesn't appear to be documented (at least not at
http://scikit-learn.org/dev/modules/generated/sklearn
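For reference, the list-of-grids usage looks like this (module path is from current scikit-learn, where grid_search moved to model_selection; the toy data is made up):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# A list of grids: each dict is searched independently, so incompatible
# combinations (e.g. gamma together with a linear kernel) never occur.
param_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10], "gamma": [0.01, 0.1]},
]

X = np.repeat([[0, 0], [1, 1]], 10, axis=0)
y = np.repeat([0, 1], 10)
search = GridSearchCV(SVC(), param_grid, cv=2).fit(X, y)
print(search.best_params_)
```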
arning with Scikit? I have a data set that is >
> 20gb that I want to train on I don't think I can do that easily, so
> what should I do?
>
> Thanks,
> Shomiron Ghose
>
>
> On 15 November 2012 15:45, Fred Mailhot wrote:
>
>> Dear list,
>>
>&
Thanks to all for the tips on GridSearch with FeatureUnion, I'll be trying
those out today. And @amueller I've been following the development of your
PR for the random sampling of param space with great interest.
But back to the initial problem...it seems that an empty input is the
cause. My raw d
the error is related to n_jobs, not a specific classifier?
> Could you run with n_jobs=1 and a very small training set (like 100
> examples or something)
> and see if it runs through?
> (Actually I'm totally clueless but that doesn't look like a
> multiprocessing error to me
sr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version
159.1.0)
Thanks,
Fred.
On 15 November 2012 12:56, Andreas Mueller wrote:
> Hi Fred.
> The link is dead for me.
> Do you link against Accelerate (not sure if this is relevant)?
>
> Cheers,
> Andy
>
&
Dear list,
I'm using GridSearchCV to do some simple model selection for a text
classification task. I've got it working (see below for caveat), but I'm
not convinced that I'm making the best use of this tool. If someone has the
time/inclination, I'd love a set of eyes to check the following gist t
On 26 October 2012 16:58, Richard T. Guy wrote:
> Hey Scikit-Learn,
>
> I've been working on some changes to the RandomForest code and I had a
> few questions.
>
> First, it looks like the function
> def _partition_features(forest, n_total_features):
> partitions features evenly across cores. Am
Hi all,
I've got a text classification problem on which LogisticRegression
consistently outperforms SGDClassifier(loss="log") by a few percentage
points on the smallish [O(10^5) points] datasets I've been using for
initial development/testing. The data set I'll ultimately be using for
training is
On 14 July 2012 04:22, Olivier Grisel wrote:
> 2012/7/13 Abhi :
> > Hello,
> >My problem is to classify a set of 200k+ emails into approx. 2800
> categories.
> > Currently the method I am using is calculating tfidf and using
> LinearSVC()
> > [with a good accuracy of 98%] for classification
Dear all,
Just *bump*ing my last two questions. Apologies if this is considered poor
etiquette...
Thanks!
-- Forwarded message --
From: Fred Mailhot
Date: 15 June 2012 17:22
[...]
1) I'd like to compute the class probs; are the probs for the individual
OvR classifiers (e
e than 50% of your RAM) you'll run
> into troubles.
>
> best,
> Peter
>
>
> 2012/6/15 Fred Mailhot :
> > Dear all,
> >
> > What are the advantages of choosing one of the Subject line classifiers
> over
> > the other? At a quick gl
Dear all,
What are the advantages of choosing one of the Subject line classifiers
over the other? At a quick glance, I see the following:
- LogisticRegression implements predict_proba for the multiclass case,
while SGDClassifier doesn't
- SGDClassifier(loss="log") lets you specify multiple CPUs f