Hi,
I tried again. I feel I'm still doing something wrong in my code. In any
case, the print loop I added was:
doc_idx = 0
for cluster_doc_filename in file_names:
    # predict() expects a feature vector, not a filename or a bare index,
    # so pass in the vectorized row for the current document (X: the
    # vectorized corpus)
    #predicted_cluster = km.predict(cluster_doc_filename)
    predicted_cluster = km.predict(X[doc_idx])
    doc_idx += 1
> ...current master: there was a bug in minibatch k-means in
> the release.
>
> "Vinay B," schrieb:
>>
>> So I tried your recommendations. The partial_fit seems to work to an
>> extent. Then BOOM! It looks very similar to the example in
>>
  File "...ackages/sklearn/cluster/k_means_.py", line 888, in _mini_batch_step
    centers[to_reassign] = new_centers
ValueError: setting an array element with a sequence.
On Thu, Feb 7, 2013 at 4:06 AM, Olivier Grisel wrote:
> 2013/2/6 Vinay B, :
> > Hi
> > Almost there (I hope), but not
ative=False, norm=l2, preprocessor=None,
stop_words=english, strip_accents=None,
token_pattern=(?u)\b\w\w+\b, tokenizer=None)
Thanks
Vinay
On Wed, Feb 6, 2013 at 2:40 AM, Olivier Grisel wrote:
> 2013/2/6 Vinay B, :
> >
> > Hi Olivier,
> > Looking at the
> ...sklearn.feature_extraction.FeatureHasher for streams of categorical
> data (e.g. a list of python dicts).
> Have a look at the documentation here:
> http://scikit-learn.org/dev/modules/feature_extraction.html
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
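For reference, a minimal FeatureHasher sketch (the record dicts and the
n_features value below are made up for illustration):

    from sklearn.feature_extraction import FeatureHasher

    # Hash a stream of categorical records (python dicts) into a
    # fixed-width sparse matrix; no vocabulary is kept in memory.
    hasher = FeatureHasher(n_features=2 ** 20)
    records = [{"word": 1, "another": 2}, {"word": 3}]  # hypothetical input
    X = hasher.transform(records)  # scipy.sparse matrix, shape (2, 2 ** 20)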
Hi,
From my newbie experiments last week, it appears that scikit-learn loads
all documents into memory, both for classification (training and testing)
and for clustering. This approach might not scale to the millions of
(text) docs that I want to process.
1. Is there a recommended way to deal with large datasets? Exam
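For what it's worth, a minimal out-of-core sketch along the lines Olivier
suggests above, assuming HashingVectorizer plus MiniBatchKMeans.partial_fit
(the batching helper and the file listing below are hypothetical):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.cluster import MiniBatchKMeans

    vectorizer = HashingVectorizer(n_features=2 ** 18, stop_words='english')
    km = MiniBatchKMeans(n_clusters=20)

    def iter_batches(filenames, batch_size=1000):
        # Yield lists of raw document strings, one batch at a time,
        # so the full corpus never sits in memory.
        batch = []
        for fname in filenames:
            with open(fname) as f:
                batch.append(f.read())
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

    for docs in iter_batches(file_names):  # file_names: your own listing
        X = vectorizer.transform(docs)     # stateless, no fit() needed
        km.partial_fit(X)                  # update centroids incrementally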
From the scikit-learn script at
http://scikit-learn.org/dev/_downloads/document_clustering.py , it
appears that the number of clusters is set to the number of newsgroup
subfolders. I'm guessing that's done more out of convenience. On the
other hand, users should be able to set an arbitrary number of clusters.
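That is indeed just a convenience; n_clusters is an ordinary parameter,
e.g. (a sketch, with an arbitrary value):

    from sklearn.cluster import KMeans

    # Any positive integer works; the example script merely reuses the
    # number of newsgroup categories as a convenient ground truth.
    km = KMeans(n_clusters=10)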
I thought I'd clarify this question in a separate thread.
Each individual cluster is usually associated with a set of
significant terms. For example, a Mahout k-means clustering run on
the Reuters-21578 dataset yields output like this:
:VL-21566{n=2 c=[1,000:2.589, 1.9:2.974, 10:2.289, 14:1.568,
...ews-bydate-test/alt.atheism/53413
3 : /home/vinayb/scikit_learn_data/20news_home/20news-bydate-train/alt.atheism/53167
3 : /home/vinayb/scikit_learn_data/20news_home/20news-bydate-train/alt.atheism/51241
...
---------- Forwarded message ----------
From: Vinay B,
Date: Thu, Jan 31, 2013 at 5:20
Another newbie question.
I'm not referring to a confusion matrix or similar summary. Rather, if
I had a number of documents clustered into, say, 3 clusters with KMeans,
how could I access (see the sketch after this message):
1. each cluster and a list of its cluster terms?
2. a list of documents associated with each cluster?
Thanks
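In case it helps, a sketch for both, assuming km was fit on a tf-idf
matrix X built by a TfidfVectorizer named vectorizer, and that file_names
lists the documents in the same row order (all names hypothetical):

    import numpy as np

    # 1. top terms per cluster: sort each centroid's coordinates by weight
    terms = vectorizer.get_feature_names()
    order = km.cluster_centers_.argsort()[:, ::-1]  # descending per cluster
    for i in range(km.n_clusters):
        top = [terms[j] for j in order[i, :10]]
        print("cluster %d: %s" % (i, " ".join(top)))

    # 2. documents per cluster: labels_ is aligned with the rows of X,
    #    and hence with file_names
    for i in range(km.n_clusters):
        docs = [file_names[j] for j in np.where(km.labels_ == i)[0]]
        print("cluster %d: %d docs" % (i, len(docs)))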
Hi,
I'm new to scikit-learn and Python (though not to programming) and am
working my way through the examples.
Aim: train a model on textual data and use the trained model to
classify individual text files.
Issue: I end up with Unicode errors: UnicodeDecodeError: 'utf8' codec
can't decode byte
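A common workaround is to tell the vectorizer to skip undecodable bytes,
or to decode each file leniently yourself; a sketch (in recent versions
the parameter is decode_error, in older ones charset_error):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Ignore bytes that are not valid UTF-8 instead of raising
    vectorizer = TfidfVectorizer(decode_error='ignore')

    # Or decode each file defensively before vectorizing
    def read_lenient(path):
        with open(path, 'rb') as f:
            return f.read().decode('utf-8', errors='replace')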