Re: [Scikit-learn-general] Scikit-learn scalability options ?

2013-02-07 Thread Vinay B,
Hi, I tried again. I feel I'm doing something wrong in my code so far. In any case, the print loop I added was:

doc_idx = 0
for cluster_doc_filename in file_names:
    #predicted_cluster = km.predict(cluster_doc_filename)
    predicted_cluster = km.predict(doc_idx)  # passin…
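
[Note for readers of the archive: the loop above passes a filename (or a bare index) to km.predict, which instead expects a feature matrix. A minimal sketch of the intended loop, assuming `vectorizer` is the same fitted vectorizer used to train `km` and `file_names` is a list of document paths:]

# Sketch: vectorize each document's text with the fitted vectorizer,
# then ask the fitted k-means model for its cluster.
for cluster_doc_filename in file_names:
    with open(cluster_doc_filename, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    X_doc = vectorizer.transform([text])      # sparse matrix, shape (1, n_features)
    predicted_cluster = km.predict(X_doc)[0]  # integer cluster label
    print(cluster_doc_filename, "->", predicted_cluster)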

Re: [Scikit-learn-general] Scikit-learn scalability options ?

2013-02-07 Thread Vinay B,
…current master; there was a bug in MiniBatchKMeans in the release.

"Vinay B," wrote:
> So I tried your recommendations. The partial fit seems to operate to an
> extent. Then BOOM! It looks very similar to the example in …

Re: [Scikit-learn-general] Scikit-learn scalability options ?

2013-02-07 Thread Vinay B,
…ackages/sklearn/cluster/k_means_.py", line 888, in _mini_batch_step
    centers[to_reassign] = new_centers
ValueError: setting an array element with a sequence.

On Thu, Feb 7, 2013 at 4:06 AM, Olivier Grisel wrote:
> 2013/2/6 Vinay B,:
> > Hi
> > Almost there (I hope), but not …
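
[The traceback points at the random center-reassignment step of the released MiniBatchKMeans. Besides upgrading to current master as suggested in the thread, a possible workaround, an assumption not confirmed in the thread, is to disable reassignment entirely:]

from sklearn.cluster import MiniBatchKMeans

# Sketch: with reassignment_ratio=0.0 no low-count centers are ever
# reassigned, so the `centers[to_reassign] = new_centers` branch in
# _mini_batch_step is not taken. Upgrading scikit-learn is the real fix.
km = MiniBatchKMeans(n_clusters=10, reassignment_ratio=0.0, random_state=42)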

Re: [Scikit-learn-general] Scikit-learn scalability options ?

2013-02-06 Thread Vinay B,
…ative=False, norm=l2, preprocessor=None, stop_words=english,
strip_accents=None, token_pattern=(?u)\b\w\w+\b, tokenizer=None)

Thanks,
Vinay

On Wed, Feb 6, 2013 at 2:40 AM, Olivier Grisel wrote:
> 2013/2/6 Vinay B,:
> > Hi Olivier,
> > Looking at the …
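
[The snippet above is the repr of a fitted vectorizer. A minimal sketch of constructing a vectorizer with the settings that are readable in it; the truncated leading parameter is unknown, so only the visible ones are set:]

from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch: stop_words='english' and norm='l2' match the repr above; the
# default token_pattern (?u)\b\w\w+\b keeps tokens of 2+ word characters.
vectorizer = TfidfVectorizer(stop_words='english', norm='l2')
X = vectorizer.fit_transform(["first example document", "second document"])
print(X.shape)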

Re: [Scikit-learn-general] Scikit-learn scalability options ?

2013-02-05 Thread Vinay B,
> …sklearn.feature_extraction.FeatureHasher for streams of categorical
> data (e.g. a list of Python dicts).
> Have a look at the documentation here:
> http://scikit-learn.org/dev/modules/feature_extraction.html
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel …
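
[A minimal sketch of the FeatureHasher usage Olivier refers to; the feature names and values are made up for illustration:]

from sklearn.feature_extraction import FeatureHasher

# Sketch: FeatureHasher maps a stream of dicts onto a fixed-width sparse
# matrix via hashing, so no vocabulary has to be held in memory.
hasher = FeatureHasher(n_features=2**18, input_type='dict')
X = hasher.transform([{'word': 1, 'another': 2}, {'word': 3}])
print(X.shape)  # (2, 262144), a scipy.sparse matrix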

[Scikit-learn-general] Scikit-learn scalability options ?

2013-02-05 Thread Vinay B,
Hi, from my newbie experiments last week, it appears that scikit-learn loads all documents into memory, both for classification (training and testing) and for clustering. This approach might not scale to the millions of (text) docs that I want to process. 1. Is there a recommended way to deal with large datasets? Exam…
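
[The approach that emerges in the replies is out-of-core learning: a stateless HashingVectorizer combined with an estimator that supports partial_fit. A minimal sketch, where `all_file_paths` and the batching helper are assumptions for illustration:]

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.cluster import MiniBatchKMeans

# Sketch: HashingVectorizer is stateless, so documents can be vectorized
# batch by batch without ever holding the full corpus or a vocabulary.
vectorizer = HashingVectorizer(n_features=2**18, stop_words='english')
km = MiniBatchKMeans(n_clusters=10, random_state=42)

def iter_batches(paths, batch_size=1000):
    """Yield lists of document texts, batch_size at a time (assumed helper)."""
    batch = []
    for path in paths:
        with open(path, encoding='utf-8', errors='ignore') as f:
            batch.append(f.read())
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

for texts in iter_batches(all_file_paths):  # all_file_paths: assumed list of paths
    X_batch = vectorizer.transform(texts)
    km.partial_fit(X_batch)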

[Scikit-learn-general] Error when chosing large number of clusters

2013-02-01 Thread Vinay B,
From the scikit-learn script at http://scikit-learn.org/dev/_downloads/document_clustering.py, it appears that the number of clusters is set to the number of newsgroup subfolders. I'm guessing that's done more out of convenience. On the other hand, users should be able to set an arbitrary number o…
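
[For reference, nothing in scikit-learn ties n_clusters to the folder count; the example only does so to compare against the true newsgroup labels. A sketch with an arbitrary cluster count (25 is made up):]

from sklearn.cluster import MiniBatchKMeans

# Sketch: n_clusters is a free parameter; the other values mirror the
# document_clustering.py example, only the cluster count is arbitrary.
km = MiniBatchKMeans(n_clusters=25, init='k-means++', n_init=1,
                     init_size=1000, batch_size=1000)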

[Scikit-learn-general] Cluster Terms Output

2013-02-01 Thread Vinay B,
I thought I'd clarify this question in a separate thread. Each individual cluster is usually associated with a set of significant terms. For example, a Mahout k-means clustering of the Reuters-21578 dataset yields output like this: VL-21566{n=2 c=[1,000:2.589, 1.9:2.974, 10:2.289, 14:1.568, …
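
[In scikit-learn the analogous per-cluster terms can be read off the centroids. A minimal sketch, assuming `km` is a fitted (MiniBatch)KMeans and `vectorizer` the fitted TfidfVectorizer that produced its training matrix:]

# Sketch: each centroid's largest coordinates correspond to the terms
# that weigh most heavily in that cluster.
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()  # get_feature_names() in old releases
for i in range(km.n_clusters):
    top = [terms[ind] for ind in order_centroids[i, :10]]
    print("Cluster %d: %s" % (i, ", ".join(top)))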

[Scikit-learn-general] Fwd: Text document clustering: How can I access the actual clustered documents

2013-02-01 Thread Vinay B,
…ews-bydate-test/alt.atheism/53413
3 : /home/vinayb/scikit_learn_data/20news_home/20news-bydate-train/alt.atheism/53167
3 : /home/vinayb/scikit_learn_data/20news_home/20news-bydate-train/alt.atheism/51241
...

---------- Forwarded message ----------
From: Vinay B,
Date: Thu, Jan 31, 2013 at 5:20 …

[Scikit-learn-general] Text document clustering: How can I access the actual clustered documents

2013-01-31 Thread Vinay B,
Another newbie question. I'm not referring to a confusion matrix or similar summary. Rather, if I had a number of documents clustered (using, say, KMeans) into 3 clusters, how could I access:
1. each cluster and a list of its cluster terms?
2. a list of documents associated with each cluster?
Thanks
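
[Both questions have short answers once the model is fitted: the cluster terms come from the centroids (see the sketch under the "Cluster Terms Output" entry above), and the document-to-cluster mapping comes from labels_. A sketch, assuming `file_names` is a list of paths parallel to the rows the model was fitted on:]

from collections import defaultdict

# Sketch: km.labels_[i] is the cluster assigned to the i-th training row,
# so zipping with a parallel list of file names groups documents by cluster.
docs_by_cluster = defaultdict(list)
for filename, label in zip(file_names, km.labels_):
    docs_by_cluster[label].append(filename)

for label in sorted(docs_by_cluster):
    print("cluster %d: %d documents" % (label, len(docs_by_cluster[label])))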

[Scikit-learn-general] Text data training: UnicodeDecodeError

2013-01-31 Thread Vinay B,
Hi, I'm new to scikit-learn and Python (though not to programming) and am working my way through the examples. Aim: train a model on textual data and use the trained model to classify individual text files. Issue: I end up with Unicode errors: UnicodeDecodeError: 'utf8' codec can't decode by…
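
[The usual cause is corpus files that are not valid UTF-8. The text vectorizers expose decoding options; a sketch with the modern parameter names (0.13-era releases called them charset and charset_error), where `raw_documents` is an assumed list of strings or bytes:]

from sklearn.feature_extraction.text import TfidfVectorizer

# Sketch: either declare the corpus's real encoding or drop bad bytes.
vectorizer = TfidfVectorizer(encoding='latin-1')        # decode as Latin-1
# vectorizer = TfidfVectorizer(decode_error='ignore')  # or: skip bad bytes
X = vectorizer.fit_transform(raw_documents)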