Re: [Scikit-learn-general] No methods seem to predict well

2012-08-11 Thread Zach Bastick
Sorry about that, the RTF reader is from the Pyth library: http://pypi.python.org/pypi/pyth/ I think that's all that's needed. Thanks for taking a look! Zach On 11/08/2012 22:55, Robert Layton wrote: On 12 August 2012 15:35, Zach Bastick wrote: I have tr

Re: [Scikit-learn-general] No methods seem to predict well

2012-08-11 Thread Robert Layton
On 12 August 2012 15:35, Zach Bastick wrote:
> I have tried various machine learning algorithms from scikit learn but
> can't find a good prediction model.
> The features I'm using are the tf-idf of set of text documents,
> correlated with human ratings assigned to each document. I'm thinking
> t

[Scikit-learn-general] No methods seem to predict well

2012-08-11 Thread Zach Bastick
I have tried various machine learning algorithms from scikit-learn but can't find a good prediction model. The features I'm using are the tf-idf of a set of text documents, correlated with human ratings assigned to each document. I'm thinking that I must be doing something wrong as the scores can'

[Scikit-learn-general] encoding with TfidfVectorizer

2012-08-11 Thread Zach Bastick
TfidfVectorizer is giving me an error on some texts that I am importing. I am importing them like this: for location in humanRatedText: if location[-3:].lower() == 'txt': f = open(dir+location, "r") t = f.read() texts.append(t) f.close() if location[-

[Scikit-learn-general] GridSearch over min_n and max_n in CountVectorizer

2012-08-11 Thread Andreas Mueller
Hey everybody. I was just trying to use CountVectorizer, but I have trouble running GridSearch over both max_n and min_n. I guess the problem is that the parameters are conditioned on each other. Is there a nice way to do this? I guess I could generate lists of param_grids, i.e. one for each value
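
The "list of param_grids" idea in the message above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the poster's code: the helper name ngram_param_grids and the "vect__" pipeline-step prefix are hypothetical, and the sketch only builds the grids. It relies on the fact that scikit-learn's grid search accepts a list of dicts as its parameter grid, so each min_n can be paired only with the max_n values that are valid for it.

```python
def ngram_param_grids(max_order, prefix="vect__"):
    """Build one parameter grid per min_n value.

    Each grid pairs a single min_n with every max_n >= min_n, so a grid
    search over the resulting list never sees an invalid combination.
    `prefix` is a hypothetical pipeline step name, not taken from the thread.
    """
    grids = []
    for min_n in range(1, max_order + 1):
        grids.append({
            prefix + "min_n": [min_n],
            prefix + "max_n": list(range(min_n, max_order + 1)),
        })
    return grids


# One dict per min_n; the whole list would be passed as the param grid.
grids = ngram_param_grids(3)
for grid in grids:
    print(grid)
```

The design choice here is to let the grid structure encode the constraint min_n <= max_n, rather than filtering failed fits afterwards.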

[Scikit-learn-general] Text classification and file names output

2012-08-11 Thread Jack Alan
Hi everyone, I'm working on text classification, following the tutorial provided: document_classification_20newsgroups.py I wonder how I can print a list of the names of the documents used in the test folder, with their predicted classes, after the classification process. The output wanted is someho
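
One way to approach the question above is to pair each test file name with the class name of its prediction. The following is a minimal sketch, assuming three parallel sequences are available: the test file names (the datasets returned by scikit-learn's loaders expose a filenames attribute), the integer predictions from the classifier, and the list of class names. The helper name label_documents and the sample values are hypothetical and are not taken from the tutorial script.

```python
def label_documents(filenames, predictions, target_names):
    """Pair each document name with the name of its predicted class.

    `predictions` holds integer class indices into `target_names`,
    in the same order as `filenames`.
    """
    return [
        (str(name), target_names[int(label)])
        for name, label in zip(filenames, predictions)
    ]


# Hypothetical stand-ins for the test set's file names and the model output.
filenames = ["test/doc_a.txt", "test/doc_b.txt"]
predictions = [1, 0]
target_names = ["alt.atheism", "sci.space"]

labelled = label_documents(filenames, predictions, target_names)
for name, predicted_class in labelled:
    print(name + "\t" + predicted_class)
```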