Hello, I think my problem is related to the fact that the dataset is really unbalanced. My three classes have 550k, 150k and 70k documents. Naive Bayes also bases its classification on the prior probability of a class c over all documents, so this imbalance is probably making a big difference.
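To make the prior term concrete, here is a minimal Python sketch of how a textbook multinomial naive Bayes would estimate log P(c) from the class sizes above. The labels "A", "B" and "C" and the function name log_priors are placeholders I made up, and this is not necessarily how Mahout computes or weights the prior internally.

```python
import math

# Class sizes quoted in this message; the labels "A", "B", "C" are
# hypothetical placeholders for the three real class names.
class_counts = {"A": 550_000, "B": 150_000, "C": 70_000}


def log_priors(counts):
    """Estimate log P(c) as log(count(c) / total number of documents)."""
    total = sum(counts.values())
    return {label: math.log(n / total) for label, n in counts.items()}


priors = log_priors(class_counts)

# Head start (in nats) that the largest class has over the smallest one
# before any word evidence is added to the score.
head_start = priors["A"] - priors["C"]

for label, lp in priors.items():
    print(f"{label}: P(c) = {math.exp(lp):.3f}, log P(c) = {lp:.3f}")
print(f"log-prior gap between A and C: {head_start:.3f}")
```

With these counts the priors come out to roughly 71%, 19% and 9%, so a document with weak word evidence would tend to fall into the largest class. Whether that actually explains my results is something I would still need to check empirically.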
Lucas, I'm only using the pre-processing available through seq2sparse: a minimum word frequency and a maximum document frequency percentage (which works as a stop list). And yes, I'm using the tf-idf vectors for training and testing. I had actually never heard of PCA or LDA. I'll take a look at them.

Thanks

2013/12/8 Lucas Fernandes Brunialti <lbrunia...@igcorp.com.br>

> Hi,
>
> Fernando, to get a better understanding of correlation, you could think of
> features as events in probability, then if the probability of the
> intersection is high, the events are highly correlated...
>
> I agree with Ted. But usually, naive bayes works well with text
> classification when you have a good pre-processing phase, using pca, tf-idf
> or lda... Are you doing any pre-processing?
> On Dec 8, 2013 3:25 PM, "Ted Dunning" <ted.dunn...@gmail.com> wrote:
>
> >
> > The problem of correlation of features is clearly present in text, but it
> > is not so clear what the effect will be. For naive bayes this has the
> > effect of making the classifier overconfident but it usually still works
> > reasonably well. For logistic regression without regularization it can
> > cause the learning algorithm to fail (mahout's logistic regression is
> > regularized, btw).
> >
> > Empirical evidence dominates theory in this situation.
> >
> > Sent from my iPhone
> >
> > > On Dec 8, 2013, at 9:14, Fernando Santos <fernandoleandro1...@gmail.com>
> > > wrote:
> > >
> > > Now just a theoretical doubt. In a text classification example, what
> > > would it mean to have features that are highly correlated? I mean, in
> > > this case our features are basically words, do you have an example of
> > > how these features can not be independent? This concept is not really
> > > clear in my mind...

--
Fernando Santos
+55 61 8129 8505