> > First the documents are normalized, then normalized sums of weights
> > are computed instead of computing the word count. This is the key
> > step which boosts the classification accuracy on text. I can move
> > this to the document vectorizer.
>
> And the idf weighting can be done on-line or in two passes. The
> two-pass approach is more precise, but not necessarily by much. A
> compromise is also possible where the first pass runs over a small
> subset of documents (say 10,000 docs). That keeps it really fast, and
> that dictionary can be used as the seed for the adaptive weighting
> (or just used directly).

We already do the two passes in the tf-idf stage of the dictionary
vectorizer. What I need for CNB are the sums of weights of features and
labels, and the total sum of weights of all cells in the matrix.
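Those three aggregates can be sketched as follows; this is a minimal illustration, not Mahout's API (the function name and the (label, {feature: weight}) input layout are assumptions for the example):

```python
from collections import defaultdict

def cnb_weight_sums(docs):
    """Accumulate the three aggregates the complement model needs:
    per-feature weight sums, per-label weight sums, and the total
    weight of all cells in the (doc x feature) matrix.

    `docs` is an iterable of (label, {feature_id: tfidf_weight}) pairs.
    """
    feature_sums = defaultdict(float)  # sum of weights of each feature over all docs
    label_sums = defaultdict(float)    # sum of weights of all features under each label
    total = 0.0                        # sum over every cell of the matrix
    for label, vector in docs:
        for feat, w in vector.items():
            feature_sums[feat] += w
            label_sums[label] += w
            total += w
    return feature_sums, label_sums, total

docs = [
    ("spam", {"buy": 2.0, "now": 1.0}),
    ("ham",  {"meeting": 1.5, "now": 0.5}),
]
feats, labels, total = cnb_weight_sums(docs)
# total == 5.0; labels["spam"] == 3.0; feats["now"] == 1.5
```

With these sums in hand, the complement statistics (weight of a feature outside a given label) fall out by subtraction, which is why only the aggregates need to come out of the vectorization stage.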
> > With this new vectorization, can we hash sparse features into a
> > particular id range and ask the tf-idf job to compute tf-idf for just
> > that portion? That way I can delete the tf-idf calculation code from
> > CNB. This can exist as a separate vectorizer, and both clustering and
> > classification can use it. It will partially kill its online nature;
> > we can circumvent that by using a document-frequency map to compute
> > approximate tf-idf during the online stage.
>
> I think that you misunderstand me a little bit, and I know that I am
> not understanding what you are saying here.

Okay, let's take an example. Say you have users with a text bio and
features like age and weight. The text is sparse and we need to apply
tf-idf on it, while we should not on age and weight. So in this case, we
need to hash the text into some id range and do a one-pass or two-pass
idf calculation in that range. We need to leave the other features
alone, right? Otherwise idf will squash them to log(1) = 0, since a
feature present in every document has a document frequency equal to the
corpus size.
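The example above can be sketched roughly as below. This is a toy two-pass version, not Mahout's vectorizer; `TEXT_RANGE`, `DENSE_BASE`, and `vectorize` are hypothetical names. Bio tokens are hashed into a reserved id range, idf weighting is applied only inside that range, and the dense features keep their raw values:

```python
import math

TEXT_RANGE = 1 << 16     # text tokens hash into ids [0, TEXT_RANGE)
DENSE_BASE = TEXT_RANGE  # dense features get fixed ids above that range

def vectorize(users):
    """First pass: hash bio tokens into the reserved range and count
    document frequency there. Second pass: weight only those ids with
    idf. Dense features (age, weight) are left alone, since idf over a
    feature present in every document would just be log(1) = 0.
    `users` is a list of (bio_text, age, weight) tuples."""
    n = len(users)
    df = {}           # document frequency, text ids only
    tf_vectors = []
    for bio, age, weight in users:
        vec = {}
        for token in bio.split():
            fid = hash(token) % TEXT_RANGE
            vec[fid] = vec.get(fid, 0) + 1
        for fid in vec:
            df[fid] = df.get(fid, 0) + 1
        vec[DENSE_BASE] = age           # outside the hashed range: untouched
        vec[DENSE_BASE + 1] = weight
        tf_vectors.append(vec)
    # second pass: idf applies only to ids inside the hashed text range
    # (smoothed with +1 so a ubiquitous term keeps its raw tf)
    for vec in tf_vectors:
        for fid in vec:
            if fid < TEXT_RANGE:
                vec[fid] *= math.log(n / df[fid]) + 1.0
    return tf_vectors

users = [("loves hiking and coffee", 34, 70.0),
         ("coffee and code", 28, 65.5)]
vectors = vectorize(users)
```

The one-pass variant would replace `df` with the seeded document-frequency map mentioned earlier, looking up approximate counts instead of making a counting pass over the corpus.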
