Got it. This really needs to be done before vectorization, but you can segregate the output vector for different handling by passing in views of different parts of the vector.
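To make the view idea concrete, here is a rough sketch in Python rather than in Mahout's own API. Everything named here is made up for illustration: `TEXT_DIM`, `NUMERIC_DIM`, `encode_text`, `encode_numeric`, `vectorize`, the CRC32 hash, and the default weight of 1.0 for unknown words are all assumptions. The text encoder only ever sees a view of the first range and the numeric encoder only sees a view of the second, so the IDF weights never touch age and weight.

```python
import math
import zlib
from array import array

TEXT_DIM = 8      # hypothetical size of the hashed text range
NUMERIC_DIM = 2   # hypothetical numeric range: age, weight


def encode_text(words, idf, out):
    """Add IDF-weighted hashed word counts into the view `out`."""
    for word in words:
        # CRC32 is used only because it is deterministic across runs.
        slot = zlib.crc32(word.encode("utf-8")) % len(out)
        # Words missing from the dictionary get weight 1.0 (an assumption).
        out[slot] += idf.get(word, 1.0)


def encode_numeric(values, out):
    """Copy numeric features into the view `out` with no reweighting."""
    for i, value in enumerate(values):
        out[i] = value


def vectorize(bio, age, weight, idf):
    """Build one result vector; each encoder writes through its own view."""
    vec = array("d", [0.0] * (TEXT_DIM + NUMERIC_DIM))
    with memoryview(vec) as view:
        encode_text(bio.lower().split(), idf, view[:TEXT_DIM])
        encode_numeric([age, weight], view[TEXT_DIM:])
    return list(vec)


# Toy weight dictionary from document frequencies over 4 documents.
# A word that appears in every document gets log(4/4) = log(1) = 0.
doc_freq = {"the": 4, "hiker": 1}
idf = {word: math.log(4 / df) for word, df in doc_freq.items()}

vector = vectorize("the hiker", 31, 70.5, idf)
```

The last two slots of `vector` hold age and weight exactly as passed in, while the first eight hold only the IDF-weighted text contributions.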
My recommendation is that you apply IDF using the weight dictionary in the vectorizer. That will let you have multiple text fields with different weighting schemes but still put all the results into a single result vector. As a side effect, if you put everything into a vector of dimension 1, then you get multi-field weighted inputs for free.

On Tue, Jun 8, 2010 at 11:01 AM, Robin Anil <[email protected]> wrote:

> > I think that you misunderstand me a little bit, and I know that I am not
> > understanding what you are saying here.
> >
>
> Okay.. Let's take an example. Say you have users with text bio and the
> feature age, weight etc.
> text is sparse and we need to apply tfidf on it, while we should not on age
> and weight. So in this case, we need to hash the text into some range and do
> one pass or two pass idf calculation in that range. We need to leave the
> other features alone right. Otherwise by idf they will be squashed log(1)
