Dear All,

Suppose you want to make a prediction using a range of variables, some of which are represented as sets of words (tokens). For example, take a training set with columns x1, x2, ..., x7, y, where y is the value to be predicted (regardless of which model is used for the prediction). Say x4 is a variable made up of the words from a Google search query, so the number of words may differ from one observation to the next. For example, x4 = (how, grow, tree), which can also be represented in hashed form: x4 = (11111, 22222, 33333).
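
To make the hashed form concrete, here is a minimal Python sketch (the same idea carries over to R). The function names, the use of MD5, and the bucket count of 100000 are my own assumptions for illustration; the values 11111, 22222, 33333 above are just placeholders and will not be reproduced by this code.

```python
import hashlib

def hash_token(token, n_buckets=100_000):
    """Map a token to a stable integer in [0, n_buckets).

    MD5 and n_buckets are illustrative assumptions, not a required choice.
    """
    digest = hashlib.md5(token.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_query(tokens, n_buckets=100_000):
    """Hash every token of one observation (one search query)."""
    return [hash_token(t, n_buckets) for t in tokens]

x4 = ["how", "grow", "tree"]      # one observation of x4
x4_hashed = hash_query(x4)        # same length as x4, integers instead of words
print(x4_hashed)
```

The hashed observation still has one element per word, so its length varies with the query, just like the original token list.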

I need to scale this variable (x4) so that I can use it in a model. I was thinking about scaling it with TF-IDF. That way I can represent each observation of x4 as a scaled vector with N elements, like:

x4 = (0.0175105020782697, ..., 0.019135397913606)  // scaled with TF-IDF

However, it still isn't scaled properly (please correct me if I'm wrong), since I need x4 to be represented as a single value for each observation to be able to use it in the model. I assume the result of scaling should look like:

x4 = 0.06789324432  // single value

Do you have any ideas on how to do this? I would appreciate any suggestions.

-Aleksei
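
P.S. To show what I mean by the TF-IDF representation, here is a minimal Python sketch. The three queries are made up for illustration, and I am assuming the plain unsmoothed weighting tf * log(N_docs / df); other TF-IDF variants will give different numbers.

```python
import math
from collections import Counter

# Hypothetical search queries standing in for the x4 column,
# one list of tokens per observation.
queries = [
    ["how", "grow", "tree"],
    ["how", "plant", "tree", "garden"],
    ["buy", "garden", "tools"],
]

vocab = sorted({tok for q in queries for tok in q})         # N distinct tokens
n_docs = len(queries)
doc_freq = Counter(tok for q in queries for tok in set(q))  # queries containing each token

def tfidf_vector(tokens):
    """Return one weight per vocabulary token: tf * log(n_docs / df)."""
    counts = Counter(tokens)
    return [
        (counts[tok] / len(tokens)) * math.log(n_docs / doc_freq[tok])
        for tok in vocab
    ]

x4_tfidf = [tfidf_vector(q) for q in queries]
for row in x4_tfidf:
    print([round(v, 4) for v in row])
```

Each observation comes out as a vector with N = len(vocab) elements, so I am still left with a vector rather than one value per observation.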