[R] How do you scale variables which consist of tokens

Alekseiy Beloshitskiy Fri, 23 Mar 2012 11:45:01 -0700

Dear All,
Suppose you want to make a prediction using a range of variables, some of 
which are represented as sets of words (tokens). For example, take a 
training set:
x1,x2,..,x7, y
where y is the value to be predicted (regardless of the model used for the 
prediction), and, let's say:
x4 is a variable made up of the words from a Google search query (the number 
of words may differ between observations). For example:
x4=(how,grow,tree), which can also be represented in hashed form:
x4=(11111,22222,33333)
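
To make the representation concrete, here is a minimal base R sketch of 
variable-length token sets and a hashed form of them. The helper hash_token, 
its bucket count and the toy queries are all hypothetical; this is not the 
hashing scheme that produced the numbers above.

```r
# Hypothetical helper: map a token to an integer bucket with a simple
# polynomial rolling hash. The modulo is applied at every step so the
# intermediate values stay small and exact.
hash_token <- function(token, buckets = 100003) {
  Reduce(function(h, code) (h * 31 + code) %% buckets, utf8ToInt(token), 0)
}

# Toy data (assumed): each observation of x4 is a token set of any length.
x4 <- list(
  c("how", "grow", "tree"),
  c("buy", "garden", "tools", "online")
)

# Hashed form: one numeric vector per observation, same length as its token set.
x4_hashed <- lapply(x4, function(tokens) vapply(tokens, hash_token, numeric(1)))
```

Each element of x4_hashed still has as many entries as the query has words, 
so hashing alone does not give a fixed-length input for a model.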


I need to scale this variable (x4) to be able to use it in a model. I was 
thinking about scaling it with TF-IDF. That way I can represent each 
observation of x4 as a scaled vector with N elements, like:
x4=(0.0175105020782697,...0.019135397913606) //scaled with TF-IDF
However, I don't think that is enough (please correct me if I'm wrong), since 
I need x4 to be a single (integral) value for each observation to be able to 
use it in the model. I assume the result of scaling should look like:
x4=0.06789324432 //single value
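
As a rough illustration of the two shapes described above, here is a base R 
sketch that computes TF-IDF weights for a toy x4 and then collapses each row 
into one number. The toy queries, the unsmoothed log(N/df) form of IDF and the 
use of rowSums as the collapse are all assumptions made for the example. 
Summing the weights is only one possible way to get a single value, it 
discards which words occurred, and whether it is a sensible way is exactly 
the open question.

```r
# Toy data (assumed): each observation of x4 is a variable-length token set.
x4 <- list(
  c("how", "grow", "tree"),
  c("how", "plant", "tree", "garden"),
  c("buy", "garden", "tools")
)

vocab <- sort(unique(unlist(x4)))

# Term frequency: rows = observations, columns = vocabulary terms.
tf <- t(vapply(x4, function(tokens) {
  as.numeric(table(factor(tokens, levels = vocab))) / length(tokens)
}, numeric(length(vocab))))
colnames(tf) <- vocab

# Inverse document frequency: log(N / number of observations containing the term).
idf <- log(nrow(tf) / colSums(tf > 0))

# TF-IDF: one weight vector of length N = length(vocab) per observation.
tfidf <- sweep(tf, 2, idf, `*`)

# One possible collapse to a single value per observation (an assumption,
# not a recommendation): the sum of the TF-IDF weights in each row.
x4_scalar <- rowSums(tfidf)
```

Here tfidf has one row per observation and one column per distinct word, 
while x4_scalar has exactly one number per observation.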

Do you have any ideas how to do this?

I would appreciate any ideas.


-Aleksei


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
