Right,
I was also thinking about that, but since I have a few thousand unique words, 
I'm not quite sure how it will work.

I have just posted my question with a more detailed description here:
http://stats.stackexchange.com/questions/25355/multi-value-categorical-attributes-how-r

Really interesting case :)

Thank you,
-Alex
________________________________
From: Jessica Streicher [j.streic...@micromata.de]
Sent: 27 March 2012 15:24
To: Alekseiy Beloshitskiy
Cc: r-help@r-project.org
Subject: Re: [R] normalization of multi-value string variable

Hm.. so what you need is either

- one new feature for each activity that has a binary value
e.g.:
cust_id , cycling , swimming , cooking
1001    , 1       , 0        , 1

- one new feature whose value corresponds to a particular combination of 
activities
so if you had just the three activities you would have 2^3 = 8 possible values.
I'm not sure how useful that would be for the classification, though.

(I would need to think about how to compute this; I'm new to R as well. I would 
probably just iterate over the data.)
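
A minimal sketch of the first option (one binary column per activity) that 
avoids an explicit loop. It assumes the interests arrive as one 
comma-separated string per customer; the data frame `customers`, its column 
names and its values are made up for illustration.

```r
# Toy data, made up for illustration only.
customers <- data.frame(
  cust_id   = c(1001, 1002),
  interests = c("cycling, swimming, cooking", "cooking, singing, dancing"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string and strip surrounding whitespace.
interest_list <- lapply(strsplit(customers$interests, ","), trimws)

# All distinct activities, sorted so the column order is reproducible.
activities <- sort(unique(unlist(interest_list)))

# One row per customer, one 0/1 column per activity.
# (Assumes there is more than one distinct activity.)
indicators <- t(vapply(
  interest_list,
  function(x) as.integer(activities %in% x),
  integer(length(activities))
))
colnames(indicators) <- activities

# Attach the indicator columns to the customer ids.
encoded <- cbind(customers["cust_id"], indicators)
print(encoded)
```

The resulting 0/1 columns can be placed next to the other numeric variables. 
With several thousand activities the dense matrix becomes very wide, so this 
is only meant to show the idea.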

If you make one feature per activity and end up with too many to compute the 
SVM properly, you might try to reduce them by other methods. PCA comes to 
mind, for example, though I have never used it on "binary" data before.
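
A rough sketch of what the PCA step could look like with base R's `prcomp`. 
The indicator matrix here is random stand-in data, and the number of retained 
components `k` is an arbitrary assumption; whether PCA is a good fit for 0/1 
data is exactly the open question above.

```r
set.seed(1)  # made-up data, only so the example is reproducible

# Stand-in for a customers x activities 0/1 matrix (200 customers, 30 activities).
indicators <- matrix(rbinom(200 * 30, size = 1, prob = 0.15),
                     nrow = 200, ncol = 30)
colnames(indicators) <- paste0("activity_", seq_len(ncol(indicators)))

# Principal components of the centred (unscaled) indicator columns.
pca <- prcomp(indicators, center = TRUE, scale. = FALSE)

# Keep the first k components as the reduced feature set; k = 5 is arbitrary.
k <- 5
reduced <- pca$x[, seq_len(k), drop = FALSE]

# Share of total variance retained by those k components.
explained <- sum(pca$sdev[seq_len(k)]^2) / sum(pca$sdev^2)
print(dim(reduced))
print(explained)
```

In practice `k` would be chosen by looking at `explained` (or at 
classification accuracy) rather than fixed in advance.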


On 27.03.2012 at 11:34, Alekseiy Beloshitskiy wrote:

Thank you so much, Jessica,

The difficulty in my case is that I have a very detailed variable 'Interests' 
which may take several thousand possible values. Usually each customer has 
3-10 different interests. For example:
customer_id|...|interests
10000001   |...| cycling, swimming, cooking
10000002   |...| cooking, singing, dancing

The total number of possible distinct values is several thousand. I'm curious 
how to use these interests in an SVM (represent them as a vector of real 
numbers with several thousand elements?).
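
One way such a vector could be stored is as a sparse 0/1 matrix, since each 
customer has only 3-10 non-zero entries out of several thousand. This is a 
sketch under assumptions: it uses `sparseMatrix` from the Matrix package, and 
the names `interests`, `vocabulary` and `x_sparse` are made up here. I believe 
`svm()` in e1071 accepts Matrix sparse matrices, but that should be checked 
against its documentation.

```r
library(Matrix)

# The two example customers from above; ids kept as strings for row names.
customer_id <- c("10000001", "10000002")
interests   <- c("cycling, swimming, cooking", "cooking, singing, dancing")

interest_list <- lapply(strsplit(interests, ","), trimws)
vocabulary    <- sort(unique(unlist(interest_list)))

# Row index: customer position, repeated once per listed interest.
i <- rep(seq_along(interest_list), lengths(interest_list))
# Column index: position of each interest in the vocabulary.
j <- match(unlist(interest_list), vocabulary)

# Only the non-zero cells are stored. Note that an interest listed twice
# for the same customer would be summed to 2 rather than kept at 1.
x_sparse <- sparseMatrix(
  i = i, j = j, x = 1,
  dims = c(length(interest_list), length(vocabulary)),
  dimnames = list(customer_id, vocabulary)
)
print(x_sparse)
```

Memory use then grows with the number of (customer, interest) pairs rather 
than with customers times vocabulary size.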

If you have any ideas please let me know.


Thank you,
-Alex

________________________________
From: Jessica Streicher [j.streic...@micromata.de]
Sent: 27 March 2012 11:18
To: Alekseiy Beloshitskiy
Subject: Re: [R] normalization of multi-value string variable

Well, I'm not sure what you mean by scaling and normalizing strings, but if you 
want to represent the interests as numbers, you can do something like this:

n<-seq(1,length(unique(my_strings)))[factor(my_strings)]
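
For illustration, here is what the expression above produces on a small 
made-up vector (the contents of `my_strings` are an assumption), next to the 
shorter `as.integer(factor(...))` form, which gives the same codes. The codes 
follow the sorted order of the distinct strings, so their numeric order and 
spacing carry no meaning of their own.

```r
# Made-up example values, one interest per element.
my_strings <- c("cycling", "swimming", "cooking", "cycling")

# The expression from above: index 1..k by the factor's integer codes.
n <- seq(1, length(unique(my_strings)))[factor(my_strings)]

# Equivalent and more direct: take the factor's integer codes.
n_direct <- as.integer(factor(my_strings))

# Factor levels are sorted: cooking = 1, cycling = 2, swimming = 3.
print(n)         # 2 3 1 2
print(n_direct)  # 2 3 1 2
```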


On 26.03.2012 at 18:50, Alekseiy Beloshitskiy wrote:

Hi All,

I need to normalize/scale a string variable which represents the interests of 
customers (e.g., 'cycling, rollerblading, swimming', etc.).

Does anybody know how to do this? I then want to use it along with other 
numeric variables for SVM classification.

I would appreciate any advice.

-Alex

[[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


