Right, I was also thinking about that, but since I have a few thousand unique words, I'm not quite sure how it will work.
I just posted my question with a more detailed description here:
http://stats.stackexchange.com/questions/25355/multi-value-categorical-attributes-how-r

Really interesting case :)

Thank you,
-Alex

________________________________
From: Jessica Streicher [j.streic...@micromata.de]
Sent: 27 March 2012 15:24
To: Alekseiy Beloshitskiy
Cc: r-help@r-project.org
Subject: Re: [R] normalization of multi-value string variable

Hm.. so what you need is either

- one new feature for each activity, with a binary value, e.g.:

  cust_id , cycling, swimming, cooking
  1001    , 1      , 0       , 1

- one new feature whose value corresponds to a particular combination of activities, so if you had just the three activities you would have 2^3 possible values. I'm not sure how useful that would be for the classification, though.

(I would need to think about how to compute this; I'm new to R as well. I would probably just iterate over the data.)

If you make one feature per activity and end up with too many to compute the SVM properly, you might try to reduce them by other methods. PCA comes to mind, for example, though I have never used it on "binary" data before.

On 27.03.2012 at 11:34, Alekseiy Beloshitskiy wrote:

Thank you so much, Jessica,

The particular difficulty in my case is that I have a very detailed variable 'Interests', which may have several thousand possible values. Usually each customer has 3-10 different interests. For example:

customer_id|...|interests
10000001   |...| cycling, swimming, cooking
10000002   |...| cooking, singing, dancing

The total number of possible distinct values is several thousand. I'm curious how to use these interests in an SVM (represent them as a vector of real numbers with several thousand elements?). If you have any ideas, please let me know.
Thank you,
-Alex

________________________________
From: Jessica Streicher [j.streic...@micromata.de]
Sent: 27 March 2012 11:18
To: Alekseiy Beloshitskiy
Subject: Re: [R] normalization of multi-value string variable

Well, I'm not sure what you mean by scaling and normalizing strings, but if you want to represent the interests as numbers, you can do something like this:

n <- seq(1, length(unique(my_strings)))[factor(my_strings)]

On 26.03.2012 at 18:50, Alekseiy Beloshitskiy wrote:

Hi All,

I need to normalize/scale a string variable that represents the interests of customers (e.g., 'cycling, rollerblading, swimming', etc.). Does anybody know how to do this? I then want to use it along with other numeric variables for SVM classification.

I would appreciate any advice.

-Alex

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
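
For readers of the archive, the "one binary feature per activity" encoding suggested in the thread might be sketched in base R roughly as follows. This is only an illustration under assumptions: the data frame `customers`, its two rows, and the names `interest_list`, `vocab`, and `indicators` are made up here and do not come from the original posts.

```r
# Toy data mirroring the example in the thread (illustrative values only).
customers <- data.frame(
  customer_id = c("10000001", "10000002"),
  interests   = c("cycling, swimming, cooking", "cooking, singing, dancing"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string and trim surrounding whitespace.
interest_list <- lapply(strsplit(customers$interests, ","), trimws)

# Vocabulary: every distinct interest, sorted for a stable column order.
vocab <- sort(unique(unlist(interest_list)))

# Build a 0/1 matrix: one row per customer, one column per interest.
# vapply() returns one column per customer, so transpose afterwards.
indicators <- t(vapply(
  interest_list,
  function(x) as.integer(vocab %in% x),
  integer(length(vocab))
))
dimnames(indicators) <- list(customers$customer_id, vocab)

print(indicators)
```

With several thousand distinct interests this dense matrix would be mostly zeros, so a sparse representation (for example from the Matrix package) may be worth considering. Whether a particular SVM implementation accepts sparse input is something to check in that package's documentation.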