Re: Creating a feature vector from text before using with MLLib

2014-10-01 Thread Xiangrui Meng
Yes, the "bigram" in that demo only has two characters, which could separate different character sets.

-Xiangrui

On Wed, Oct 1, 2014 at 2:54 PM, Liquan Pei wrote:
> The program computes hashing bi-gram frequency normalized by total number of
> bigrams then filter out zero values. hashing is a eff

Re: Creating a feature vector from text before using with MLLib

2014-10-01 Thread Liquan Pei
The program computes hashed bigram frequencies, normalized by the total number of bigrams, and then filters out zero values. Hashing is an effective trick for vectorizing features. Take a look at http://en.wikipedia.org/wiki/Feature_hashing

Liquan

On Wed, Oct 1, 2014 at 2:18 PM, Soumya Simanta wrote:
> I'm
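To make the description above concrete, here is a minimal sketch of hashed bigram frequencies in plain Scala. It is not the code from the demo: the object name `BigramHashing`, the helper `hashedBigramFrequencies`, the lowercasing step, and the non-negative modulo are all assumptions made for illustration. It returns sparse (index, frequency) pairs rather than an MLlib `Vector`, so that it runs without Spark on the classpath.

```scala
object BigramHashing {
  // Hypothetical helper (not from the demo): maps a string to sparse
  // (bucket index, normalized frequency) pairs using the hashing trick.
  def hashedBigramFrequencies(s: String, n: Int = 1000): Seq[(Int, Double)] = {
    val counts = new Array[Double](n)
    // Character bigrams; the length filter drops the short window that
    // sliding(2) yields for a one-character string.
    val bigrams = s.toLowerCase.sliding(2).filter(_.length == 2).toSeq
    for (b <- bigrams) {
      // hashCode can be negative, so force the bucket index into [0, n).
      val idx = ((b.hashCode % n) + n) % n
      counts(idx) += 1.0
    }
    val total = bigrams.size.toDouble
    if (total == 0) Seq.empty
    else
      // Keep only non-zero buckets and normalize by the bigram count.
      counts.zipWithIndex.collect { case (c, i) if c > 0 => (i, c / total) }.toSeq
  }
}
```

With Spark available, the resulting pairs could presumably be passed to `Vectors.sparse(n, pairs)` from `org.apache.spark.mllib.linalg` to obtain an MLlib vector. Distinct bigrams can collide in the same bucket; that is the usual trade-off of feature hashing for a fixed-size vector.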

Creating a feature vector from text before using with MLLib

2014-10-01 Thread Soumya Simanta
I'm trying to understand the intuition behind the featurization method that Aaron used in one of his demos. I believe these features will only work for detecting the character set (i.e., the language used). Can someone help?

def featurize(s: String): Vector = {
  val n = 1000
  val result = new Array[Doub