I'm trying to understand the intuition behind the featurize method that Aaron used in one of his demos. I believe these features are only useful for detecting the character set (i.e., the language used).
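Can someone help?

    // Assumption: Vector and Vectors are Spark MLlib's linear algebra types.
    import org.apache.spark.mllib.linalg.{Vector, Vectors}

    def featurize(s: String): Vector = {
      val n = 1000                          // number of hash buckets
      val result = new Array[Double](n)
      val bigrams = s.sliding(2).toArray    // overlapping 2-character substrings
      // Hash each bigram into a bucket and accumulate its relative frequency.
      for (h <- bigrams.map(_.hashCode % n)) {
        result(h) += 1.0 / bigrams.length
      }
      // Keep only the non-zero buckets as (index, value) pairs.
      Vectors.sparse(n, result.zipWithIndex.filter(_._1 != 0).map(_.swap))
    }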
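
To make my reading concrete, here is a Spark-free sketch of what I think the function computes: each character bigram is hashed into one of n buckets, and the bucket stores the share of the string's bigrams that landed there. The names BigramSketch and bigramFeatures are my own, not from the demo, and I use Math.floorMod instead of % only as a guard against negative indices.

```scala
// Hypothetical, Spark-free sketch of the bigram hashing idea (not the demo's code).
object BigramSketch {
  // Map each character bigram to a bucket in [0, n) and record the
  // fraction of the string's bigrams that fell into that bucket.
  def bigramFeatures(s: String, n: Int = 1000): Map[Int, Double] = {
    val bigrams = s.sliding(2).toVector
    bigrams
      .groupBy(b => Math.floorMod(b.hashCode, n)) // floorMod keeps the index non-negative
      .map { case (bucket, hits) => bucket -> hits.length.toDouble / bigrams.length }
  }

  def main(args: Array[String]): Unit = {
    // Print the non-zero buckets for two strings in different scripts.
    println(bigramFeatures("hello").toSeq.sorted)
    println(bigramFeatures("привет").toSeq.sorted)
  }
}
```

For "hello" the bigrams are "he", "el", "ll" and "lo", so four buckets each get 0.25 and the values sum to 1. If that reading is right, text in different scripts or languages would tend to fill different buckets, which is why I suspect the vector mostly encodes the character set.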