I'm trying to understand the intuition behind the featurize method that
Aaron used in one of his demos. I believe these features will only work for
detecting the character set (i.e., the language used).

Can someone help?


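// Assuming the MLlib (RDD-based) linear algebra package is the one in use.
import org.apache.spark.mllib.linalg.{Vector, Vectors}

def featurize(s: String): Vector = {
  val n = 1000                         // number of hash buckets (feature dimension)
  val result = new Array[Double](n)
  val bigrams = s.sliding(2).toArray   // overlapping character pairs

  // Hash each bigram to a bucket and accumulate its relative frequency.
  for (h <- bigrams.map(_.hashCode % n)) {
    result(h) += 1.0 / bigrams.length
  }

  // Keep only the non-zero buckets as (index, value) pairs.
  Vectors.sparse(n, result.zipWithIndex.filter(_._1 != 0).map(_.swap))
}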
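
To make the question concrete, here is a Spark-free sketch of how I read
the method: every overlapping pair of characters is hashed into one of n
buckets, and each bucket ends up holding the fraction of the string's
bigrams that landed in it. BigramSketch and bigramHistogram are names I
made up for illustration; they are not part of the demo.

```scala
// Hypothetical helper, not from the demo: same hashing as featurize, no Spark.
object BigramSketch {
  def bigramHistogram(s: String, n: Int = 1000): Map[Int, Double] = {
    val bigrams = s.sliding(2).toArray          // "abab" -> "ab", "ba", "ab"
    bigrams
      .groupBy(b => b.hashCode % n)             // bucket index for each bigram
      .map { case (bucket, hits) =>
        // Relative frequency of the bigrams that fell into this bucket.
        bucket -> hits.length.toDouble / bigrams.length
      }
  }
}
```

For "abab" this gives two non-zero buckets, 105 (for "ab", weight 2/3) and
135 (for "ba", weight 1/3), and the weights sum to 1. Since there are only
n buckets, distinct bigrams can collide and share one.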
