You're right, Sean: the implementation currently depends on each runtime's native hash code, so the results may differ. I opened a JIRA for switching to MurmurHash3, which should then be consistent across platforms and languages (as well as more performant). It turned out to duplicate https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-10574, which is the active JIRA.
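To illustrate why a fixed hash function gives the same index everywhere, here is a minimal pure-Python sketch of 32-bit MurmurHash3 (the x86_32 variant) applied to the UTF-8 bytes of a term. This is not Spark's implementation: the names murmur3_32 and feature_index, the default seed of 0, and the choice to hash UTF-8 bytes are all assumptions made for the example.

```python
MASK32 = 0xFFFFFFFF


def _rotl32(x: int, r: int) -> int:
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK32


def _mix_block(k: int) -> int:
    """Scramble one (up to) 4-byte block before folding it into the state."""
    k = (k * 0xCC9E2D51) & MASK32
    k = _rotl32(k, 15)
    return (k * 0x1B873593) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """MurmurHash3 x86_32 of `data`, returned as an unsigned 32-bit int."""
    h = seed & MASK32
    n = len(data)
    body_len = n - (n % 4)

    # Body: consume the input four bytes at a time, little-endian.
    for i in range(0, body_len, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        h ^= _mix_block(k)
        h = _rotl32(h, 13)
        h = (h * 5 + 0xE6546B64) & MASK32

    # Tail: the remaining 1 to 3 bytes, if any.
    tail = data[body_len:]
    if tail:
        h ^= _mix_block(int.from_bytes(tail, "little"))

    # Finalization: fold in the length and avalanche the bits.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def feature_index(term: str, num_features: int, seed: int = 0) -> int:
    """Hypothetical helper: map a term to a bucket in [0, num_features)."""
    # Hashing the UTF-8 bytes (an assumption here) avoids any dependence
    # on how a given runtime represents strings internally.
    return murmur3_32(term.encode("utf-8"), seed) % num_features


if __name__ == "__main__":
    for term in ["spark", "hashing", "tf"]:
        print(term, feature_index(term, 1 << 18))
```

Because the function only depends on the input bytes and the seed, any other language that implements the same algorithm over the same bytes would land on the same bucket, which is not the case for Python's built-in hash() or Java's String.hashCode().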
It's also odd (legacy, I think) that the Python version has its own implementation rather than calling into Java. That should probably be changed as well.

On Thu, 7 Apr 2016 at 17:59, Sean Owen <so...@cloudera.com> wrote:

> Let's say I use HashingTF in my Pipeline to hash a string feature.
> This is available in Python and Scala, but they hash strings to
> different values since both use their respective runtime's native hash
> implementation. This means that I create different feature vectors for
> the same input. While I can load/store something like a
> NaiveBayesModel across the two languages successfully, it seems like
> the hashing part doesn't translate.
>
> Is that accurate, or, have I completely missed a way to get the same
> hashing for the same input across languages?