You're right, Sean: the implementation currently depends on each
runtime's native hash code, so the results may differ between languages.
I opened a JIRA for switching to MurmurHash3, which should be consistent
across platforms and languages (and faster as well). It turned out to
duplicate SPARK-10574, which is the active issue:
https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-10574
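
To illustrate why a fixed hash function helps, here is a rough sketch in
Python of the standard 32-bit MurmurHash3 applied to the UTF-8 bytes of
each term. This is not Spark's code; the names murmur3_32, term_index and
hashing_tf, the seed of 0 and the modulo bucketing are all assumptions
made for the example. The point is only that any language running the
same algorithm over the same bytes gets the same index.

```python
from collections import Counter

MASK32 = 0xFFFFFFFF


def _mix_k(k: int) -> int:
    # Per-block scramble used by MurmurHash3 (x86, 32-bit).
    k = (k * 0xCC9E2D51) & MASK32
    k = ((k << 15) | (k >> 17)) & MASK32  # rotate left by 15
    return (k * 0x1B873593) & MASK32


def murmur3_32(data: bytes, seed: int = 0) -> int:
    """Standard MurmurHash3 x86_32; returns an unsigned 32-bit int."""
    h = seed & MASK32
    n = len(data)
    body_len = n - (n % 4)
    # Process the input in 4-byte little-endian blocks.
    for i in range(0, body_len, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        h ^= _mix_k(k)
        h = ((h << 13) | (h >> 19)) & MASK32  # rotate left by 13
        h = (h * 5 + 0xE6546B64) & MASK32
    # Mix in the remaining 1 to 3 bytes, if any.
    tail = data[body_len:]
    if tail:
        h ^= _mix_k(int.from_bytes(tail, "little"))
    # Finalization (avalanche) step.
    h ^= n
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & MASK32
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & MASK32
    h ^= h >> 16
    return h


def term_index(term: str, num_features: int) -> int:
    # Hypothetical helper: hash the UTF-8 bytes, not a runtime hash code.
    return murmur3_32(term.encode("utf-8")) % num_features


def hashing_tf(terms, num_features: int = 1 << 20) -> dict:
    # Hypothetical helper: sparse term-frequency vector as {index: count}.
    return dict(Counter(term_index(t, num_features) for t in terms))


if __name__ == "__main__":
    print(hashing_tf(["spark", "hash", "spark"], num_features=16))
```

Because the input is an explicit byte encoding and the arithmetic is
fixed at 32 bits, a Scala or Java port of the same steps would produce
the same indices, which is not true of String.hashCode versus Python's
built-in hash().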

It's also odd (legacy, I think) that the Python version has its own
implementation rather than calling into the Java one. That should
probably be changed too.
On Thu, 7 Apr 2016 at 17:59, Sean Owen <so...@cloudera.com> wrote:

> Let's say I use HashingTF in my Pipeline to hash a string feature.
> This is available in Python and Scala, but they hash strings to
> different values since both use their respective runtime's native hash
> implementation. This means that I create different feature vectors for
> the same input. While I can load/store something like a
> NaiveBayesModel across the two languages successfully, it seems like
> the hashing part doesn't translate.
>
> Is that accurate, or, have I completely missed a way to get the same
> hashing for the same input across languages?
>
