I should point out that the "ml" version of HashingTF does call into Java,
so its output will be consistent across Python and Java.

It's the "mllib" version in PySpark that implements its own version using
Python's "hash" function (while Java uses Object.hashCode).
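
To make the mismatch concrete, here is a minimal sketch in plain Python (no
Spark needed). It is not Spark's actual code: the function names are made up
for illustration, and it assumes each side simply reduces its native string
hash modulo the number of features. The first function reproduces the
documented java.lang.String.hashCode algorithm; the second uses Python's
builtin hash.

```python
def java_string_hash_code(s: str) -> int:
    """Emulate java.lang.String.hashCode: h = 31*h + c over UTF-16 code units,
    with 32-bit signed integer overflow."""
    data = s.encode("utf-16-be")  # Java strings are sequences of UTF-16 units
    h = 0
    for i in range(0, len(data), 2):
        unit = (data[i] << 8) | data[i + 1]
        h = (31 * h + unit) & 0xFFFFFFFF  # keep only the low 32 bits
    # Reinterpret the unsigned 32-bit value as a signed Java int.
    return h - (1 << 32) if h >= (1 << 31) else h


def jvm_style_index(term: str, num_features: int) -> int:
    """Hypothetical helper: bucket index from the Java-style hash (assumption:
    non-negative modulo of hashCode)."""
    # Python's % already returns a non-negative result for a positive modulus.
    return java_string_hash_code(term) % num_features


def python_style_index(term: str, num_features: int) -> int:
    """Hypothetical helper: bucket index from Python's builtin hash."""
    # Note: on Python 3, str hashes are randomized per process unless
    # PYTHONHASHSEED is set, so this value can even change between runs.
    return hash(term) % num_features


if __name__ == "__main__":
    num_features = 1 << 20
    for term in ["spark", "hashing", "tf"]:
        print(term,
              jvm_style_index(term, num_features),
              python_style_index(term, num_features))
```

The two index columns will generally disagree for the same term, which is
why feature vectors built on one side do not line up with a model trained on
the other.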

On Thu, 7 Apr 2016 at 18:19 Nick Pentreath <nick.pentre...@gmail.com> wrote:

> You're right Sean, the implementation currently depends on the native
> hash code, so results may differ. I opened a JIRA for using murmurhash3
> (it turned out to duplicate the active one,
> https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-10574),
> which should then be consistent across platforms and languages (as well
> as more performant).
>
> It's also odd (legacy I think) that the Python version has its own
> implementation rather than calling into Java. That should also be changed
> probably.
> On Thu, 7 Apr 2016 at 17:59, Sean Owen <so...@cloudera.com> wrote:
>
>> Let's say I use HashingTF in my Pipeline to hash a string feature.
>> This is available in Python and Scala, but they hash strings to
>> different values since both use their respective runtime's native hash
>> implementation. This means that I create different feature vectors for
>> the same input. While I can load/store something like a
>> NaiveBayesModel across the two languages successfully, it seems like
>> the hashing part doesn't translate.
>>
>> Is that accurate, or, have I completely missed a way to get the same
>> hashing for the same input across languages?
>>
