[ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248627#comment-15248627 ]

Simeon Simeonov commented on SPARK-10574:
-----------------------------------------

[~josephkb] I agree that it would be an improvement. The issue I see with the 
current patch is that it would force an incompatible API change in the future 
(specifying hashing functions as objects rather than by name). If we make just 
this one change, everything else can be handled with no API changes: seeds 
become constructor parameters or closure variables available to the hashing 
function, and collision detection becomes simple decoration. 

That's my practical argument related to MLlib. 

Beyond that, there are multiple arguments related to the usability, testability 
and maintainability of the Spark codebase, which sees a high rate of code change 
from a large number of committers and, as a result, a high issue rate. The 
simplest way to combat this is one design decision at a time. The PR hard-codes 
what is essentially a strategy pattern into the implementation of an object. It 
conflates responsibilities. It introduces branching, which makes testing and 
documentation more complicated. If hashing functions were externalized, they 
could be tested trivially. If {{HashingTF}} took a {{Function1[Any, Int]}} as 
input, it could likewise be tested much more simply with any function. The APIs 
become simpler to document because they do one thing. Etc. 

Perhaps I'm only seeing the benefits of externalizing the hashing strategy and 
missing the complexity in what I propose? We have ample examples of Spark APIs 
using functions as inputs so there are standard ways to handle this across 
languages. We don't need a custom trait if we stick to {{Any}} as the hashing 
function input. What else could be a problem?
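To make the proposal concrete, here is a minimal sketch of what an externalized hashing strategy could look like. This is hypothetical code, not the actual Spark API: the class name {{ExternalizedHashingTF}} and its shape are invented for illustration; only {{MurmurHash3.stringHash}} is a real standard-library call.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical sketch: a HashingTF-like transformer that takes the
// hashing strategy as a plain Function1[Any, Int].
class ExternalizedHashingTF(numFeatures: Int, hash: Any => Int) {
  // Map a term to a bucket index in [0, numFeatures).
  def indexOf(term: Any): Int = {
    val h = hash(term)
    ((h % numFeatures) + numFeatures) % numFeatures // non-negative modulo
  }
}

// Seeds are just closure variables of the supplied function...
val murmurTF = new ExternalizedHashingTF(1 << 18, {
  case s: String => MurmurHash3.stringHash(s, seed = 42)
  case other     => other.##
})

// ...and the transformer is trivially testable with any function,
// e.g. an identity-like hash for deterministic unit tests.
val testTF = new ExternalizedHashingTF(16, _.asInstanceOf[Int])
```

Collision detection could then wrap the supplied function as a decorator without touching the transformer itself.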

> HashingTF should use MurmurHash3
> --------------------------------
>
>                 Key: SPARK-10574
>                 URL: https://issues.apache.org/jira/browse/SPARK-10574
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>    Affects Versions: 1.5.0
>            Reporter: Simeon Simeonov
>            Assignee: Yanbo Liang
>              Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may differ from vectors created on 
> another platform. This can cause significant problems when a model trained 
> offline is used in another environment for online prediction. The problem is 
> made harder by the fact that, after a hashing transform, features lose their 
> human-tractable meaning, so a problem such as this may be extremely difficult 
> to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$],
>  which is also included in the standard Scala library and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings, 
> the hashing function choice will not be a big problem, but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
