[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248514#comment-15248514
 ] 

Joseph K. Bradley commented on SPARK-10574:
-------------------------------------------

Copying comments from [~simeons] from the PR:
{quote}
When the "hashing trick" is used in practice, it is important to do things such 
as monitor, manage or randomize collisions. If there are problems, it is not 
uncommon to vary the hashing function. All this suggests that a hashing 
function should be treated as an object with a simple interface, perhaps as 
simple as Function1[Any, Int]. Collision monitoring can then be performed with 
a decorator with an accumulator. Collision management would be performed by 
varying the seed or adding salt. Collision randomization would be performed by 
varying the seed/salt with each run and/or running multiple models in 
production which are identical except for the different seed/salt used.

The hashing trick is very important in ML and quite... tricky... to get working 
well for complex, high-dimension spaces, which Spark is perfect for. An 
implementation that does not treat the hashing function as a first class object 
would substantially hinder MLlib's capabilities in practice.
{quote}
--> This initial PR should be a big improvement, even if we just use 
MurmurHash3 without varied seed/salts like you're suggesting.  This also seems 
acceptable for now since it's what scikit-learn does.  But later PRs could add 
further improvements.
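
To make the quoted suggestion concrete, here is a minimal sketch of a hashing function treated as a plain {{Any => Int}} object, with a seeded MurmurHash3 implementation and a collision-monitoring decorator. The names {{SeededMurmur}} and {{CollisionMonitor}} are hypothetical and not part of Spark or the PR. The decorator also assumes a local mutable map; in a real Spark job a distributed accumulator would take its place.

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Hypothetical names throughout; none of this is Spark API.
// A hashing function is just Any => Int, so it can be swapped or wrapped.
final class SeededMurmur(seed: Int) extends (Any => Int) {
  // Hash the string form of the term with an explicit seed (the "salt").
  def apply(term: Any): Int = MurmurHash3.stringHash(term.toString, seed)
}

// Decorator: maps hashes to buckets and records which distinct terms share a bucket.
// Assumption: a local mutable map stands in for a Spark accumulator.
final class CollisionMonitor(underlying: Any => Int, numFeatures: Int) extends (Any => Int) {
  private val seen = mutable.Map.empty[Int, mutable.Set[Any]]

  def apply(term: Any): Int = {
    val raw = underlying(term) % numFeatures
    val index = if (raw < 0) raw + numFeatures else raw // non-negative modulo
    seen.getOrElseUpdate(index, mutable.Set.empty[Any]) += term
    index
  }

  // Number of distinct terms that landed in an already occupied bucket.
  def collisions: Int = seen.valuesIterator.map(_.size - 1).sum
}
```

Varying the seed passed to {{SeededMurmur}} between runs, or between otherwise identical models, would give the collision randomization described in the quote.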

> HashingTF should use MurmurHash3
> --------------------------------
>
>                 Key: SPARK-10574
>                 URL: https://issues.apache.org/jira/browse/SPARK-10574
>             Project: Spark
>          Issue Type: Sub-task
>          Components: MLlib
>    Affects Versions: 1.5.0
>            Reporter: Simeon Simeonov
>            Assignee: Yanbo Liang
>              Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that, after a hashing transform, 
> features lose human-tractable meaning, so a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like strings 
> the hashing function choice will not be a big problem but why have an 
> implementation in MLlib with this limitation when there is a better 
> implementation readily available in the standard Scala library?
> Switching to MurmurHash3 solves both problems. If there is agreement that 
> this is a good change, I can prepare a PR. 
> Note that changing the hash function would mean that models saved with a 
> previous version would have to be re-trained. This introduces a problem 
> that's orthogonal to breaking changes in APIs: breaking changes related to 
> artifacts, e.g., a saved model, produced by a previous version. Is there a 
> policy or best practice currently in effect about this? If not, perhaps we 
> should come up with a few simple rules about how we communicate these in 
> release notes, etc.
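
For illustration, the change proposed in the description can be sketched as a term-to-bucket mapping that calls {{scala.util.hashing.MurmurHash3}} instead of {{##}}. This is a simplified stand-in, not the actual {{HashingTF}} code; the object name {{HashingSketch}} and its methods are hypothetical, and the default MurmurHash3 string seed is assumed.

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical sketch, not Spark's HashingTF implementation.
object HashingSketch {
  // Map a term to a bucket in [0, numFeatures) using MurmurHash3 rather than ##.
  def indexOf(term: String, numFeatures: Int): Int = {
    val raw = MurmurHash3.stringHash(term) % numFeatures
    if (raw < 0) raw + numFeatures else raw // keep the index non-negative
  }

  // Sparse term-frequency vector represented as bucket index -> count.
  def termFrequencies(terms: Seq[String], numFeatures: Int): Map[Int, Double] =
    terms.groupBy(indexOf(_, numFeatures)).map { case (i, ts) => i -> ts.size.toDouble }
}
```

Because the hash is computed by a fixed algorithm over the string's characters, the same term should map to the same bucket wherever the code runs, which is the portability property the description asks for.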


