[jira] [Comment Edited] (SPARK-10574) HashingTF should use MurmurHash3

2016-04-19 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15248627#comment-15248627
 ] 

Simeon Simeonov edited comment on SPARK-10574 at 4/19/16 9:05 PM:
--

[~josephkb] I agree that it would be an improvement. The issue I see with the
current patch is that it locks in an API that would require an incompatible
change later (moving from specifying hashing functions by name to specifying
them as objects). If we make just this one change now, everything else can be
handled with no API changes: seeds are just constructor parameters or closure
variables available to the hashing function, and collision detection is just
decoration.
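As a rough illustration of the seed point (the names below are hypothetical and
not part of the patch), a seeded hashing function is just a closure over the
seed:

{code:scala}
import scala.util.hashing.MurmurHash3

// Hypothetical sketch: the seed is an ordinary closure variable; the hashing
// function itself stays a plain Any => Int.
def seededMurmur(seed: Int): Any => Int = {
  case s: String => MurmurHash3.stringHash(s, seed)
  case other     => MurmurHash3.stringHash(other.toString, seed)
}

val hash42: Any => Int = seededMurmur(42)
{code}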

That's my practical argument related to MLlib. 

Beyond that, there are multiple arguments related to the usability, testability
and maintainability of the Spark codebase, which has high code-change velocity
from a large number of committers and, as a result, a high issue rate. The
simplest way to battle this is one design decision at a time. The PR hard-codes
what is essentially a strategy pattern inside the implementation of an object.
It conflates responsibilities, and it introduces branching that makes testing
and documentation more complicated. If hashing functions were externalized,
they could be tested trivially. If {{HashingTF}} took a {{Function1[Any, Int]}}
as input, it could also be tested much more simply with any function. The
behavior and the APIs become simpler to document because they each do one
thing. Etc.
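For example, a minimal sketch of that shape (a hypothetical, simplified class,
not Spark's {{HashingTF}}):

{code:scala}
// Hypothetical, simplified sketch (SimpleHashingTF is not Spark's class): the
// hashing strategy is an ordinary function, so any predictable function can
// be injected in tests.
class SimpleHashingTF(numFeatures: Int, hashFunction: Any => Int) {
  def indexOf(term: Any): Int = {
    val h = hashFunction(term)
    ((h % numFeatures) + numFeatures) % numFeatures  // non-negative bucket index
  }
}

// A test can pin down behavior with a trivial strategy:
val tf = new SimpleHashingTF(16, _.hashCode)
assert(tf.indexOf("spark") == (("spark".hashCode % 16) + 16) % 16)
{code}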

Perhaps I'm only seeing the benefits of externalizing the hashing strategy and
missing the complexity in what I propose? We have ample examples of Spark APIs
using functions as inputs, so there are standard ways to handle this across
languages. We don't need a custom trait if we stick to {{Any}} as the hashing
function input. What else could be a problem?



> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Assignee: Yanbo Liang
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> 

[jira] [Comment Edited] (SPARK-10574) HashingTF should use MurmurHash3

2015-09-14 Thread Simeon Simeonov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14743922#comment-14743922
 ] 

Simeon Simeonov edited comment on SPARK-10574 at 9/14/15 5:55 PM:
--

[~josephkb] this makes sense. There are a few decisions to make:

# Which data types should we support out-of-the-box hashing for? {{HashingTF}}
is typically used on strings, but the implementation is currently not
restricted to that.
# If we choose to support them, how will the hashing happen for non-String
types? For example, does a {{Double}} get converted to binary first (since its
{{toString()}} representation may not be perfectly accurate), or do we call
this a feature (a tiny bit of LSH to make reasoning more robust)? I believe our
goal here should be to have a rock-solid, deterministic definition that works
the same everywhere.
# How do we safely open the door for user-provided hashing functions? This
could come at no extra cost through a simple trait: the parameter specifying
the hashing function would be the name of a class implementing that trait, and
Spark would provide a murmur and a "native" implementation (see the sketch
after this list). I tend to prefer this simple pattern to hard-coded decision
logic that instantiates a finite set of classes. It would allow Spark users to
experiment with xxHash, CityHash, minwise hashing and, if they so choose, even
forms of locality-sensitive hashing. New hashes could be contributed to the
project or become discoverable on spark-packages. It would also allow for
analysis patterns down the road, e.g., a decorator class that analyzes the
distribution of collisions as a side effect to help practitioners choose the
right hashing function.
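
A minimal sketch of what points 2 and 3 could look like (all names below are
hypothetical; nothing here exists in Spark):

{code:scala}
import java.nio.ByteBuffer
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// Hypothetical trait for pluggable, deterministic term hashing.
trait TermHasher extends Serializable {
  def hash(term: Any): Int
}

// Current behavior: Scala's native ## hashing.
class NativeHasher extends TermHasher {
  def hash(term: Any): Int = term.##
}

// Murmur-based hashing; a Double is hashed via its exact bit pattern rather
// than its toString() representation, so the result is fully deterministic.
class MurmurHasher(seed: Int = 42) extends TermHasher {
  def hash(term: Any): Int = term match {
    case s: String => MurmurHash3.stringHash(s, seed)
    case d: Double =>
      MurmurHash3.bytesHash(ByteBuffer.allocate(8).putDouble(d).array(), seed)
    case other     => MurmurHash3.stringHash(other.toString, seed)
  }
}

// Decorator that records bucket contents as a side effect, so collision
// behavior can be analyzed without touching the underlying hasher.
class CollisionAnalyzingHasher(underlying: TermHasher, numFeatures: Int)
    extends TermHasher {
  private val buckets = mutable.Map.empty[Int, mutable.Set[Any]]
  def hash(term: Any): Int = {
    val bucket = ((underlying.hash(term) % numFeatures) + numFeatures) % numFeatures
    buckets.getOrElseUpdate(bucket, mutable.Set.empty[Any]) += term
    bucket
  }
  def collisionCount: Int = buckets.valuesIterator.map(s => (s.size - 1).max(0)).sum
}
{code}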

I'd appreciate your thoughts.





> HashingTF should use MurmurHash3
> 
>
> Key: SPARK-10574
> URL: https://issues.apache.org/jira/browse/SPARK-10574
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.5.0
>Reporter: Simeon Simeonov
>Priority: Critical
>  Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are 
> two significant problems with this.
> First, per the [Scala 
> documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for 
> {{hashCode}}, the implementation is platform specific. This means that 
> feature vectors created on one platform may be different than vectors created 
> on another platform. This can create significant problems when a model 
> trained offline is used in another environment for online prediction. The 
> problem is made harder by the fact that following a hashing transform 
> features lose human-tractable meaning and a problem such as this may be 
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings, 
> exhibiting [200-500% higher collision 
> rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46] than, for 
> example, 
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$]
>  which is also included in the standard Scala libraries and is the hashing 
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If 
> Spark users apply {{HashingTF}} only to very short, dictionary-like