[jira] [Comment Edited] (SPARK-10574) HashingTF should use MurmurHash3
[ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743922#comment-14743922 ]

Simeon Simeonov edited comment on SPARK-10574 at 9/14/15 5:55 PM:
--

[~josephkb] this makes sense. There are a few decisions to make:

# Which data types should we support out-of-the-box hashing for? {{HashingTF}} is typically used on strings, but the implementation is currently not restricted to that.
# If we choose to support them, how will hashing work for non-String types? For example, does a Double get converted to binary first (since its toString() representation may not be perfectly accurate), or do we call this a feature (a tiny bit of LSH to make reasoning more robust)? I believe our goal here should be a rock-solid, deterministic definition that works the same everywhere.
# How do we safely open the door for user-provided hashing functions? This could come at no extra cost through a simple trait, with the parameter specifying the hashing function being the name of a class that implements that trait, and with Spark providing a murmur and a "native" implementation. I tend to prefer this simple pattern to hard-coded decision logic that instantiates a finite set of classes. It would allow Spark users to experiment with xxHash, CityHash, minwise hashing and, if they so choose, even forms of locality-sensitive hashing. New hashes could be contributed to the project or become discoverable on spark-packages. It would also allow for analysis patterns down the road, e.g., a decorator class that analyzes the distribution of collisions as a side effect to help practitioners choose the right hashing function.

I'd appreciate your thoughts.

> HashingTF should use MurmurHash3
>
>                  Key: SPARK-10574
>                  URL: https://issues.apache.org/jira/browse/SPARK-10574
>              Project: Spark
>           Issue Type: Improvement
>           Components: MLlib
>    Affects Versions: 1.5.0
>             Reporter: Simeon Simeonov
>             Priority: Critical
>               Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are
> two significant problems with this.
> First, per the [Scala documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for
> {{hashCode}}, the implementation is platform specific. This means that
> feature vectors created on one platform may be different than vectors created
> on another platform.
> This can create significant problems when a model
> trained offline is used in another environment for online prediction. The
> problem is made harder by the fact that, following a hashing transform,
> features lose human-tractable meaning, and a problem such as this may be
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings,
> exhibiting [200-500% higher collision rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46]
> than, for example,
> [MurmurHash3|http://www.scala-lang.org/api/2.10.4/#scala.util.hashing.MurmurHash3$],
> which is also included in the standard Scala libraries and is the hashing
> choice of fast learners such as Vowpal Wabbit, scikit-learn and others. If
> Spark users apply {{HashingTF}} only to very short, dictio
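The trait-based approach from point 3 of the comment above could look roughly like the following; this is a minimal sketch, not Spark's actual API — {{FeatureHasher}}, {{MurmurFeatureHasher}}, and {{NativeFeatureHasher}} are hypothetical names:

```scala
import scala.util.hashing.MurmurHash3

// Hypothetical trait: users would name an implementing class via a string
// parameter, and Spark would ship a murmur and a "native" implementation.
trait FeatureHasher extends Serializable {
  def hash(term: Any): Int
}

// Deterministic across platforms: MurmurHash3 over the term's string form,
// with the seed as a plain constructor parameter.
class MurmurFeatureHasher(seed: Int = 42) extends FeatureHasher {
  def hash(term: Any): Int = MurmurHash3.stringHash(term.toString, seed)
}

// Current behavior: Scala's native ##, which is platform specific
// for some types.
class NativeFeatureHasher extends FeatureHasher {
  def hash(term: Any): Int = term.##
}

// Mapping a term to a bucket, as HashingTF does internally.
def indexOf(hasher: FeatureHasher, term: Any, numFeatures: Int): Int = {
  val raw = hasher.hash(term) % numFeatures
  if (raw < 0) raw + numFeatures else raw // keep the index non-negative
}
```

With this shape, the collision-analyzing decorator mentioned above is just another {{FeatureHasher}} wrapping an inner one.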
[jira] [Comment Edited] (SPARK-10574) HashingTF should use MurmurHash3
[ https://issues.apache.org/jira/browse/SPARK-10574?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15248627#comment-15248627 ]

Simeon Simeonov edited comment on SPARK-10574 at 4/19/16 9:05 PM:
--

[~josephkb] I agree that it would be an improvement. The issue I see with the current patch is that it would be an incompatible API change in the future (specifying hashing functions as objects and not by name). If we make just this one change, everything else can be handled with no API changes: e.g., seeds are just constructor parameters or closure variables available to the hashing function, and collision detection is just decoration. That's my practical argument related to MLlib.

Beyond that, there are multiple arguments related to the usability, testability and maintainability of the Spark codebase, which has high code-change velocity from a large number of committers, which contributes to a high issue rate. The simplest way to battle this is one design decision at a time. The PR hard-codes what is essentially a strategy pattern in the implementation of an object. It conflates responsibilities. It introduces branching, which makes testing and documentation more complicated. If hashing functions were externalized, they could be trivially tested. If {{HashingTF}} took a {{Function1[Any, Int]}} as input, it could also be tested much more simply with any function. The behavior and the APIs become simpler to document because they do one thing. Etc.

Perhaps I'm only seeing the benefits of externalizing the hashing strategy and missing the complexity in what I propose? We have ample examples of Spark APIs using functions as inputs, so there are standard ways to handle this across languages. We don't need a custom trait if we stick to {{Any}} as the hashing function input. What else could be a problem?

> HashingTF should use MurmurHash3
>
>                  Key: SPARK-10574
>                  URL: https://issues.apache.org/jira/browse/SPARK-10574
>              Project: Spark
>           Issue Type: Sub-task
>           Components: MLlib
>    Affects Versions: 1.5.0
>             Reporter: Simeon Simeonov
>             Assignee: Yanbo Liang
>               Labels: HashingTF, hashing, mllib
>
> {{HashingTF}} uses the Scala native hashing {{##}} implementation. There are
> two significant problems with this.
> First, per the [Scala documentation|http://www.scala-lang.org/api/2.10.4/index.html#scala.Any] for
> {{hashCode}}, the implementation is platform specific. This means that
> feature vectors created on one platform may be different than vectors created
> on another platform. This can create significant problems when a model
> trained offline is used in another environment for online prediction. The
> problem is made harder by the fact that, following a hashing transform,
> features lose human-tractable meaning, and a problem such as this may be
> extremely difficult to track down.
> Second, the native Scala hashing function performs badly on longer strings,
> exhibiting [200-500% higher collision rates|https://gist.github.com/ssimeonov/eb12fcda75615e4a8d46]
> than, for example,
> [MurmurHash3|http://www.
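The {{Function1[Any, Int]}} design argued for above — seed as a closure variable, collision detection as decoration — can be sketched as follows; this is an illustration of the proposal, not the actual PR, and {{SimpleHashingTF}}, {{murmur}} and {{withCollisionTracking}} are hypothetical names:

```scala
import scala.collection.mutable
import scala.util.hashing.MurmurHash3

// A HashingTF-like transformer that takes a plain Any => Int.
class SimpleHashingTF(numFeatures: Int, hash: Any => Int) {
  def indexOf(term: Any): Int = {
    val raw = hash(term) % numFeatures
    if (raw < 0) raw + numFeatures else raw
  }
  // Term-frequency vector as a sparse map from bucket index to count.
  def transform(doc: Iterable[Any]): Map[Int, Double] =
    doc.groupBy(indexOf).map { case (i, ts) => (i, ts.size.toDouble) }
}

// Seeds need no API surface: the seed is just a captured variable.
def murmur(seed: Int): Any => Int =
  (term: Any) => MurmurHash3.stringHash(term.toString, seed)

// Collision detection as decoration: record which terms produced each
// raw hash as a side effect, without touching SimpleHashingTF.
def withCollisionTracking(
    hash: Any => Int,
    seen: mutable.Map[Int, mutable.Set[Any]]): Any => Int =
  (term: Any) => {
    val h = hash(term)
    seen.getOrElseUpdate(h, mutable.Set.empty) += term
    h
  }
```

Usage would be, e.g., `new SimpleHashingTF(1 << 18, withCollisionTracking(murmur(seed = 42), seen))`; after transforming a corpus, any entry in `seen` whose set holds more than one term is a raw-hash collision.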