-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/21618/#review43444
-----------------------------------------------------------



datafu-pig/src/main/java/datafu/pig/hash/Hasher.java
<https://reviews.apache.org/r/21618/#comment77597>

    I think something like below would be clearer.  Rewrite as you see fit :)  
The current string implies it only supports non-cryptographic hashes.
    
    -------
    
    Computes a hash value of a string and outputs it in hex.  The default 
constructor produces a general purpose, non-cryptographic strength hash of at 
least 32-bits.  Additional constructors are available to choose alternative 
hash functions.  The default constructor is equivalent to choosing 'good-32'.



datafu-pig/src/main/java/datafu/pig/hash/Hasher.java
<https://reviews.apache.org/r/21618/#comment77608>

    Maybe a cleaner way to do this is to create a protected getRandom() method 
that you derive from and override.  You can create this test UDF under the test 
directory and it should be usable within the pig script.  This would actually 
be a good pattern for us to follow elsewhere too.  We have a JIRA to improve 
testing of UDFs that rely on randomness.



datafu-pig/src/main/java/datafu/pig/hash/Hasher.java
<https://reviews.apache.org/r/21618/#comment77606>

    Add a javadoc for this method.  Also document which algorithms support a 
random seed.



datafu-pig/src/main/java/datafu/pig/hash/Hasher.java
<https://reviews.apache.org/r/21618/#comment77607>

    I think it's important that each backend task uses the same seed when the 
'rand' option is used.  Otherwise the same data will hash to different values.  
You can ensure this by setting the seed in the front end in the UDF context.  
We have a ContextualEvalFunc that makes this easier.  You can derive from this, 
then in getOutputSchema you can call getInstanceProperties() and put the seed 
value into the map.  To get the seed, within the exec method you can call 
getInstanceProperties() and lazy load the seed from it. See 
EmptyBagToNullFields for an example using this pattern.


- Matthew Hayes


On May 19, 2014, 11:12 p.m., Philip (flip) Kromer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/21618/
> -----------------------------------------------------------
> 
> (Updated May 19, 2014, 11:12 p.m.)
> 
> 
> Review request for DataFu and Matthew Hayes.
> 
> 
> Bugs: DATAFU-47
>     https://issues.apache.org/jira/browse/DATAFU-47
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> Accompanies DATAFU-47 https://issues.apache.org/jira/browse/DATAFU-47 -- make 
> sure to apply the patch from DATAFU-46 too first
> 
> Questions for reviewers:
> 
> * If we upgrade Guava, we'd get sip24 (a fast cryptographically secure hash), 
> crc32 and adler32 (occasionally useful checksums). I can put the update in as 
> another patch. Should we upgrade?
> * This UDF provides the same hashes as MD5 and SHA udfs. Should those be 
> deprecated in favor of this? I can add the binhex functionality so that 
> nothing is lost.
> * If there's a standard way to do the dependency injection of a fixed random 
> number generator for the tests please advise.
> 
> 
> Diffs
> -----
> 
>   datafu-pig/src/main/java/datafu/pig/hash/Hasher.java PRE-CREATION 
>   datafu-pig/src/test/java/datafu/test/pig/hash/HashTests.java 7ff8fb9 
> 
> Diff: https://reviews.apache.org/r/21618/diff/
> 
> 
> Testing
> -------
> 
>  ./gradlew :datafu-pig:test -Dtest.single=HashTests 
> 
> 
> Thanks,
> 
> Philip (flip) Kromer
> 
>

Reply via email to