-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25049/#review51486
-----------------------------------------------------------



datafu-pig/src/main/java/datafu/pig/hash/SimHash.java
<https://reviews.apache.org/r/25049/#comment89887>

    Could you please share the tutorial that describes the algorithm? Are there 
any other SimHash algorithms we could also support?



datafu-pig/src/main/java/datafu/pig/hash/SimHash.java
<https://reviews.apache.org/r/25049/#comment89842>

    It seems that only tri-grams are generated here rather than n-grams: the 
input parameter "n" is not used in this function. Should we use a sliding 
window to implement this? For example:
    
    // Builds word n-gram shingles with a sliding window over the
    // whitespace-separated tokens of the line (assumes n >= 1).
    private List<String> computeNGramShingles(String line, int n) {
    
         List<String> result = new ArrayList<String>();
    
         // Circular queue holding the most recent n tokens.
         String[] circularQueue = new String[n];
         StringTokenizer st = new StringTokenizer(line);
    
         int index = 0;              // position of the oldest token in the queue
         int circularQueueSize = 0;  // number of tokens currently in the queue
    
         StringBuilder strBuf = new StringBuilder();
    
         while (st.hasMoreTokens()) {
             String token = st.nextToken();
             if (circularQueueSize == n)
             {
                 // Queue is full: emit the current window, then drop the oldest token.
                 strBuf.setLength(0);
                 for (int pn = 0; pn < n; pn++)
                 {
                     if (pn > 0)
                     {
                         strBuf.append(" ");
                     }
                     strBuf.append(circularQueue[(index + pn) % n]);
                 }
                 result.add(strBuf.toString());
                 index = (index + 1) % n;
                 circularQueueSize--;
             }
             // Append the new token at the tail of the queue.
             circularQueue[(index + circularQueueSize) % n] = token;
             if (circularQueueSize < n)
             {
                 circularQueueSize++;
             }
         }
    
         // Emit the last window, if the line had at least n tokens.
         if (circularQueueSize == n)
         {
             strBuf.setLength(0);
             for (int pn = 0; pn < n; pn++)
             {
                 if (pn > 0)
                 {
                     strBuf.append(" ");
                 }
                 strBuf.append(circularQueue[(index + pn) % n]);
             }
             result.add(strBuf.toString());
         }
    
         return result;
    }
    
    The complete test class: 
https://github.com/king821221/coding/blob/master/NGram.java
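
    For comparison, the same sliding-window idea can be sketched more compactly 
by splitting the line into a token array first. This is only an illustration 
under my own assumptions: the class name NGramSketch and the method name 
shingles are hypothetical and not part of the patch, and it splits on "\\s+" 
rather than using StringTokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration; not part of the patch under review.
class NGramSketch {

    // Returns every run of n consecutive whitespace-separated tokens,
    // joined by single spaces. The result is empty when the line has
    // fewer than n tokens.
    static List<String> shingles(String line, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> result = new ArrayList<String>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] tokens = trimmed.split("\\s+");
        // Slide a window of width n across the token array.
        for (int start = 0; start + n <= tokens.length; start++) {
            StringBuilder sb = new StringBuilder(tokens[start]);
            for (int i = 1; i < n; i++) {
                sb.append(' ').append(tokens[start + i]);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

    The trade-off is that this version holds all tokens of the line in memory, 
while the circular queue only ever keeps the last n tokens.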


- wang jian


On Aug. 26, 2014, 1:17 a.m., Mohammad Amin wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25049/
> -----------------------------------------------------------
> 
> (Updated Aug. 26, 2014, 1:17 a.m.)
> 
> 
> Review request for DataFu.
> 
> 
> Repository: datafu
> 
> 
> Description
> -------
> 
> DATAFU-67. Adding Simple SimHash to compute near duplicates.
> https://issues.apache.org/jira/browse/DATAFU-67
> 
> 
> Diffs
> -----
> 
>   datafu-pig/src/main/java/datafu/pig/hash/SimHash.java PRE-CREATION 
>   datafu-pig/src/test/java/datafu/test/pig/hash/HashTests.java 7ff8fb9 
> 
> Diff: https://reviews.apache.org/r/25049/diff/
> 
> 
> Testing
> -------
> 
> Unit tests passed.
> 
> 
> Thanks,
> 
> Mohammad Amin
> 
>
