-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25049/#review51486
-----------------------------------------------------------

datafu-pig/src/main/java/datafu/pig/hash/SimHash.java
<https://reviews.apache.org/r/25049/#comment89887>

    Could you please share the tutorial that describes the algorithm? Are there any other SimHash algorithms we could also support?


datafu-pig/src/main/java/datafu/pig/hash/SimHash.java
<https://reviews.apache.org/r/25049/#comment89842>

    It seems that only tri-grams are generated here rather than n-grams: the input parameter "n" is not used in this function. Should we use a sliding window to implement this? For example:

    // Needs java.util.ArrayList, java.util.List and java.util.StringTokenizer.
    private List<String> computeNGramShingles(String line, int n) {
      List<String> result = new ArrayList<String>();
      // Circular queue holding the most recent (up to n) tokens.
      String[] circularQueue = new String[n];
      StringTokenizer st = new StringTokenizer(line);
      int index = 0;              // position of the oldest token in the queue
      int circularQueueSize = 0;  // number of tokens currently in the queue
      StringBuilder strBuf = new StringBuilder();
      while (st.hasMoreTokens()) {
        String token = st.nextToken();
        if (circularQueueSize == n) {
          // Queue is full: emit the current window, then drop its oldest token.
          strBuf.setLength(0);
          for (int pn = 0; pn < n; pn++) {
            if (pn > 0) {
              strBuf.append(" ");
            }
            strBuf.append(circularQueue[(index + pn) % n]);
          }
          result.add(strBuf.toString());
          index = (index + 1) % n;
          circularQueueSize--;
        }
        circularQueue[(index + circularQueueSize) % n] = token;
        if (circularQueueSize < n) {
          circularQueueSize++;
        }
      }
      // Emit the last window if the input had at least n tokens.
      if (circularQueueSize == n) {
        strBuf.setLength(0);
        for (int pn = 0; pn < n; pn++) {
          if (pn > 0) {
            strBuf.append(" ");
          }
          strBuf.append(circularQueue[(index + pn) % n]);
        }
        result.add(strBuf.toString());
      }
      return result;
    }

    The complete test class:
    https://github.com/king821221/coding/blob/master/NGram.java


- wang jian


On Aug. 26, 2014, 1:17 a.m., Mohammad Amin wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/25049/
> -----------------------------------------------------------
>
> (Updated Aug. 26, 2014, 1:17 a.m.)
>
>
> Review request for DataFu.
>
>
> Repository: datafu
>
>
> Description
> -------
>
> DATAFU-67. Adding Simple SimHash to compute near duplicates.
> https://issues.apache.org/jira/browse/DATAFU-67
>
>
> Diffs
> -----
>
>   datafu-pig/src/main/java/datafu/pig/hash/SimHash.java PRE-CREATION
>   datafu-pig/src/test/java/datafu/test/pig/hash/HashTests.java 7ff8fb9
>
> Diff: https://reviews.apache.org/r/25049/diff/
>
>
> Testing
> -------
>
> Unit tests passed.
>
>
> Thanks,
>
> Mohammad Amin
>
>
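
For comparison, the sliding-window idea raised in the review comment on computeNGramShingles can also be sketched without a circular queue, by splitting the line into tokens first and joining each run of n consecutive tokens. This is only an illustrative sketch, not code from the patch under review: the class name NGramShingles and the method shingles are hypothetical, and it assumes whitespace-separated tokens and that holding the whole token array in memory is acceptable.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NGramShingles {
    // Hypothetical helper (not part of the patch): returns every run of
    // n consecutive whitespace-separated tokens, joined by single spaces.
    static List<String> shingles(String line, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<String> result = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return result; // no tokens, so no shingles
        }
        String[] tokens = trimmed.split("\\s+");
        // Slide a window of width n across the token array.
        for (int start = 0; start + n <= tokens.length; start++) {
            result.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [the quick brown, quick brown fox, brown fox jumps]
        System.out.println(shingles("the quick brown fox jumps", 3));
    }
}
```

The circular-queue version keeps only n tokens in memory at a time, which may matter for very long lines; the array version trades that for shorter code.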