I'm implementing a join between two datasets A and B by a String key, which is the name attribute. I need to match similar names in this join.
Since I was already using a secondary sort so that the values from dataset A arrive before the values from dataset B, my first thought was to write a grouping comparator class that groups values with a string similarity algorithm instead of calling compareTo on the natural key. It has not worked as expected: names that match according to my algorithm are not grouped under the same key. See my code below.

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class StringSimilarityGroupingComparator extends WritableComparator {

        protected StringSimilarityGroupingComparator() {
            super(JoinKeyTagPairWritable.class, true);
        }

        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            JoinKeyTagPairWritable k1 = (JoinKeyTagPairWritable) w1;
            JoinKeyTagPairWritable k2 = (JoinKeyTagPairWritable) w2;
            StringSimilarityMatcher nameMatcher = new StringSimilarityMatcher(
                    StringSimilarityMatcher.NAME_MATCH);
            // Treat similar names as equal; otherwise fall back to the natural order.
            return nameMatcher.match(k1.getJoinKey(), k2.getJoinKey())
                    ? 0
                    : k1.getJoinKey().compareTo(k2.getJoinKey());
        }
    }

This approach makes total sense to me. Where was I mistaken? Isn't this the purpose of overriding the grouping comparator class?
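To narrow down what I might be running into, here is a minimal sketch without Hadoop. It assumes that on the reduce side the keys arrive already sorted by the sort comparator, and that the grouping comparator only compares each key with the key immediately before it. The class `AdjacentGroupingDemo`, the `similar` method (edit distance of at most 1, standing in for my `StringSimilarityMatcher`) and the sample names are all hypothetical, made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class AdjacentGroupingDemo {

    // Hypothetical stand-in for StringSimilarityMatcher: similar if edit distance <= 1.
    static boolean similar(String a, String b) {
        return editDistance(a, b) <= 1;
    }

    // Plain Levenshtein distance, computed row by row.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] cur = new int[b.length() + 1];
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            prev = cur;
        }
        return prev[b.length()];
    }

    // Assumed reduce-side behaviour: a new group starts whenever the current key
    // is not "equal" (here: not similar) to the key right before it.
    static List<List<String>> groupAdjacent(List<String> sortedKeys) {
        List<List<String>> groups = new ArrayList<>();
        String previous = null;
        for (String key : sortedKeys) {
            if (previous == null || !similar(previous, key)) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(key);
            previous = key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>(
                Arrays.asList("Jon Smith", "Johnny Doe", "John Smith"));
        Collections.sort(keys); // natural String order, like a compareTo-based sort
        System.out.println(groupAdjacent(keys));
        // prints [[John Smith], [Johnny Doe], [Jon Smith]]
    }
}
```

Under these assumptions "John Smith" and "Jon Smith" match each other, but "Johnny Doe" sorts between them, so they end up in separate groups. If that is how grouping really works, similar names would only be grouped when they happen to be adjacent in sort order (and land in the same partition), which may be what I am seeing.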