Hi All,

I want to find near-duplicate items in a given dataset. For example, consider this data set:
1. Cricket,bat,ball,stumps
2. Cricket,bowler,ball,stumps
3. Football,goalie,midfielder,goal
4. Football,referee,midfielder,goal

Here 1 and 2 are near duplicates (only field 2 differs), and 3 and 4 are near duplicates (again, only field 2 differs).

This is what I did: I created an Article class and implemented the equals and hashCode methods (my hashCode method returns the constant 1 for all objects). In Spark I use Article as the key and do a groupByKey on it. Is this approach correct, or is there a better approach?

This is how my code looks.

Article Class

    public class Article implements Serializable {
        private static final long serialVersionUID = 1L;

        private String first;
        private String second;
        private String third;
        private String fourth;

        public Article() {
            set("", "", "", "");
        }

        public Article(String first, String second, String third, String fourth) {
            set(first, second, third, fourth);
        }

        @Override
        public int hashCode() {
            // Deliberately constant so hashing never separates two articles;
            // grouping is decided entirely by equals().
            return 1;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj)
                return true;
            if (obj == null || getClass() != obj.getClass())
                return false;
            Article other = (Article) obj;
            // Two articles count as near duplicates if any one field matches.
            return first.equals(other.first) || second.equals(other.second)
                    || third.equals(other.third) || fourth.equals(other.fourth);
        }

        private void set(String first, String second, String third, String fourth) {
            this.first = first;
            this.second = second;
            this.third = third;
            this.fourth = fourth;
        }
    }

Spark Code

    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount")
                .setMaster("local");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        JavaRDD<String> lines = ctx.textFile("data1/*");

        // (This RDD is built but not used further below.)
        JavaRDD<Article> articles = lines.map(new Function<String, Article>() {
            private static final long serialVersionUID = 1L;

            public Article call(String line) throws Exception {
                String[] words = line.split(",");
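One caveat worth noting about the equals method above: because it ORs the four field comparisons, it is not transitive (A can match B on one field and B match C on another while A and C share nothing), which breaks the equals/hashCode contract that groupByKey relies on, so the resulting groups can depend on comparison order. A minimal standalone sketch of the problem (class and helper names are mine, not from the original code):

```java
// Demonstrates that an OR-based field equality is not transitive:
// a matches b on field 1, b matches c on fields 2 and 4, yet a and c
// share no field at all.
public class EqualsTransitivityDemo {
    // Same logic as the OR-based Article.equals, on raw field arrays.
    public static boolean orEquals(String[] x, String[] y) {
        for (int i = 0; i < x.length; i++) {
            if (x[i].equals(y[i])) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String[] a = {"Cricket", "bat", "ball", "stumps"};
        String[] b = {"Cricket", "goalie", "net", "goal"};
        String[] c = {"Football", "goalie", "midfielder", "goal"};
        System.out.println(orEquals(a, b)); // true  (field 1 matches)
        System.out.println(orEquals(b, c)); // true  (fields 2 and 4 match)
        System.out.println(orEquals(a, c)); // false (no field matches)
    }
}
```

With a non-transitive equals, whether a and c land in the same group depends on whether b is compared in between, so the output of groupByKey is not well defined.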
                // System.out.println(line);
                Article article = new Article(words[0], words[1], words[2],
                        words[3]);
                return article;
            }
        });

        JavaPairRDD<Article, String> articlePair = lines
                .mapToPair(new PairFunction<String, Article, String>() {
                    public Tuple2<Article, String> call(String line) throws Exception {
                        String[] words = line.split(",");
                        // System.out.println(line);
                        Article article = new Article(words[0], words[1],
                                words[2], words[3]);
                        return new Tuple2<Article, String>(article, line);
                    }
                });

        JavaPairRDD<Article, Iterable<String>> articlePairs = articlePair
                .groupByKey();
        Map<Article, Iterable<String>> dupArticles = articlePairs
                .collectAsMap();
        System.out.println("size " + dupArticles.size());

        Set<Article> uniqueArticle = dupArticles.keySet();
        for (Article article : uniqueArticle) {
            Iterable<String> temps = dupArticles.get(article);
            System.out.println("keys " + article);
            for (String string : temps) {
                System.out.println(string);
            }
            System.out.println("==============");
        }

        ctx.close(); // close() also stops the context, so a separate stop() is redundant
    }
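One alternative to a constant hashCode (which funnels every record through a single hash bucket and defeats Spark's partitioning) is a blocking scheme: for each record, emit one key per combination of fields with one field dropped. Two records that agree on 3 of their 4 fields are then guaranteed to share at least one key, and a plain string key groups correctly without any custom equals. In Spark this would be a flatMapToPair over the lines followed by groupByKey on the generated keys. A minimal plain-Java sketch of the key generation and grouping on your sample data (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Blocking-key sketch: for each record, emit one key per "drop one field"
// combination. Records agreeing on 3 of 4 fields share at least one key,
// so grouping by these keys collects the near-duplicate candidates.
public class BlockingKeys {
    public static List<String> keysFor(String[] fields) {
        List<String> keys = new ArrayList<>();
        for (int drop = 0; drop < fields.length; drop++) {
            StringBuilder key = new StringBuilder();
            for (int i = 0; i < fields.length; i++) {
                if (i != drop) key.append(fields[i]).append('|');
            }
            // Prefix with the dropped position so keys from different
            // drop positions cannot collide.
            keys.add(drop + ":" + key);
        }
        return keys;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"Cricket", "bat", "ball", "stumps"},
            {"Cricket", "bowler", "ball", "stumps"},
            {"Football", "goalie", "midfielder", "goal"},
            {"Football", "referee", "midfielder", "goal"},
        };
        // In Spark this Map would be the result of flatMapToPair + groupByKey.
        Map<String, List<String>> groups = new HashMap<>();
        for (String[] r : records) {
            for (String k : keysFor(r)) {
                groups.computeIfAbsent(k, x -> new ArrayList<>())
                      .add(String.join(",", r));
            }
        }
        // Keys shared by more than one record mark the near duplicates.
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            if (e.getValue().size() > 1) {
                System.out.println(e.getKey() + " -> " + e.getValue());
            }
        }
    }
}
```

On the sample data this reports exactly two shared keys: records 1 and 2 meet under the key that drops field 2, and records 3 and 4 likewise. For higher-dimensional or fuzzier matching, MinHash/LSH is the usual generalization of this idea.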