Hi All,

I want to find near-duplicate items in a given dataset. For example, consider this data set:
1. Cricket,bat,ball,stumps
2. Cricket,bowler,ball,stumps
3. Football,goalie,midfielder,goal
4. Football,referee,midfielder,goal

Here 1 and 2 are near duplicates (only field 2 differs), and 3 and 4 are near duplicates (again, only field 2 differs).

This is what I did: I created an Article class and implemented the equals and hashCode methods (my hashCode method returns the constant 1 for all objects). In Spark I use Article as the key and do a groupByKey on it. Is this approach correct, or is there a better approach?

This is how my code looks.

Article Class

    public class Article implements Serializable {
        private static final long serialVersionUID = 1L;

        private String first;
        private String second;
        private String third;
        private String fourth;

        public Article() {
            set("", "", "", "");
        }

        public Article(String first, String second, String third, String fourth) {
            set(first, second, third, fourth);
        }

        @Override
        public int hashCode() {
            // Deliberately constant so hashing never separates two articles;
            // grouping is decided entirely by equals().
            return 1;
        }

        @Override
        public boolean equals(Object obj) {
            if (this == obj)
                return true;
            if (obj == null || getClass() != obj.getClass())
                return false;
            Article other = (Article) obj;
            // Two articles count as near duplicates if any one field matches.
            return first.equals(other.first) || second.equals(other.second)
                    || third.equals(other.third) || fourth.equals(other.fourth);
        }

        private void set(String first, String second, String third, String fourth) {
            this.first = first;
            this.second = second;
            this.third = third;
            this.fourth = fourth;
        }
    }

Spark Code

    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount")
                .setMaster("local");
        JavaSparkContext ctx = new JavaSparkContext(sparkConf);
        JavaRDD<String> lines = ctx.textFile("data1/*");

        // (This RDD is built but not used further below.)
        JavaRDD<Article> articles = lines.map(new Function<String, Article>() {
            private static final long serialVersionUID = 1L;

            public Article call(String line) throws Exception {
                String[] words = line.split(",");
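One caveat worth noting about the equals method above: because it ORs the four field comparisons, it is not transitive (A can match B on one field and B match C on another while A and C share nothing), which breaks the equals/hashCode contract that groupByKey relies on, so the resulting groups can depend on comparison order. A minimal standalone sketch of the problem (class and helper names are mine, not from the original code):

```java
// Demonstrates that an OR-based field equality is not transitive:
// a matches b on field 1, b matches c on fields 2 and 4, yet a and c
// share no field at all.
public class EqualsTransitivityDemo {
    // Same logic as the OR-based Article.equals, on raw field arrays.
    public static boolean orEquals(String[] x, String[] y) {
        for (int i = 0; i < x.length; i++) {
            if (x[i].equals(y[i])) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        String[] a = {"Cricket", "bat", "ball", "stumps"};
        String[] b = {"Cricket", "goalie", "net", "goal"};
        String[] c = {"Football", "goalie", "midfielder", "goal"};
        System.out.println(orEquals(a, b)); // true  (field 1 matches)
        System.out.println(orEquals(b, c)); // true  (fields 2 and 4 match)
        System.out.println(orEquals(a, c)); // false (no field matches)
    }
}
```

With a non-transitive equals, whether a and c land in the same group depends on whether b is compared in between, so the output of groupByKey is not well defined.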
                // System.out.println(line);
                Article article = new Article(words[0], words[1], words[2],
                        words[3]);
                return article;
            }
        });

        JavaPairRDD<Article, String> articlePair = lines
                .mapToPair(new PairFunction<String, Article, String>() {
                    public Tuple2<Article, String> call(String line) throws Exception {
                        String[] words = line.split(",");
                        // System.out.println(line);
                        Article article = new Article(words[0], words[1],
                                words[2], words[3]);
                        return new Tuple2<Article, String>(article, line);
                    }
                });

        JavaPairRDD<Article, Iterable<String>> articlePairs = articlePair
                .groupByKey();
        Map<Article, Iterable<String>> dupArticles = articlePairs
                .collectAsMap();
        System.out.println("size " + dupArticles.size());

        Set<Article> uniqueArticle = dupArticles.keySet();
        for (Article article : uniqueArticle) {
            Iterable<String> temps = dupArticles.get(article);
            System.out.println("keys " + article);
            for (String string : temps) {
                System.out.println(string);
            }
            System.out.println("==============");
        }

        ctx.close(); // close() also stops the context, so a separate stop() is redundant
    }
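One alternative to a constant hashCode (which funnels every record through a single hash bucket and defeats Spark's partitioning) is a blocking scheme: for each record, emit one key per combination of fields with one field dropped. Two records that agree on 3 of their 4 fields are then guaranteed to share at least one key, and a plain string key groups correctly without any custom equals. In Spark this would be a flatMapToPair over the lines followed by groupByKey on the generated keys. A minimal plain-Java sketch of the key generation and grouping on your sample data (class and method names are mine):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Blocking-key sketch: for each record, emit one key per "drop one field"
// combination. Records agreeing on 3 of 4 fields share at least one key,
// so grouping by these keys collects the near-duplicate candidates.
public class BlockingKeys {
    public static List<String> keysFor(String[] fields) {
        List<String> keys = new ArrayList<>();
        for (int drop = 0; drop < fields.length; drop++) {
            StringBuilder key = new StringBuilder();
            for (int i = 0; i < fields.length; i++) {
                if (i != drop) key.append(fields[i]).append('|');
            }
            // Prefix with the dropped position so keys from different
            // drop positions cannot collide.
            keys.add(drop + ":" + key);
        }
        return keys;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"Cricket", "bat", "ball", "stumps"},
            {"Cricket", "bowler", "ball", "stumps"},
            {"Football", "goalie", "midfielder", "goal"},
            {"Football", "referee", "midfielder", "goal"},
        };
        // In Spark this Map would be the result of flatMapToPair + groupByKey.
        Map<String, List<String>> groups = new HashMap<>();
        for (String[] r : records) {
            for (String k : keysFor(r)) {
                groups.computeIfAbsent(k, x -> new ArrayList<>())
                      .add(String.join(",", r));
            }
        }
        // Keys shared by more than one record mark the near duplicates.
        for (Map.Entry<String, List<String>> e : groups.entrySet()) {
            if (e.getValue().size() > 1) {
                System.out.println(e.getKey() + " -> " + e.getValue());
            }
        }
    }
}
```

On the sample data this reports exactly two shared keys: records 1 and 2 meet under the key that drops field 2, and records 3 and 4 likewise. For higher-dimensional or fuzzier matching, MinHash/LSH is the usual generalization of this idea.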