I want to run groupBy(partition) on pairvendorData, but it is not working. In pairvendorData, the first element of each pair is repeated across several pairs, each time with a different second element. Both elements are objects (VendorRecord). Do I need to override equals and hashCode? And is groupBy fast enough?
    JavaPairRDD<VendorRecord, VendorRecord> pairvendorData = matchRdd.flatMapToPair(
        new PairFlatMapFunction<VendorRecord, VendorRecord, VendorRecord>() {
            @Override
            public Iterable<Tuple2<VendorRecord, VendorRecord>> call(VendorRecord t) throws Exception {
                List<Tuple2<VendorRecord, VendorRecord>> pairs =
                    new LinkedList<Tuple2<VendorRecord, VendorRecord>>();
                CompanyMatcherHelper helper = new CompanyMatcherHelper();
                MatcherKeys matchkeys = helper.getBlockinkeys(t);
                List<VendorRecord> Matchedrecords = ckdao.getMatchingRecordCknids(matchkeys);
                log.info("List Size is" + Matchedrecords.size());
                for (int i = 0; i < Matchedrecords.size(); i++) {
                    pairs.add(new Tuple2<VendorRecord, VendorRecord>(t, Matchedrecords.get(i)));
                }
                return pairs;
            }
        }
    );
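On the equals/hashCode part of the question: Spark partitions keys by hashCode and compares them with equals when it groups. A key class that inherits both from Object is therefore compared by identity, and every deserialized copy of the same record would end up in its own group. Below is a minimal sketch of a value-based key that runs without Spark. The fields vendorId and name are assumptions, since the real VendorRecord is not shown, and the local groupByKey helper is a hypothetical HashMap-based stand-in for the Spark operation, not Spark's API.

```java
import java.io.Serializable;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

public class VendorKeyDemo {

    // Hypothetical stand-in for VendorRecord; the field names are assumed.
    public static final class VendorRecord implements Serializable {
        private static final long serialVersionUID = 1L;
        private final String vendorId;
        private final String name;

        public VendorRecord(String vendorId, String name) {
            this.vendorId = vendorId;
            this.name = name;
        }

        // Value-based equality: two records with the same fields are the same key.
        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof VendorRecord)) return false;
            VendorRecord other = (VendorRecord) o;
            return Objects.equals(vendorId, other.vendorId)
                && Objects.equals(name, other.name);
        }

        // Must agree with equals: equal records produce equal hash codes.
        @Override
        public int hashCode() {
            return Objects.hash(vendorId, name);
        }
    }

    // Local imitation of grouping (key, value) pairs by key, using a hash map.
    public static Map<VendorRecord, List<VendorRecord>> groupByKey(
            List<Map.Entry<VendorRecord, VendorRecord>> pairs) {
        Map<VendorRecord, List<VendorRecord>> groups = new LinkedHashMap<>();
        for (Map.Entry<VendorRecord, VendorRecord> pair : pairs) {
            groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                  .add(pair.getValue());
        }
        return groups;
    }

    public static void main(String[] args) {
        // Two distinct instances with the same field values, as after deserialization.
        VendorRecord key1 = new VendorRecord("V1", "Acme");
        VendorRecord key2 = new VendorRecord("V1", "Acme");

        List<Map.Entry<VendorRecord, VendorRecord>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>(key1, new VendorRecord("M1", "Acme Inc")));
        pairs.add(new SimpleEntry<>(key2, new VendorRecord("M2", "Acme Ltd")));

        Map<VendorRecord, List<VendorRecord>> groups = groupByKey(pairs);
        // With equals/hashCode overridden, both pairs land in a single group.
        System.out.println(groups.size() + " group(s), "
            + groups.get(key1).size() + " match(es)");
    }
}
```

If equals and hashCode are removed from the sketch, the same input yields two groups of one match each, which is the symptom to look for in the Spark job. Grouping on a small stable key such as an id, rather than on the whole record, would also reduce the data shuffled, assuming the record has such a field.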