From: Saba Sehrish <ssehr...@fnal.gov>
Date: March 9, 2015 at 4:11:07 PM CDT
To: <user-...@spark.apache.org>
Subject: Using top, takeOrdered, sortByKey

I am using Spark for a template matching problem. We have 77 million events in the template library; we compare the energy of each input event with that of each template event and return a score. In the end we return the best 10000 matches, i.e. those with the lowest scores. A score of 0 is a perfect match.

I downsampled the problem to use only 50k events. For matching a single event against all the events in the template library (50k), I see 150-200 ms for the score calculation on 25 cores (using a YARN cluster), but when I then perform a top, a takeOrdered, or even a sortByKey, the time rises to 25-50 s.

So far I have not been able to figure out why there is such a huge gap in going from a list of scores to a list of the top 1000 scores, and why sorting or getting the best X matches dominates by such a large factor. One thing I have noticed is that it doesn’t matter how many elements I return: the time stays in the same 25-50 s range for anything from 10 to 10000 elements.

Any suggestions? Am I not using the API properly?

scores is a JavaPairRDD<Integer, Double>, and numbestmatches is 10, 100, 10000 or any other number. I do something like:

    List<Tuple2<Integer, Double>> bestscores_list = scores.takeOrdered(numbestmatches, new TupleComparator());

Or

    List<Tuple2<Integer, Double>> bestscores_list = scores.top(numbestmatches, new TupleComparator());

Or

    JavaPairRDD<Integer, Double> sorted_scores = scores.sortByKey();
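
For reference, the "best X matches with the lowest score" selection described above can be sketched in plain Java without Spark. This is only a minimal illustration of bounded-heap selection, not Spark's implementation and not the code from the post: the class name BestMatches, the method lowestScores, the comparator BY_SCORE, and the use of Map.Entry in place of Tuple2 are all assumptions made for the example, and BY_SCORE merely stands in for the TupleComparator, whose source is not shown.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class BestMatches {
    // Hypothetical stand-in for TupleComparator: orders (templateId, score)
    // pairs by ascending score, breaking ties by template id.
    static final Comparator<Map.Entry<Integer, Double>> BY_SCORE =
        Comparator.comparingDouble((Map.Entry<Integer, Double> e) -> e.getValue())
                  .thenComparingInt(e -> e.getKey());

    // Returns the n pairs with the lowest scores, best match first.
    static List<Map.Entry<Integer, Double>> lowestScores(
            Iterable<Map.Entry<Integer, Double>> scores, int n) {
        if (n <= 0) {
            return new ArrayList<>();
        }
        // Max-heap on score: the head is the worst candidate kept so far.
        PriorityQueue<Map.Entry<Integer, Double>> heap =
            new PriorityQueue<>(n, BY_SCORE.reversed());
        for (Map.Entry<Integer, Double> s : scores) {
            if (heap.size() < n) {
                heap.add(s);
            } else if (BY_SCORE.compare(s, heap.peek()) < 0) {
                // s beats the current worst candidate, so replace it.
                heap.poll();
                heap.add(s);
            }
        }
        List<Map.Entry<Integer, Double>> best = new ArrayList<>(heap);
        best.sort(BY_SCORE);
        return best;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, Double>> scores = new ArrayList<>();
        scores.add(new SimpleEntry<>(1, 4.5));
        scores.add(new SimpleEntry<>(2, 1.25));
        scores.add(new SimpleEntry<>(3, 0.0));
        scores.add(new SimpleEntry<>(4, 9.0));
        // Print the two best (lowest-score) matches.
        System.out.println(lowestScores(scores, 2));
    }
}
```

In this sketch the selection is a single pass over the scores that never holds more than n candidates, so no full sort of the input is needed.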