From: Saba Sehrish <ssehr...@fnal.gov<mailto:ssehr...@fnal.gov>>
Date: March 9, 2015 at 4:11:07 PM CDT
To: <user-...@spark.apache.org<mailto:user-...@spark.apache.org>>
Subject: Using top, takeOrdered, sortByKey

I am using spark for a template matching problem. We have 77 million events in 
the template library, and we compare energy of each of the input event with the 
each of the template event and return a score. In the end we return best 10000 
matches with lowest score. A score of 0 is a perfect match.

I down sampled the problem to use only 50k events. For a single event matching 
across all the events in the template (50k) I see 150-200ms for score 
calculation on 25 cores (using YARN cluster), but after that when I perform 
either a top or takeOrdered or even sortByKey the time reaches to 25-50s.
So far I am not able to figure out why such a huge gap going from a list of 
scores to a list of top 1000 scores and why sorting or getting best X matches 
is being dominant by a large factor. One thing I have noticed is that it 
doesn’t matter how many elements I return the time range is the same 25-50s for 
10 - 10000 elements.

Any suggestions? if I am not using API properly?

scores is JavaPairRDD<Integer, Double>, and I do something like
numbestmatches is 10, 100, 10000 or any number.

List <Tuple2<Integer, Double>> bestscores_list = 
scores.takeOrdered(numbestmatches, new TupleComparator());
Or
List <Tuple2<Integer, Double>> bestscores_list = scores.top(numbestmatches, new 
TupleComparator());
Or
List <Tuple2<Integer, Double>> bestscores_list = scores.sortByKey();

Reply via email to