I am not entirely sure I understand your question -- are you saying: * scoring a sample of 50k events is fast * taking the top N scores of 77M events is slow, no matter what N is
? if so, this shouldn't come as a huge surprise. You can't find the top scoring elements (no matter how small N is) unless you score all 77M of them. Very naively, you would expect scoring 77M events to take ~1000 times as long as scoring 50k events, right? The fact that it doesn't take that much longer is probably b/c of the overhead of just launching the jobs. On Mon, Mar 9, 2015 at 4:21 PM, Saba Sehrish <ssehr...@fnal.gov> wrote: > > > *From:* Saba Sehrish <ssehr...@fnal.gov> > *Date:* March 9, 2015 at 4:11:07 PM CDT > *To:* <user-...@spark.apache.org> > *Subject:* *Using top, takeOrdered, sortByKey* > > I am using spark for a template matching problem. We have 77 million > events in the template library, and we compare energy of each of the input > event with the each of the template event and return a score. In the end we > return best 10000 matches with lowest score. A score of 0 is a perfect > match. > > I down sampled the problem to use only 50k events. For a single event > matching across all the events in the template (50k) I see 150-200ms for > score calculation on 25 cores (using YARN cluster), but after that when I > perform either a top or takeOrdered or even sortByKey the time reaches to > 25-50s. > So far I am not able to figure out why such a huge gap going from a list > of scores to a list of top 1000 scores and why sorting or getting best X > matches is being dominant by a large factor. One thing I have noticed is > that it doesn’t matter how many elements I return the time range is the > same 25-50s for 10 - 10000 elements. > > Any suggestions? if I am not using API properly? > > scores is JavaPairRDD<Integer, Double>, and I do something like > numbestmatches is 10, 100, 10000 or any number. > > List <Tuple2<Integer, Double>> bestscores_list = > scores.takeOrdered(numbestmatches, new TupleComparator()); > Or > List <Tuple2<Integer, Double>> bestscores_list = > scores.top(numbestmatches, new TupleComparator()); > Or > List <Tuple2<Integer, Double>> bestscores_list = scores.sortByKey(); > >