Correct. Trading away scalability for increased performance is not an option for the standard Spark API.
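To illustrate the point: in the standard API, sortBy is roughly keyBy followed by a range-partitioned sortByKey followed by values, so the sort stays distributed end to end. Here is a minimal Java sketch of what the sortBy call in the quoted code expands to (the class and method names are mine for illustration; the file path and the arr[24] key are taken from the quoted code):

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;

import scala.Tuple2;

public class SortByExpanded {
    // Roughly what rdd.sortBy(f, true, 9) does internally: key each line,
    // shuffle into 9 range partitions sorted by key, then drop the key.
    // No single machine ever has to hold the whole dataset.
    public static JavaRDD<String> sortByIncome(JavaSparkContext sc) {
        JavaRDD<String> rdd = sc.textFile("/data.txt", 32);
        JavaPairRDD<Double, String> keyed = rdd.mapToPair(
            new PairFunction<String, Double, String>() {
                @Override
                public Tuple2<Double, String> call(String line) {
                    // Same sort key as in the question: the 25th CSV field.
                    return new Tuple2<Double, String>(
                        Double.parseDouble(line.split(",")[24]), line);
                }
            });
        return keyed.sortByKey(true, 9).values();
    }
}

The shuffle is the price of scalability here; the reduce()-based approach avoids it only by assuming the merged result fits in one place.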
On Tue, Jun 9, 2015 at 3:05 AM, Daniel Darabos <daniel.dara...@lynxanalytics.com> wrote:

> It would be even faster to load the data on the driver and sort it there
> without using Spark :). Using reduce() is cheating, because it only works
> as long as the data fits on one machine. That is not the targeted use case
> of a distributed computation system. You can repeat your test with more
> data (that doesn't fit on one machine) to see what I mean.
>
> On Tue, Jun 9, 2015 at 8:30 AM, raggy <raghav0110...@gmail.com> wrote:
>
>> For a research project, I tried sorting the elements in an RDD using two
>> different approaches.
>>
>> In the first approach, I applied mapPartitions() to the RDD so that it
>> sorts the contents of each partition and produces a result RDD in which
>> each partition's sorted list is a single record. Then I applied a reduce
>> function that merges the sorted lists.
>>
>> I ran these experiments on an EC2 cluster containing 30 nodes, set up
>> using the spark-ec2 script. The data file was stored in HDFS.
>>
>> In the second approach I used Spark's sortBy method.
>>
>> I performed these operations on the US census data (100MB) found here.
>>
>> A single line looks like this:
>>
>> 9, Not in universe, 0, 0, Children, 0, Not in universe, Never married,
>> Not in universe or children, Not in universe, White, All other, Female,
>> Not in universe, Not in universe, Children or Armed Forces, 0, 0, 0,
>> Nonfiler, Not in universe, Not in universe, Child <18 never marr not in
>> subfamily, Child under 18 never married, 1758.14, Nonmover, Nonmover,
>> Nonmover, Yes, Not in universe, 0, Both parents present, United-States,
>> United-States, United-States, Native- Born in the United States, 0, Not
>> in universe, 0, 0, 94, - 50000.
>>
>> I sorted based on the 25th value in the CSV; in this line that is 1758.14.
>>
>> I noticed that sortBy performs worse than the other method. Is this
>> expected behavior? If it is, why wouldn't mapPartitions() and reduce()
>> be the default sorting approach?
>> Here is my implementation:
>>
>> import java.util.Collections;
>> import java.util.Iterator;
>> import java.util.LinkedList;
>>
>> import org.apache.spark.api.java.JavaRDD;
>> import org.apache.spark.api.java.JavaSparkContext;
>> import org.apache.spark.api.java.function.FlatMapFunction;
>> import org.apache.spark.api.java.function.Function;
>>
>> import scala.Tuple2;
>>
>> public static void sortBy(JavaSparkContext sc) {
>>     JavaRDD<String> rdd = sc.textFile("/data.txt", 32);
>>     long start = System.currentTimeMillis();
>>     rdd.sortBy(new Function<String, Double>() {
>>         @Override
>>         public Double call(String v1) throws Exception {
>>             // Sort key: the 25th CSV field, parsed as a double.
>>             String[] arr = v1.split(",");
>>             return Double.parseDouble(arr[24]);
>>         }
>>     }, true, 9).collect();
>>     long end = System.currentTimeMillis();
>>     System.out.println("SortBy: " + (end - start));
>> }
>>
>> public static void sortList(JavaSparkContext sc) {
>>     JavaRDD<String> rdd = sc.textFile("/data.txt", 32);
>>     long start = System.currentTimeMillis();
>>     JavaRDD<LinkedList<Tuple2<Double, String>>> rdd3 =
>>         rdd.mapPartitions(new FlatMapFunction<Iterator<String>,
>>                 LinkedList<Tuple2<Double, String>>>() {
>>             @Override
>>             public Iterable<LinkedList<Tuple2<Double, String>>> call(
>>                     Iterator<String> t) throws Exception {
>>                 // Collect this partition's lines, keyed by the 25th field.
>>                 LinkedList<Tuple2<Double, String>> lines =
>>                     new LinkedList<Tuple2<Double, String>>();
>>                 while (t.hasNext()) {
>>                     String s = t.next();
>>                     String[] arr1 = s.split(",");
>>                     lines.add(new Tuple2<Double, String>(
>>                         Double.parseDouble(arr1[24]), s));
>>                 }
>>                 // Sort the partition locally, then emit the whole sorted
>>                 // list as a single record.
>>                 Collections.sort(lines, new IncomeComparator());
>>                 LinkedList<LinkedList<Tuple2<Double, String>>> list =
>>                     new LinkedList<LinkedList<Tuple2<Double, String>>>();
>>                 list.add(lines);
>>                 return list;
>>             }
>>         });
>>     // The original message is cut off here; the reduce() merge step and
>>     // IncomeComparator are not shown (see the sketch below).
>> }
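
For completeness: the quoted message cuts off before the reduce step and the IncomeComparator it references. A minimal sketch of what they could look like follows; this is a hypothetical reconstruction consistent with the description in the question, not the original poster's code.

import java.io.Serializable;
import java.util.Comparator;
import java.util.LinkedList;

import org.apache.spark.api.java.function.Function2;

import scala.Tuple2;

// Hypothetical: orders (income, line) pairs by the income key.
class IncomeComparator
        implements Comparator<Tuple2<Double, String>>, Serializable {
    @Override
    public int compare(Tuple2<Double, String> a, Tuple2<Double, String> b) {
        return a._1().compareTo(b._1());
    }
}

// Hypothetical: merges two sorted lists into one sorted list, the classic
// merge step of merge sort. Destructive on its inputs for simplicity.
class MergeSortedLists implements
        Function2<LinkedList<Tuple2<Double, String>>,
                  LinkedList<Tuple2<Double, String>>,
                  LinkedList<Tuple2<Double, String>>> {
    @Override
    public LinkedList<Tuple2<Double, String>> call(
            LinkedList<Tuple2<Double, String>> a,
            LinkedList<Tuple2<Double, String>> b) {
        LinkedList<Tuple2<Double, String>> out =
            new LinkedList<Tuple2<Double, String>>();
        while (!a.isEmpty() && !b.isEmpty()) {
            out.add(a.getFirst()._1() <= b.getFirst()._1()
                    ? a.removeFirst() : b.removeFirst());
        }
        out.addAll(a);
        out.addAll(b);
        return out;
    }
}

Used as rdd3.reduce(new MergeSortedLists()), the final merges of the per-partition lists end up on the driver, which is exactly the single-machine bottleneck Daniel describes: it looks fast at 100MB but stops working once the sorted data no longer fits in one JVM.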